asteroid-team / torch-audiomentations

Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.
MIT License

BREAKING: add support for targets as discussed in #3 #123

Closed · hbredin closed this 2 years ago

hbredin commented 2 years ago

This is an initial attempt at addressing #3. The proposed API assumes the following shapes:

| Input | Target | num_frames | num_classes |
|-------|--------|------------|-------------|
| Audio | Class(es), fixed length, not correlated with the input length. Low priority. | 1 | 1 or more |
| Audio | Time series of class(es). | more than 1 | 1 or more |
| Audio | Audio (same length as the input). | num_samples | 1 |

and is used like this:

```python
augment = MixSamples()
augmented = augment(samples, targets=targets)
mixed_samples = augmented.samples
mixed_targets = augmented.targets
```
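For concreteness, here is a minimal sketch of the tensor shapes this assumes, i.e. samples as `(batch_size, num_channels, num_samples)` and targets as `(batch_size, num_channels, num_frames, num_classes)`; the channel dimension and all concrete sizes below are illustrative assumptions, not part of the proposal:

```python
import torch

# Illustrative values; the sample rate and sizes are arbitrary.
batch_size, num_channels, sample_rate = 4, 1, 16000
num_samples = 5 * sample_rate  # 5 seconds of audio

samples = torch.rand(batch_size, num_channels, num_samples)

# Case 1: one label per sample, e.g. 10 classes (num_frames = 1)
targets_fixed = torch.randint(0, 2, (batch_size, num_channels, 1, 10)).float()

# Case 2: a time series of labels, e.g. one frame every 10 ms (num_frames = 500)
targets_series = torch.randint(0, 2, (batch_size, num_channels, 500, 10)).float()

# Case 3: an audio target with the same length as the input (num_classes = 1)
targets_audio = torch.rand(batch_size, num_channels, num_samples, 1)
```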
hbredin commented 2 years ago

I have just added a Mix transform as an example.

iver56 commented 2 years ago

A good next step would be to have it return an ObjectDict (see https://github.com/asteroid-team/torch-audiomentations/pull/126) instead of a tensor or a tuple.

hbredin commented 2 years ago

Will do.

hbredin commented 2 years ago

What do you think of automatically inferring target_rate from num_samples = samples.shape[2], sample_rate, and num_frames = targets.shape[2]?

```python
target_rate = sample_rate * num_frames / num_samples
```

with the num_frames = 1 corner case where target_rate does not even exist (one label for the whole sample).
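A minimal sketch of that inference, assuming the shapes above and treating num_frames = 1 as "no target_rate"; the helper name is hypothetical:

```python
def infer_target_rate(samples, targets, sample_rate):
    # Assumes samples: (batch, channels, num_samples) and targets: (batch, channels, num_frames, num_classes)
    num_samples = samples.shape[2]
    num_frames = targets.shape[2]
    if num_frames == 1:
        # One label for the whole sample: target_rate is undefined.
        return None
    return sample_rate * num_frames / num_samples
```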

iver56 commented 2 years ago

I would prefer to have it explicitly provided by the user. Here's an example that explains why:

YAMNet has a rate of 2.0833 Hz (one classification output for every 480 ms step). In this example the rate cannot be accurately inferred, especially if the audio is short, because e.g. 1.1 seconds and 1.3 seconds of audio will give the same number of frames.
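To make the ambiguity concrete, here is a quick back-of-the-envelope check (assuming, for simplicity, one output frame per full 480 ms step):

```python
sample_rate = 16000
hop_seconds = 0.48  # YAMNet step size

for duration in (1.1, 1.3):
    num_samples = int(duration * sample_rate)
    num_frames = int(duration / hop_seconds)  # 2 frames in both cases
    inferred_rate = sample_rate * num_frames / num_samples
    print(f"{duration} s -> {num_frames} frames, inferred rate {inferred_rate:.4f} Hz (true rate ~2.0833 Hz)")
```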

hbredin commented 2 years ago

> I would prefer to have it explicitly provided by the user. Here's an example that explains why:
>
> YAMNet has a rate of 2.0833 Hz (one classification output for every 480 ms step). In this example the rate cannot be accurately inferred, especially if the audio is short, because e.g. 1.1 seconds and 1.3 seconds of audio will give the same number of frames.

Noted.

When target_rate is required but missing, can we still do our best to infer it automatically and warn the user about it?

```python
warnings.warn(
    f"target_rate is required by {self.__class__.__name__}. "
    f"It has been automatically inferred from the targets shape as {inferred_target_rate}. "
    f"If this is incorrect, please use target_rate to pass it directly."
)
```
iver56 commented 2 years ago

Yes, I'm okay with that

hbredin commented 2 years ago

Updated with the API discussed in Slack.

iver56 commented 2 years ago

Thanks :) I'll have a look soon-ish. Thanks for your patience.

iver56 commented 2 years ago

This is starting to look quite good already!

Here's what I think needs to be done:

Before it gets merged:

* Mix deserves a unit test or two. Can you make that happen @hbredin?

After it gets merged:

* Add an output_format parameter that defaults to the old-style API, so we can release at least one version with backwards-compatible API (and then have the new API style be opt-in)
* Have a demo script that processes targets as well as the usual inputs. Maybe also document some examples of how to use the targets feature
* Update changelog

hbredin commented 2 years ago

> Before it gets merged:
>
> * Mix deserves a unit test or two. Can you make that happen @hbredin?

I definitely can (and will) but I am having trouble thinking about what to test.

I can do the equivalent of AddBackgroundNoise's test_varying_snr_within_batch. Can you think of a test focusing on targets?

iver56 commented 2 years ago

> I definitely can (and will) but I am having trouble thinking about what to test.
>
> I can do the equivalent of AddBackgroundNoise's test_varying_snr_within_batch. Can you think of a test focusing on targets?

At the very least I like to have the shape and the dtype of the output tested. And that the output isn't the same as the input. But of course, smart test assertions are useful 👍
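For example, a hedged sketch of such a test; the Mix constructor arguments, keyword names, and shape convention below are assumptions for illustration rather than the final API:

```python
import torch
from torch_audiomentations import Mix  # assumes the Mix transform added in this PR


def test_mix_shape_dtype_and_change():
    sample_rate = 16000
    samples = torch.rand(4, 1, 2 * sample_rate, dtype=torch.float32)
    targets = torch.randint(0, 2, (4, 1, 1, 3)).float()  # (batch, channels, num_frames, num_classes)

    augment = Mix(p=1.0)  # p=1.0 so the transform is always applied
    augmented = augment(samples, sample_rate=sample_rate, targets=targets)

    # Shape and dtype are preserved for both samples and targets...
    assert augmented.samples.shape == samples.shape
    assert augmented.samples.dtype == samples.dtype
    assert augmented.targets.shape == targets.shape
    # ...and the output is not identical to the input.
    assert not torch.equal(augmented.samples, samples)
```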

iver56 commented 2 years ago

Thanks for the contribution 🚀

hbredin commented 2 years ago

Thanks for merging this!

> After it gets merged:
>
> * Add an output_format parameter that defaults to the old-style API, so we can release at least one version with backwards-compatible API (and then have the new API style be opt-in)

I can take care of it but can you please clarify the API?

> * Have a demo script that processes targets as well as the usual inputs. Maybe also document some examples of how to use the targets feature

I can easily create a notebook showing how I use this new feature in pyannote.audio. Would that be OK or do you want this demo to be 100% torch_audiomentations?

> * Update changelog

I suggest you do that :)

iver56 commented 2 years ago

> I can take care of it but can you please clarify the API?

Great! Hmm, how about something like this, for example:

```python
transform = PolarityInversion()
transform(my_audio)  # outputs a tensor, as before. Also emits a FutureWarning with info on
                     # the new output_type arg and a recommendation on transitioning to object_dict

transform = PolarityInversion(output_type="tensor")
transform(my_audio)  # outputs a tensor, as before. Maybe also emit a deprecation warning?

transform = PolarityInversion(output_type="object_dict")
transform(my_audio)  # outputs an ObjectDict
```
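And a hedged sketch of how that backwards-compatible switch could look inside the base transform; the function name and warning wording are hypothetical, purely for illustration:

```python
import warnings


def _return_output(object_dict, output_type):
    if output_type is None:
        # Old-style default: return a plain tensor, but nudge users toward the new API.
        warnings.warn(
            "Transforms will return an ObjectDict by default in a future release; "
            "pass output_type='object_dict' to opt in now, or output_type='tensor' "
            "to silence this warning.",
            FutureWarning,
        )
        return object_dict.samples
    if output_type == "tensor":
        return object_dict.samples
    if output_type == "object_dict":
        return object_dict
    raise ValueError(f"Unknown output_type: {output_type}")
```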

> I can easily create a notebook showing how I use this new feature in pyannote.audio. Would that be OK or do you want this demo to be 100% torch_audiomentations?

Sounds good to me :) If the notebook is large, consider using Git LFS.