I have just added a `Mix` transform as an example.
A good next step would be to have it return an `ObjectDict` (see https://github.com/asteroid-team/torch-audiomentations/pull/126) instead of a tensor or tuple.
Will do.
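For context, `ObjectDict` (from #126) is a dict whose keys are also accessible as attributes. Here is a minimal sketch of the intended return value; the import path and field names are assumptions, not confirmed by this thread:

```python
import torch
from torch_audiomentations.utils.object_dict import ObjectDict  # assumed import path

# A transform would return something like this instead of a bare tensor/tuple:
output = ObjectDict(samples=torch.rand(4, 1, 16000), sample_rate=16000)
assert output.samples is output["samples"]  # keys double as attributes
```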
What do you think of automatically inferring `target_rate` from `num_samples = samples.shape[2]`, `sample_rate`, and `num_frames = target.shape[2]`?

`target_rate = sample_rate * num_frames / num_samples`

with the `num_frames == 1` corner case where `target_rate` does not even exist (one label for the whole sample).
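For concreteness, a minimal sketch of that inference (the helper name `infer_target_rate` is illustrative, not part of the library):

```python
def infer_target_rate(samples, targets, sample_rate):
    """Infer target_rate from tensor shapes; returns None in the
    one-label-per-sample corner case (num_frames == 1)."""
    num_samples = samples.shape[2]  # samples: (batch, channels, num_samples)
    num_frames = targets.shape[2]   # targets: (batch, channels, num_frames, num_classes)
    if num_frames == 1:
        return None  # a single label covers the whole sample; no rate exists
    return sample_rate * num_frames / num_samples

# e.g. 16000 Hz audio, 16000 samples (1 s), 100 target frames -> 100.0 Hz
```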
I would prefer to have it explicitly provided by the user. Here's an example that explains why:

Yamnet has a rate of 2.0833 Hz (one classification output for every 480 ms step). In this case the rate cannot be accurately inferred, especially if the audio is short, because e.g. 1.1 seconds and 1.3 seconds of audio will give the same number of frames.
Noted.
When `target_rate` is required but missing, can we still do our best to infer it automatically and warn the user about it?
```python
import warnings

warnings.warn(
    f"target_rate is required by {self.__class__.__name__}. "
    f"It has been automatically inferred from the targets shape to be {inferred_target_rate}. "
    f"If this is incorrect, please use target_rate to pass it directly."
)
```
Yes, I'm okay with that
Updated with the API discussed in Slack.
Thanks :) I'll have a look soon-ish. Thanks for your patience.
This is starting to look quite good already!
Here's what I think needs to be done:
Before it gets merged:
* Mix deserves a unit test or two. Can you make that happen @hbredin?
I definitely can (and will) but I am having trouble thinking about what to test. I can do the equivalent of `AddBackgroundNoise`'s `test_varying_snr_within_batch`. Can you think of a test focusing on targets?
At the very least I'd like to have the shape and the dtype of the output tested, and that the output isn't the same as the input. But of course, smart test assertions are useful 👍
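For example, a sketch of such a test, assuming `Mix` is exported at the package level and follows the usual transform calling convention (the constructor arguments are illustrative):

```python
import torch
from torch_audiomentations import Mix  # assumes package-level export

def test_mix_output_shape_and_dtype():
    samples = torch.rand(4, 1, 16000, dtype=torch.float32)  # (batch, channels, samples)
    transform = Mix(p=1.0)  # p=1.0 so the transform is always applied
    mixed = transform(samples=samples, sample_rate=16000)
    assert mixed.shape == samples.shape
    assert mixed.dtype == samples.dtype
    # the mixed audio should differ from the input
    assert not torch.equal(mixed, samples)
```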
Thanks for the contribution 🚀
Thanks for merging this!
After it gets merged:
* Add an output_format parameter that defaults to the old-style API, so we can release at least one version with backwards-compatible API (and then have the new API style be opt-in)
I can take care of it but can you please clarify the API?
* Have a demo script that processes targets as well as the usual inputs. Maybe also document some examples of how to use the targets feature
I can easily create a notebook showing how I use this new feature in pyannote.audio. Would that be OK or do you want this demo to be 100% torch_audiomentations?
* Update changelog
I suggest you do that :)
> I can take care of it but can you please clarify the API?
Great! Hmm, how about something like this, for example:
```python
transform = PolarityInversion()
transform(my_audio)  # outputs a tensor, as before. Also emits a FutureWarning with info on the new output_type arg and a recommendation on transitioning to object_dict

transform = PolarityInversion(output_type="tensor")
transform(my_audio)  # outputs a tensor, as before. Maybe also emit a deprecation warning?

transform = PolarityInversion(output_type="object_dict")
transform(my_audio)  # outputs an ObjectDict
```
> I can easily create a notebook showing how I use this new feature in pyannote.audio. Would that be OK or do you want this demo to be 100% torch_audiomentations?
Sounds good to me :) If the notebook is large, consider using git LFS
This is an initial attempt at addressing #3. The proposed API assumes the following shapes:

* samples: `(batch_size, num_channels, num_samples)`
* targets: `(batch_size, num_channels, num_frames, num_classes)`

where `num_samples` is the number of audio samples, and `num_frames` and `num_classes` are the number of target frames and classes respectively, and is used like this:
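Here is a minimal sketch, assuming the `Mix` transform and the `samples`/`sample_rate`/`targets`/`target_rate` keyword arguments discussed in this thread (the exact constructor arguments and values are illustrative):

```python
import torch
from torch_audiomentations import Mix  # assumes package-level export

batch_size, num_channels, num_samples = 4, 1, 16000
num_frames, num_classes = 100, 2

samples = torch.rand(batch_size, num_channels, num_samples)
targets = torch.randint(0, 2, (batch_size, num_channels, num_frames, num_classes)).float()

transform = Mix(p=1.0, output_type="object_dict")
output = transform(samples=samples, sample_rate=16000, targets=targets, target_rate=100)

output.samples  # (batch_size, num_channels, num_samples)
output.targets  # (batch_size, num_channels, num_frames, num_classes)
```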