iver56 opened this issue 4 years ago
Nice! This is definitely THE feature that would make me switch to torch-audiomentations, and more specifically the one with target "Time series of class(es)", which would come in very handy for my work on speaker diarization.
We should make examples of what the API should look like in each scenario
Yeah, I agree that this would be super cool, but what would it look like? @hbredin do you have API ideas in mind?
The last one, "Audio (length same as the input)", is already supported by doing this:

```python
augment = ExampleTransform(...)
augment(samples, sample_rate)
augment.freeze_parameters()  # reuse the same random parameters
augment(other_samples, sample_rate)
augment.unfreeze_parameters()
```
We can refine the API for this scenario later
I am currently working on an augmentation technique that basically sums two samples selected randomly from within a batch, for training end-to-end speaker diarization models. This falls under the "Time series of class(es)" scenario.
What I am missing is a way to keep track of targets (in my case, frame-wise speaker activations). The API would look like this:
```python
augment = SumTwoSamples(min_snr_in_db=0.0, max_snr_in_db=5.0)
augmented_samples, augmented_targets = augment(
    samples, sample_rate, targets=targets, target_rate=target_rate
)
```
where:

- `samples` has shape (batch_size, [num_channels,] num_samples)
- `targets` has shape (batch_size, [num_channels,] num_frames, num_classes)
- `augmented_samples` has shape (new_batch_size, [num_channels,] num_samples)
- `augmented_targets` has shape (new_batch_size, [num_channels,] num_frames, num_classes)

I am trying to somehow bend `BaseWaveformTransform` so that it accepts these new `target*` optional arguments, but it would probably make sense to create a new dedicated base class (e.g. `BaseWaveformTransformWithTargets`) for that purpose.
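To make the proposal concrete, here is a minimal sketch of what a `SumTwoSamples`-like function could do, written as a standalone function rather than the actual transform class. The SNR-based gain computation and the target-union rule are my assumptions, and the sketch keeps the batch size unchanged (pairing each sample with a random partner) even though the proposed API allows `new_batch_size` to differ:

```python
import torch


def sum_two_samples(samples, targets, min_snr_in_db=0.0, max_snr_in_db=5.0):
    """Hypothetical sketch: mix each sample with another sample drawn
    randomly from the same batch, and combine the frame-wise targets.

    samples: (batch_size, num_channels, num_samples)
    targets: (batch_size, num_channels, num_frames, num_classes)
    """
    batch_size = samples.shape[0]
    # Pick a random partner for each sample in the batch
    perm = torch.randperm(batch_size)
    partners = samples[perm]

    # Scale each partner signal to a random SNR relative to the original
    snr_db = torch.empty(batch_size).uniform_(min_snr_in_db, max_snr_in_db)
    signal_rms = samples.square().mean(dim=(-2, -1), keepdim=True).sqrt()
    partner_rms = partners.square().mean(dim=(-2, -1), keepdim=True).sqrt()
    desired_rms = signal_rms / (10 ** (snr_db.view(-1, 1, 1) / 20))
    gain = desired_rms / (partner_rms + 1e-8)

    augmented_samples = samples + gain * partners
    # A speaker is active in the mix if active in either source
    augmented_targets = torch.maximum(targets, targets[perm])
    return augmented_samples, augmented_targets
```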
In particular, it is not clear to me what `per_samples`, `per_channel`, and `per_batch` would mean for such transforms.
What are your thoughts?
cc @FrenchKrab
Interesting!
I get a feeling that it could blow up in complexity if we want to cover all use cases in one class. Maybe it makes sense to make separate "variants" of each transform? Or should we try to handle the various cases in one class?
I had a quick look at your draft PR, and my first thought was that it introduces breaking changes in all transforms (they return a tuple instead of a tensor), which in turn led me to check out how computer vision data augmentation libs do it. For example, albumentations can process a target (image/mask, bounding boxes or set of points) in the same way as the input image. In albumentations each transform returns a dictionary, which leaves some flexibility in what is included in the returned value. Maybe that's a useful pattern to adopt here?
```python
# basic example without target
transformed = transform(image=image)
transformed_image = transformed["image"]

# basic example with a target that is a mask image
transformed = transform(image=image, mask=mask)
transformed_image = transformed["image"]
transformed_mask = transformed["mask"]
```
> In particular, it is not clear to me what per_samples, per_channel, per_batch would mean for such transforms.

Could you elaborate on what feels unclear?
Anyway, I think it would make sense to first design the API, then implement it for a simple transform, like Shift or TimeInversion. Also, if we're going to make backwards-incompatible (breaking) changes, it would be nice to release a version with future warnings first, even though we're still in alpha 🤓
> I get a feeling that it could blow up in complexity if we want to cover all use cases in one class. Maybe it makes sense to make separate "variants" of each transform? Or should we try to handle the various cases in one class?

We should probably start by listing the types of targets used in audio. I personally can't think of any that cannot be shaped as one of the three you mentioned in the original issue.
> I had a quick look at your draft PR, and my first thought was that it introduces breaking changes in all transforms (they return a tuple instead of a tensor),

Actually, I made sure not to break the user-facing API -- only the internal methods `apply_transform` and `randomize_parameters` have changed. I only had to change a few unit tests that were calling `apply_transform` directly. Any tests that used the `augment(samples, sample_rate)` API passed.

It only returns a tuple when the user passes the additional `target=...` argument.
> which then again led me to check out how computer vision data augmentation libs do it. For example, albumentations has the power to process a target (image/mask, bounding boxes or set of points) in the same way as the input image. In albumentations each transform returns a dictionary, which leaves some flexibility in what is included in the returned value. Maybe that's a useful pattern to adopt here?

It would also mean breaking the API, right? I really like the current simple `perturbed_samples = augment(samples)`.
> In particular, it is not clear to me what per_samples, per_channel, per_batch would mean for such transforms.
>
> Could you elaborate on what feels unclear?
I was referring to the original transform that made me think about this PR: `SumTwoSamples`. What would `per_batch` mean for such a transform? It would return `samples` unchanged, right?
> Anyway, I think it would make sense to first design the API, then implement it for a simple transform, like Shift or TimeInversion. Also, if we're going to make backwards-incompatible (breaking) changes, it would be nice to release a version with future warnings first, even though we're still in alpha 🤓

As stated a few lines above, PR #123 does not make breaking changes, as in: "code that relies on the documented user-facing API will still work". Custom augmentations would need to be updated, though -- unless we add a `supports_target` attribute that defaults to False and make sure to honor it in `BaseWaveformTransform.forward`.
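A minimal sketch of that backward-compatibility idea, assuming a simplified base class (the names below are illustrative, not the actual torch-audiomentations internals): transforms that don't declare `supports_target` keep working unchanged, and a tuple is only returned when targets are passed in.

```python
import torch


class BaseWaveformTransformSketch(torch.nn.Module):
    """Illustrative base class honoring a supports_target attribute."""

    supports_target = False  # existing custom transforms keep working

    def forward(self, samples, sample_rate=None, targets=None, target_rate=None):
        if targets is not None and not self.supports_target:
            raise ValueError(
                f"{self.__class__.__name__} does not support targets"
            )
        transformed = self.apply_transform(samples)
        if targets is None:
            return transformed  # old API: plain tensor
        # new API: return the transformed targets alongside the samples
        return transformed, self.apply_to_targets(targets)

    def apply_transform(self, samples):
        return samples

    def apply_to_targets(self, targets):
        return targets


class TimeInversionSketch(BaseWaveformTransformSketch):
    supports_target = True

    def apply_transform(self, samples):
        # reverse along the time (samples) axis
        return torch.flip(samples, dims=[-1])

    def apply_to_targets(self, targets):
        # reverse along the frames axis of (batch, channels, frames, classes)
        return torch.flip(targets, dims=[-2])
```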
PR #123 does implement `target` support for `TimeInversion`.
> We should probably start by listing the type of targets used in audio. I personally can't think of any that cannot be shaped as one of the three you mentioned in the original issue.
Agreed. So it comes down to the shape of the time series data then. I don't think I have any better suggestion than the shape you suggested in your PR.
> Actually, I made sure not to break the user facing API
Good! I haven't looked at it in detail yet ^^ I might have time for that on Friday. Thanks for your patience :)
> It only returns a tuple when the user passes the additional target=... argument.
So the output type is either tensor or tuple, based on the parameters given? I'm not sure if this is a good pattern. Do you know of other python libraries that do it like that? Usually functions give the same output type (or sometimes None) regardless of the inputs.
It seems imgaug uses this pattern (returns a tuple if a target is provided):
```python
seq = iaa.Sequential([
    iaa.Crop(px=(0, 16)),  # crop images from each side by 0 to 16px (randomly chosen)
    iaa.Fliplr(0.5),  # horizontally flip 50% of the images
    iaa.GaussianBlur(sigma=(0, 3.0))  # blur images with a sigma of 0 to 3.0
])

for batch_idx in range(1000):
    # 'images' should be either a 4D numpy array of shape (N, height, width, channels)
    # or a list of 3D numpy arrays, each having shape (height, width, channels).
    # Grayscale images must have shape (height, width, 1) each.
    # All images must have numpy's dtype uint8. Values are expected to be in
    # range 0-255.
    images = load_batch(batch_idx)
    images_aug = seq(images=images)
```

```python
seq = iaa.Sequential([
    iaa.Multiply((1.2, 1.5)),  # change brightness, doesn't affect BBs
    iaa.Affine(
        translate_px={"x": 40, "y": 60},
        scale=(0.5, 0.7)
    )  # translate by 40/60px on x/y axis, and scale to 50-70%, affects BBs
])

# Augment BBs and images.
image_aug, bbs_aug = seq(image=image, bounding_boxes=bbs)
```
albumentations, on the other hand, always returns a dict.
Maybe we should post a vote on those two options in Slack?
Inspired by albumentations - 9.9k stars, actively maintained
```python
# Example without target
transform = Shift()
transformed_audio = transform(my_audio, sample_rate)["audio"]

# Example with target
transform = Shift()
transformed = transform(my_audio, sample_rate, target=my_target, target_rate=my_target_rate)
transformed_audio = transformed["audio"]
transformed_target = transformed["target"]
```
Inspired by imgaug - 12.4k stars, abandoned project
```python
# Example without target
transform = Shift()
transformed_audio = transform(my_audio, sample_rate)

# Example with target
transform = Shift()
transformed_audio, transformed_target = transform(
    my_audio, sample_rate, target=my_target, target_rate=my_target_rate
)
```
Voted!
I posted it in the asteroid and thesoundofai Slack workspaces. At the moment all 5 votes are in favor of dict. @mpariente suggested that we add support for attribute access.
If I should conclude early, I think it's best to return an ObjectDict instead of a tuple or a tensor. ObjectDict behaves like a dict, but also supports access by attribute, so the two following lines would give the same result:
```python
hello = my_object_dict.my_property
hello = my_object_dict["my_property"]
```
Adding a PR for that here
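For illustration, such an `ObjectDict` can be implemented in a few lines; this is a minimal sketch, and the actual implementation in the linked PR may differ:

```python
class ObjectDict(dict):
    """A dict that also allows attribute-style access to its keys."""

    def __getattr__(self, name):
        try:
            return self[name]
        except KeyError:
            # mirror normal attribute-lookup failure semantics
            raise AttributeError(name)

    def __setattr__(self, name, value):
        # attribute assignment writes through to the dict
        self[name] = value


result = ObjectDict(samples=[1, 2, 3], targets=[0, 1])
assert result.samples == result["samples"]
```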
The idea is that whatever perturbations are applied to the input are also reflected in the target data.
Scenarios: