asteroid-team / torch-audiomentations

Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.
MIT License

Add PitchShift operation #82

Closed · KentoNishi closed this 3 years ago

KentoNishi commented 3 years ago

This PR is still a work in progress, but here is the gist of it:

The library I made is still undocumented, so I don't want to make it public just yet. If you want to verify that the code is not malicious, let me know so I can add you to the repo!

KentoNishi commented 3 years ago

It's in a working and semi-usable state!

Input

# TODO: WRITE SOME REAL TESTS

import time

import torch
from torch_audiomentations import Compose, PitchShift

# Initialize augmentation callable
apply_augmentation = Compose(
    transforms=[
        PitchShift(16000, p=1),
    ]
)

torch_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Make an example tensor with white noise.
# This tensor represents 8 audio snippets with 2 channels (stereo) and 2 s of 16 kHz audio.
audio_samples = (
    torch.rand(size=(8, 2, 32000), dtype=torch.float32, device=torch_device) - 0.5
)

# Apply augmentation. This pitch-shifts (some of) the audio snippets
# in the batch independently.
start = time.process_time()
perturbed_audio_samples = apply_augmentation(audio_samples, sample_rate=16000)
print(time.process_time() - start)

Output

1.53125

Note: the code was run on a GPU!
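As a side note, CUDA kernels are launched asynchronously, so a CPU-side clock such as time.process_time() may not capture the GPU work. A minimal timing sketch (reusing apply_augmentation and audio_samples from the snippet above, and assuming a CUDA device is available) would synchronize before reading the clock:

import time
import torch

# Synchronize before and after so the wall-clock measurement includes the GPU work.
torch.cuda.synchronize()
start = time.perf_counter()
perturbed_audio_samples = apply_augmentation(audio_samples, sample_rate=16000)
torch.cuda.synchronize()
print(time.perf_counter() - start)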

KentoNishi commented 3 years ago

I haven't implemented any tests, I'll leave that up to you guys :)

Please let me know what you think of what I have so far!

iver56 commented 3 years ago

Thanks for making this, and thank you for your patience. I'll have a look at this soon 👍

KentoNishi commented 3 years ago

I tried using it in my own project, and it seems like there's a memory leak somewhere? VRAM usage keeps increasing when I include pitch shift. Will investigate.
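One way to narrow this down (an illustrative loop only, reusing apply_augmentation and audio_samples from the earlier snippet) is to watch allocated GPU memory across repeated calls; a steadily growing number suggests tensors are being retained somewhere:

import torch

for step in range(100):
    perturbed = apply_augmentation(audio_samples, sample_rate=16000)
    if step % 10 == 0:
        # torch.cuda.memory_allocated() reports bytes currently held by tensors on the GPU.
        print(f"step {step}: {torch.cuda.memory_allocated() / 1e6:.1f} MB allocated")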

iver56 commented 3 years ago

Interesting. The memory leak should ideally be fixed before we merge this :)

KentoNishi commented 3 years ago

Thanks for the reviews, will take a look soon :)

KentoNishi commented 3 years ago

Alright gonna get some sleep now, good night 😴

KentoNishi commented 3 years ago

Adding Batched_Pitch_Shift as a separate class might be a good idea. Currently working on batched shifting in the library itself :)

KentoNishi commented 3 years ago

Implementation of batched transforms is done in the library! Will update the fork when I have time later.

KentoNishi commented 3 years ago

[screenshot: benchmark timings]

Timed in seconds. Really liking how fast it runs! This is with a batch of 8 samples at sr=16000, each clip 2 seconds long.

KentoNishi commented 3 years ago

@iver56 what's new:

iver56 commented 3 years ago

Thanks for the improvements 👍 I will re-review this soon-ish (my availability is a bit limited, thanks for your patience)

iver56 commented 3 years ago

I would like "per_example" to be the default mode. Although "per_batch" is faster, variation within each batch is typically a good idea when training models :)

iver56 commented 3 years ago

The other transforms have "per_example" as the default mode too
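For illustration, here is a sketch of what that default would look like from the user's side, assuming PitchShift accepts the same mode keyword as the other transforms (the exact signature here is an assumption, not the merged API):

from torch_audiomentations import Compose, PitchShift

apply_augmentation = Compose(
    transforms=[
        # mode="per_example" draws an independent pitch shift for each item in the batch.
        PitchShift(sample_rate=16000, p=1.0, mode="per_example"),
    ]
)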

iver56 commented 3 years ago

Something happened to the commits after the force push - I can't see the commits in the pull request anymore

iver56 commented 3 years ago

I think I would also prefer somewhat more modest default parameters - pitch shifting a whole octave up or down is a bit extreme. In audiomentations the default is -4 to +4 semitones. -4 semitones is down a third of an octave, and +4 semitones is up a third of an octave. This default would give a range of two thirds of an octave.

In audiomentations, the pitch shifting parameters are input as semitones. Could that be relevant here too? I personally find it easier to relate to the numbers when they are given in semitones (e.g. -12 and +12) instead of fractions (e.g. 0.5 and 2.0)
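For reference, the two parameterizations are related by a factor of 2**(1/12) per semitone. A small sketch of the conversion (the helper names are just for illustration, not part of the library):

import math

def semitones_to_ratio(semitones: float) -> float:
    # One semitone up multiplies the frequency by 2**(1/12).
    return 2.0 ** (semitones / 12.0)

def ratio_to_semitones(ratio: float) -> float:
    return 12.0 * math.log2(ratio)

print(semitones_to_ratio(-4))   # ~0.794 (down a third of an octave)
print(semitones_to_ratio(4))    # ~1.260 (up a third of an octave)
print(ratio_to_semitones(0.5))  # -12.0 (a full octave down)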

KentoNishi commented 3 years ago

@iver56 I think what happened is that the force push overwrote all my commits. I'll patch it up and open a new PR