asteroid-team / torch-audiomentations

Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.
MIT License

Add PitchShift operation #82

Closed · KentoNishi closed this 3 years ago

KentoNishi commented 3 years ago

This PR is still a work in progress, but here is the gist of it:

The library I made is still undocumented, so I don't want to make it public just yet. If you want to verify that the code is not malicious, let me know so I can add you to the repo!

KentoNishi commented 3 years ago

It's in a working and semi-usable state!

Input

# TODO: WRITE SOME REAL TESTS

import time

import torch
from torch_audiomentations import Compose, PitchShift

# Initialize augmentation callable
apply_augmentation = Compose(
    transforms=[
        PitchShift(16000, p=1),
    ]
)

torch_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Make an example tensor with white noise.
# This tensor represents 8 audio snippets with 2 channels (stereo) and 2 s of 16 kHz audio.
audio_samples = (
    torch.rand(size=(8, 2, 32000), dtype=torch.float32, device=torch_device) - 0.5
)

# Apply augmentation. This pitch-shifts (some of) the audio snippets
# in the batch independently.
start = time.process_time()
perturbed_audio_samples = apply_augmentation(audio_samples, sample_rate=16000)
print(time.process_time() - start)

Output

1.53125

Note: the code was run on a GPU!
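As a side note, CUDA kernels are launched asynchronously, so a CPU-side clock such as time.process_time() may not capture the GPU work. A minimal timing sketch (reusing apply_augmentation and audio_samples from the snippet above, and assuming a CUDA device is available) would synchronize before reading the clock:

import time
import torch

# Synchronize before and after so the wall-clock measurement includes the GPU work.
torch.cuda.synchronize()
start = time.perf_counter()
perturbed_audio_samples = apply_augmentation(audio_samples, sample_rate=16000)
torch.cuda.synchronize()
print(time.perf_counter() - start)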

KentoNishi commented 3 years ago

I haven't implemented any tests, I'll leave that up to you guys :)

Please let me know what you think of what I have so far!

iver56 commented 3 years ago

Thanks for making this, and thank you for your patience. I'll have a look at this soon 👍

KentoNishi commented 3 years ago

I tried using it in my own project, and it seems like there's a memory leak somewhere? VRAM usage keeps increasing when I include pitch shift. Will investigate.
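One way to narrow this down (an illustrative loop only, reusing apply_augmentation and audio_samples from the earlier snippet) is to watch allocated GPU memory across repeated calls; a steadily growing number suggests tensors are being retained somewhere:

import torch

for step in range(100):
    perturbed = apply_augmentation(audio_samples, sample_rate=16000)
    if step % 10 == 0:
        # torch.cuda.memory_allocated() reports bytes currently held by tensors on the GPU.
        print(f"step {step}: {torch.cuda.memory_allocated() / 1e6:.1f} MB allocated")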

iver56 commented 3 years ago

Interesting. The memory leak should ideally be fixed before we merge this :)

KentoNishi commented 3 years ago

Thanks for the reviews, will take a look soon :)

KentoNishi commented 3 years ago

Alright gonna get some sleep now, good night 😴

KentoNishi commented 3 years ago

Adding Batched_Pitch_Shift as a separate class might be a good idea. Currently working on batched shifting in the library itself :)

KentoNishi commented 3 years ago

Implementation of batched transforms is done in the library! Will update the fork when I have time later.

KentoNishi commented 3 years ago

[screenshot: benchmark timings]

Timed in seconds. Really liking how fast it runs! This is with a batch of 8 samples at sr=16000, each clip 2 seconds long.

KentoNishi commented 3 years ago

@iver56 what's new:

iver56 commented 3 years ago

Thanks for the improvements 👍 I will re-review this soon-ish (my availability is a bit limited, thanks for your patience)

iver56 commented 3 years ago

I would like "per_example" to be the default mode. Although "per_batch" is faster, variation within each batch is typically a good idea when training models :)

iver56 commented 3 years ago

The other transforms have "per_example" as the default mode too
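For illustration, here is a sketch of what that default would look like from the user's side, assuming PitchShift accepts the same mode keyword as the other transforms (the exact signature here is an assumption, not the merged API):

from torch_audiomentations import Compose, PitchShift

apply_augmentation = Compose(
    transforms=[
        # mode="per_example" draws an independent pitch shift for each item in the batch.
        PitchShift(sample_rate=16000, p=1.0, mode="per_example"),
    ]
)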

iver56 commented 3 years ago

Something happened to the commits after the force push - I can't see the commits in the pull request anymore

iver56 commented 3 years ago

I think I would also prefer somewhat more modest default parameters - pitch shifting a whole octave up or down is a bit extreme. In audiomentations the default is -4 to +4 semitones. -4 semitones is down a third of an octave, and +4 semitones is up a third of an octave. This default would give a range of two thirds of an octave.

In audiomentations, the pitch shifting parameters are input as semitones. Could that be relevant here too? I personally find it easier to relate to the numbers when they are given in semitones (e.g. -12 and +12) instead of fractions (e.g. 0.5 and 2.0)
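For reference, the two parameterizations are related by a factor of 2**(1/12) per semitone. A small sketch of the conversion (the helper names are just for illustration, not part of the library):

import math

def semitones_to_ratio(semitones: float) -> float:
    # One semitone up multiplies the frequency by 2**(1/12).
    return 2.0 ** (semitones / 12.0)

def ratio_to_semitones(ratio: float) -> float:
    return 12.0 * math.log2(ratio)

print(semitones_to_ratio(-4))   # ~0.794 (down a third of an octave)
print(semitones_to_ratio(4))    # ~1.260 (up a third of an octave)
print(ratio_to_semitones(0.5))  # -12.0 (a full octave down)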

KentoNishi commented 3 years ago

@iver56 I think what happened is that the force push overwrote all my commits. I'll patch it up and open a new PR