asteroid-team / torch-audiomentations

Fast audio data augmentation in PyTorch. Inspired by audiomentations. Useful for deep learning.
MIT License

Loading multiple files into a tensor? #168

Closed asusdisciple closed 9 months ago

asusdisciple commented 9 months ago

I am a bit unfamiliar with your approach to loading data. As far as I understand, I need to load multiple files into a multi-dimensional tensor in order to process them as a batch. After that, of course, I have to deconstruct the tensor again and save it back into multiple files. How can I do that? I started loading the files in a loop with torchaudio.load(), but I am not sure how to combine them into a single tensor that I can then pass to the API.

iver56 commented 9 months ago

If it is easier for you to get a proof of concept working, you can use tensors with a batch size of 1. That means you give it one sound as input.
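For instance, a minimal batch-of-one sketch could look like this (the file path is a placeholder and the pipeline is trimmed down to a single transform); torch-audiomentations expects input tensors of shape (batch_size, num_channels, num_samples):

import torchaudio
from torch_audiomentations import Compose, PeakNormalization

# A single transform, just to have something callable in this sketch
apply_augmentation = Compose(transforms=[PeakNormalization(p=1.0)])

waveform, sample_rate = torchaudio.load("some_file.wav")   # placeholder path; shape: (num_channels, num_samples)
batch = waveform.unsqueeze(0)                              # shape: (1, num_channels, num_samples)
augmented = apply_augmentation(samples=batch, sample_rate=sample_rate)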

asusdisciple commented 9 months ago

What I do at the moment for a single file is this:


import torchaudio
import torch
from torch_audiomentations import Compose, AddBackgroundNoise, PeakNormalization, AddColoredNoise

# Initialize augmentation callable
apply_augmentation = Compose(
    transforms=[
        PeakNormalization(p=1.0),
        AddBackgroundNoise(p=1.0,
                           background_paths="...",
                           min_snr_in_db=0,
                           max_snr_in_db=25),
        AddColoredNoise(p=0.3),

    ]
)

torch_device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load each file, keep the first channel and add batch and channel dimensions,
# so the tensor has shape (batch_size=1, num_channels=1, num_samples).
files = ["..."]
for file in files:
    f = torchaudio.load(file)[0]          # waveform of shape (num_channels, num_samples)
    f = f[0][None, None, :]               # first channel -> shape (1, 1, num_samples)
    perturbed_audio_samples = apply_augmentation(samples=f, sample_rate=16000)

    torchaudio.save(".../test.wav", perturbed_audio_samples[0], sample_rate=16000)

Of course, what I would rather do is call torchaudio.load() in a loop and load all files into a single tensor, which I then augment. After that I would like to loop over one dimension of the tensor and save each file separately. However, I can't reproduce the example with a tensor of size=(x, y, z), where x is the number of files, y=1 for mono, and z is the audio data, which can also vary in length depending on the duration of each file.

iver56 commented 9 months ago

If I were you, I would initialize a "placeholder" tensor and then insert each audio snippet into it, one by one.

asusdisciple commented 9 months ago

But how do you do that? If I use torch.empty(size=(20, 1, 320000)) for example (20 files, mono, 20 s of 16 kHz audio), the last dimension is always fixed.

iver56 commented 9 months ago

Yes, that should do it. An underlying assumption when doing batch processing like this is that all your audio snippets have the same length. If not, you could zero-pad or trim the audio snippets so their lengths match. Often, when training a machine learning model that takes audio as input, all the audio snippets in a batch need to have the same length because of the way tensors and common model architectures work. That's where this assumption comes from.

If your use case is different, maybe batch processing isn't for you?
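A rough sketch of that placeholder approach, assuming mono 16 kHz files, a placeholder file_paths list and a fixed length of 320000 samples; the output file names in the commented-out part are just illustrative:

import torch
import torchaudio

file_paths = ["..."]        # placeholder: your own list of audio file paths
max_len = 320000            # e.g. 20 s at 16 kHz; longer snippets get trimmed
batch = torch.zeros(len(file_paths), 1, max_len)   # (batch_size, num_channels, num_samples), mono
lengths = []

for i, path in enumerate(file_paths):
    waveform, _ = torchaudio.load(path)            # (num_channels, num_samples)
    n = min(waveform.shape[1], max_len)            # trim if too long
    batch[i, 0, :n] = waveform[0, :n]              # insert; the rest stays zero (padding)
    lengths.append(n)

# After augmenting the whole batch, the stored lengths can be used to cut the padding off again:
# augmented = apply_augmentation(samples=batch, sample_rate=16000)
# for i, n in enumerate(lengths):
#     torchaudio.save(f"augmented_{i}.wav", augmented[i, :, :n], sample_rate=16000)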

asusdisciple commented 9 months ago

I implemented it the way you proposed. Interestingly, there is no performance gain. The reason seems to be that the performance bottleneck is torchaudio.save() in my case, because I have more than 70,000 small audio files to process. Anyway, if somebody comes across the same problem: don't bother with loading everything into multiple big tensors, the performance is the same.

iver56 commented 9 months ago

"the performance is the same"

It depends on the transforms, the batch size, the audio length and the hardware. E.g. ApplyImpulseResponse and LowPassFilter will typically be faster when using batched compute on GPU. But I believe you when you say that in your case the perf is the same 👍
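Whether batched GPU compute actually pays off in a given setup can be measured directly. A small timing sketch, using the transforms already shown in this thread and a made-up batch of random audio; CUDA kernels run asynchronously, so torch.cuda.synchronize() is needed for a fair measurement:

import time
import torch
from torch_audiomentations import Compose, PeakNormalization, AddColoredNoise

# Transforms taken from this thread; a filter-style transform could be swapped in instead
apply_augmentation = Compose(
    transforms=[
        PeakNormalization(p=1.0),
        AddColoredNoise(p=1.0, min_snr_in_db=5, max_snr_in_db=30),
    ]
)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
# A made-up batch: 64 mono snippets of 10 s at 16 kHz
batch = torch.rand(64, 1, 160000, device=device)

if device.type == "cuda":
    torch.cuda.synchronize()  # make sure pending GPU work is done before timing
start = time.perf_counter()
augmented = apply_augmentation(samples=batch, sample_rate=16000)
if device.type == "cuda":
    torch.cuda.synchronize()
print(f"Batched augmentation took {time.perf_counter() - start:.3f} s on {device}")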

asusdisciple commented 9 months ago

To be honest, I am not sure if it is computed on the GPU anymore. I see a spike in memory usage, but the utilization stays at 0% all the time. If you want to check it out, here is my code. It works, but the call to apply_augmentation() is very slow. I don't know if I have to initialize the device differently for CUDA?

Maybe the reason is that the background noise files still have to be loaded each time? They are also very small, about 10k samples, so each time a file is augmented, a noise file needs to be loaded as well.

import torchaudio
import torch
from torch_audiomentations import Compose, AddBackgroundNoise, PeakNormalization, AddColoredNoise
import os
from tqdm import tqdm
# Initialize augmentation callable
apply_augmentation = Compose(
    transforms=[
        PeakNormalization(p=1.0),
        AddBackgroundNoise(p=1.0,
                           background_paths="...",
                           min_snr_in_db=0,
                           max_snr_in_db=30),
        AddColoredNoise(p=0.3,
                        min_snr_in_db=5,
                        max_snr_in_db=30),

    ]
)

torch_device = "cuda:0"

# Get list of files
file_list = []
for root, dirs, files in os.walk("..."):
    for file in files:
        file_list.append(os.path.join(root, file))

file_list = file_list[0:1001]

file_list.sort()
size_num = 1000 # number of files per tensor
# Create empty tensor to be filled later
placeholder = torch.empty(size_num, 400000, dtype=torch.float32, device=torch_device)
cnt = 0
paddings = []
files = []
tensors = []

# Iterate through file list
for idx, file in tqdm(enumerate(file_list)):
    # When the last file is reached, flush the partially filled batch
    if idx == len(file_list) - 1:
        # append a tuple of (padded_audio_batch, paddings, file_names) to the list
        tensors.append((placeholder[:len(paddings), None, :], paddings, files))
        break

    # load the file, compute the padding, then zero-pad to the fixed length
    tmp = torchaudio.load(file, normalize=False)[0]
    pad = 400000 - tmp.shape[1]
    padded = torch.nn.ConstantPad1d((0, pad), 0)(tmp)

    # append to list for later use
    files.append(file)
    paddings.append(pad)
    placeholder[cnt, None, :] = padded

    cnt += 1
    # once size_num files have been read into the tensor, append it to the list and reset
    if cnt == size_num:
        tensors.append((placeholder[:, None, :], paddings, files))
        placeholder = torch.empty(size_num, 400000, dtype=torch.float32, device=torch_device)
        paddings = []
        files = []
        cnt = 0

for tensor in tqdm(tensors):
    # perturb each big batch tensor (this call is very slow, why?)
    perturbed = apply_augmentation(samples=tensor[0], sample_rate=16000)

    for i in range(0, len(tensor[1])):
        # cut away the zero-padding again: the original length is 400000 minus the padding
        sample = perturbed[i, :, :400000 - tensor[1][i]].cpu()
        torchaudio.save(tensor[2][i], sample, sample_rate=16000)

asusdisciple commented 9 months ago

After some testing: it is indeed computed on the GPU. So the bottleneck might be the background noise files used for augmentation, which need to be loaded for each batch separately. E.g. if I have 1000 snippets in a big tensor (1000, 1, file_length) and want to augment them all with different noise files, there will be some loading overhead. Maybe it would make sense to accept tensors or arrays for the AddBackgroundNoise transform instead of path lists?
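AddBackgroundNoise currently takes background paths rather than tensors, so something like that would have to be hand-rolled for now. A rough, hypothetical sketch of the idea (the function name and the noise_bank tensor are made up, not part of torch-audiomentations; the SNR scaling follows the usual 20·log10 RMS-ratio definition):

import torch

def add_background_noise_from_tensor(batch, noise_bank, min_snr_in_db=0.0, max_snr_in_db=30.0):
    # batch: (batch_size, num_channels, num_samples); noise_bank: (num_noises, noise_samples)
    # Hand-rolled sketch, not part of the torch-audiomentations API.
    batch_size, num_channels, num_samples = batch.shape

    # pick one noise snippet per audio snippet, then tile/crop it to the target length
    idx = torch.randint(0, noise_bank.shape[0], (batch_size,), device=batch.device)
    noise = noise_bank[idx]                                       # (batch_size, noise_samples)
    repeats = (num_samples + noise.shape[1] - 1) // noise.shape[1]
    noise = noise.repeat(1, repeats)[:, :num_samples]             # (batch_size, num_samples)
    noise = noise.unsqueeze(1).expand(-1, num_channels, -1)       # (batch_size, num_channels, num_samples)

    # scale the noise so it sits at a random SNR (in dB) relative to the signal
    snr_db = torch.empty(batch_size, 1, 1, device=batch.device).uniform_(min_snr_in_db, max_snr_in_db)
    signal_rms = batch.square().mean(dim=(1, 2), keepdim=True).sqrt()
    noise_rms = noise.square().mean(dim=(1, 2), keepdim=True).sqrt().clamp_min(1e-9)
    desired_noise_rms = signal_rms / (10.0 ** (snr_db / 20.0))
    return batch + noise * (desired_noise_rms / noise_rms)

Whether this would actually be faster than AddBackgroundNoise depends on how much of the total time really goes into loading the noise files.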

iver56 commented 9 months ago

Maybe 🤷 I hope you will be able to figure out a good solution. The code is open and permissively licensed, in case you want to try to improve it.

You could also consider audiomentations, which is more actively maintained and more battle-hardened: https://github.com/iver56/audiomentations