dgsmith1988 / ECSE-552-Final-Project


Librosa resampling question #1

Open dgsmith1988 opened 2 years ago

dgsmith1988 commented 2 years ago

https://github.com/dgsmith1988/ECSE-552-Final-Project/blob/9cedc77a11f532e0a7171f5aa4608921b6e9ed81/Code/audio_data_loader.py#L38

It seems Librosa already handles mono conversion, resampling, and duration enforcement in the librosa.load() call. I suspect it's more efficient to rely on what they've implemented there, and it would also reduce the number of function calls. https://librosa.org/doc/main/generated/librosa.load.html
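For illustration, a single call along these lines should cover all three (the sample rate and duration here are placeholders, not values from our code):

```python
import librosa

# sr / mono / duration are illustrative, not the project's settings
y, sr = librosa.load(
    "speech.wav",    # hypothetical path
    sr=16000,        # resample to 16 kHz on load
    mono=True,       # mix down to a single channel (the default)
    duration=5.0,    # read at most the first 5 seconds of audio
)
```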

Also, I see two potential issues with using a duration measured from the start of the file to enforce a length. The first is that truncation could introduce a discontinuity and extra spectral energy, depending on where the speech stops. The second is that if there is a large amount of silence at the beginning of the file, the code could cut off speech content, which is what we actually need.

In terms of the first issue, I think it could be handled by applying a non-rectangular window to the frame once the speech has been properly isolated. In terms of the second issue, it seems Librosa already has a function to mitigate this: https://medium.com/@vvk.victory/audio-processing-librosa-split-on-silence-8e1edab07bbb

A better approach might be (sketched in code after the list):

  1. Load file and resample/mix down to mono using Librosa.load()
  2. Isolate speech using the split on silence technique
  3. Zero-pad/truncate signal to meet length requirements
  4. Apply a window to mitigate any discontinuities which might have been introduced
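Roughly, in code (a sketch only; the silence threshold, sample rate, and target length are illustrative):

```python
import librosa
import numpy as np

TARGET_SR = 16000              # illustrative values only
TARGET_LEN = 5 * TARGET_SR     # e.g. five seconds of audio

# 1. Load, resample, and mix down to mono in one call
y, sr = librosa.load("speech.wav", sr=TARGET_SR, mono=True)

# 2. Isolate speech: keep only the non-silent intervals
intervals = librosa.effects.split(y, top_db=30)
y = np.concatenate([y[start:end] for start, end in intervals])

# 3. Zero-pad or truncate to the target length
y = librosa.util.fix_length(y, size=TARGET_LEN)

# 4. Short fades at the edges to soften any truncation discontinuity
fade = int(0.01 * TARGET_SR)               # 10 ms ramps
ramp = np.linspace(0.0, 1.0, fade)
y[:fade] *= ramp
y[-fade:] *= ramp[::-1]
```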

What are your thoughts on the matter? If this sounds reasonable to you and time allows, could you make the changes to support this?

maxsolomonhenry commented 2 years ago

Hi Graham,

Thanks for this thoughtful analysis.

  1. Load file and resample/mix down to mono using Librosa.load()

I think it's a great idea to add resampling into the read. I must have missed that option.

Better still would be torchaudio.load, because that can go straight to the GPU (or has the option to, I believe).

(I haven't looked into torchaudio yet because it's not available for my local development environment (Mac M1).)
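Something like this, I imagine (path and target rate are placeholders; the explicit .to() call is what moves the data onto the GPU):

```python
import torch
import torchaudio

# torchaudio.load returns a CPU tensor of shape (channels, frames)
waveform, sr = torchaudio.load("speech.wav")
waveform = waveform.mean(dim=0, keepdim=True)                   # mix down to mono
waveform = torchaudio.functional.resample(waveform, sr, 16000)  # resample to 16 kHz
if torch.cuda.is_available():
    waveform = waveform.to("cuda")                              # continue on the GPU
```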

  2. Isolate speech using the split on silence technique

I agree about clipping the silence up front. As for clipping silence between words, I think it's informative to keep the original silence and pacing, as this might be telling of the language.

  3. Zero-pad/truncate signal to meet length requirements

I'm using a repeating technique, rather than zero padding, as per the paper cited in the code. Otherwise we truncate, yes.
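For anyone following along, the repeat-then-truncate idea is roughly this (a sketch, not our exact implementation):

```python
import numpy as np

def repeat_to_length(y: np.ndarray, target_len: int) -> np.ndarray:
    """Tile a short signal until it covers target_len, then truncate."""
    if len(y) >= target_len:
        return y[:target_len]                  # long signals are simply truncated
    reps = int(np.ceil(target_len / len(y)))   # repeats needed to cover the target
    return np.tile(y, reps)[:target_len]
```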

  4. Apply a window to mitigate any discontinuities which might have been introduced

It is standard practice to clip without windowing. I know that seems a little primitive, but that's the overwhelming precedent. We could do it, but I'm not convinced it's worth the cycles.

maxsolomonhenry commented 2 years ago

I think the big issue here is to get things moving as quickly as possible. I.e., let's get everything happening on the GPU as much as possible, and as early as possible (torchaudio for mel spectrograms, e.g.).
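E.g., something along these lines for the mel spectrograms (parameter values are illustrative):

```python
import torch
import torchaudio

device = "cuda" if torch.cuda.is_available() else "cpu"

# Build the transform once and keep it on the device
mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=1024, hop_length=256, n_mels=80
).to(device)

waveform = torch.randn(1, 16000 * 5, device=device)  # stand-in for a real signal
mel_spec = mel(waveform)                              # computed on-device
```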

dgsmith1988 commented 2 years ago

Those justifications work for me. Is it possible to add an option to select the mel spectrogram computation method so we could compare the two? This would also support different development environments before things are run on Google Colab.
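E.g., a backend flag along these lines (a hypothetical helper; names and parameter values are made up):

```python
import librosa
import numpy as np
import torch
import torchaudio

def mel_spectrogram(y: np.ndarray, sr: int, backend: str = "torchaudio") -> np.ndarray:
    """Compute a mel spectrogram with a selectable backend."""
    if backend == "librosa":
        return librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80
        )
    if backend == "torchaudio":
        transform = torchaudio.transforms.MelSpectrogram(
            sample_rate=sr, n_fft=1024, hop_length=256, n_mels=80
        )
        return transform(torch.from_numpy(y).float()).numpy()
    raise ValueError(f"unknown backend: {backend}")
```

Note the two libraries use different mel filterbank defaults (e.g. Slaney vs. HTK scaling), so the outputs won't match exactly without aligning those parameters.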