DemisEom / SpecAugment

An implementation of SpecAugment in TensorFlow & PyTorch, introduced by Google Brain
Apache License 2.0
641 stars 136 forks

from mel_spectrogram to wav again #10

Open kimchi88 opened 5 years ago

kimchi88 commented 5 years ago

Hi, Do you have any suggestion about how to re-build the audio file after augmentation?

KnowBetterHelps commented 5 years ago

I have the same question. In my case, computing librosa.feature.melspectrogram and then librosa.feature.mfcc does not match Kaldi's feature pipeline.

BTW, did you find the way to re-build audio?

kimchi88 commented 5 years ago

Hi, nope.. still nothing. From what I've read elsewhere, it doesn't seem trivial. There is a thread in the Kaldi GitHub repository where the developers discuss their findings after applying SpecAugment to existing Kaldi recipes. Hope it helps!

KnowBetterHelps commented 5 years ago

Thank you for your kind reply.

I will look for it.

dkakaie commented 5 years ago

I spent a few hours on this yesterday; this is what I finally settled on, at least for now. Sorry for the delay in sharing. A new version of librosa includes the functionality we need here, see #844. However, it is not released yet, so you have to install from source. Version 0.7.0rc1 is what I used. You could do

recov = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram,
                                             hop_length=128, sr=sampling_rate)

and use this function to save it

import numpy as np
import scipy.io.wavfile

def save_wav(wav, path):
    # Scale to the int16 range; the 0.01 floor guards against division by ~0 on silent clips
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
    scipy.io.wavfile.write(path, 16000, wav.astype(np.int16))

kimchi88 commented 5 years ago

Hi Roxima, Thanks for sharing! I'll give it a try :)

dkakaie commented 5 years ago

@kimchi88 Great. Looking forward to your results.

kimchi88 commented 5 years ago

Confirmed! It works perfectly. The next step will be to use the augmented audio to improve ASR. Thanks for the help!

darisettysuneel commented 5 years ago

Hi @roxima / @kimchi88,

Can you please confirm how long the mel-spectrogram-to-wav conversion takes, and on what hardware? For me it takes 2 to 3 minutes on a CPU with 6 cores and 8 GB RAM.

dkakaie commented 5 years ago

@darisettysuneel As far as I remember it finishes very quickly; what took time was the augmentation, not saving the resulting audio. I'll try to report back to you with a simple benchmark.

darisettysuneel commented 5 years ago

Hi @roxima

Could I get any statistics?

Lomax314 commented 5 years ago

@roxima Hi, converting the mel_spectrogram back to wav takes me more time than augmenting the wav. Do you have a better solution? Thanks

dkakaie commented 5 years ago

@darisettysuneel @Lomax314 So sorry for being late, I was as busy as a bee. I'm on Windows 10 x64, i3-6100U, 8 GB DDR4 RAM, 128 GB SSD storage. This is the result for the default sample audio in the repository:

Loaded audio in  0:00:00.509608
Tensorflow finished in  0:00:02.145270
librosa reconstructed audio in  0:00:25.873811
Audio saved in  0:00:00.005016
PyTorch finished in  0:00:00.050832
librosa reconstructed audio in  0:00:29.923980
Audio saved in  0:00:00.004015

As can be seen, reconstructing the audio takes much more time than the augmentation itself. However, I noticed that running this script uses more than 8 GB of free space on my OS drive, leaving me with only 141 MB free, so maybe there is an I/O bottleneck. No, I have not found a better solution; maybe librosa isn't fully optimized for this stage yet.

dkakaie commented 5 years ago

The previous run used librosa 0.7.0rc1; this is for the latest 0.7.0 release:

Loaded audio in  0:00:00.512629
Tensorflow finished in  0:00:02.180432
librosa reconstructed audio in  0:00:20.358577
Audio saved in  0:00:00.006011
PyTorch finished in  0:00:00.045847
librosa reconstructed audio in  0:00:43.839765
Audio saved in  0:00:00.004988

One more

Loaded audio in  0:00:00.505621
Tensorflow finished in  0:00:02.230296
librosa reconstructed audio in  0:00:32.860149
Audio saved in  0:00:00.006980
PyTorch finished in  0:00:00.052857
librosa reconstructed audio in  0:00:46.224405
Audio saved in  0:00:00.005985

darisettysuneel commented 5 years ago

@roxima Thanks for sharing the statistics! May I know the length of the audio files for the provided results?

dkakaie commented 5 years ago

@darisettysuneel You're welcome. Exactly 2 s 970 ms.

darisettysuneel commented 5 years ago

@roxima For me it takes ~1.5 minutes for 8-10 s audio. I need to take a look at the input data to the reconstruction function. Once again, thanks.

Lomax314 commented 5 years ago

@roxima Thanks a lot for your reply! The librosa function takes a long time for me, so I hope to find another solution. Once again, thanks.

AASHISHAG commented 5 years ago

@darisettysuneel @Lomax314 : Did you find any better method to achieve this?

Lomax314 commented 5 years ago

@AASHISHAG I'm sorry, but the answer is no. However, this method seems to be implemented as a function in the Kaldi repository.

AASHISHAG commented 5 years ago

@Lomax314 : Thank you for the reply. I will have a look.

If you still have the setup running, could you please share your tensorflow, tensorflow_addons, and gcc versions? I am trying to run the test script given in the readme but I get errors on from specAugment import spec_augment_tensorflow

import glob
import scipy.io.wavfile
import librosa
import numpy as np
from specAugment import spec_augment_tensorflow

mozilla_augmented = '/mozilla_augmented/clips/*.wav'

for audio_path in glob.iglob(mozilla_augmented):
    print(audio_path)
    audio, sampling_rate = librosa.load(audio_path)
    mel_spectrogram = librosa.feature.melspectrogram(y=audio,
                                                     sr=sampling_rate,
                                                     n_mels=256,
                                                     hop_length=128,
                                                     fmax=8000)
    warped_masked_spectrogram = spec_augment_tensorflow.spec_augment(mel_spectrogram=mel_spectrogram)
    wav = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram, hop_length=128, sr=sampling_rate)
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
    # Write at the loaded sampling rate (librosa.load resamples to 22050 Hz by default)
    scipy.io.wavfile.write(audio_path, sampling_rate, wav.astype(np.int16))

junaedifahmi commented 4 years ago

> @roxima For me it is taking ~1.5 minutes for 8-10sec audio. I need to take a look at input data to reconstruction function. Once again thanks.

It takes me 10 minutes for 10 s audio, and my machine has 88 cores with 500 GB of memory. I use the code above to convert back to audio. Do you have a better solution, maybe with torchaudio? Thanks.

AASHISHAG commented 4 years ago

@juunnn : Could you please confirm your tensorflow and gcc versions? I am facing a dependency issue, which I think has to do with tensorflow and gcc. The best would be if you could give the output of the following command: pip3 list

This will list all the versions.

junaedifahmi commented 4 years ago

I still have problems with the TensorFlow dependencies; that's why I use PyTorch instead. It works and doesn't take long to execute, but for some audio it says "output have no finite value everywhere" while converting back to audio. I don't know what to do.

AASHISHAG commented 4 years ago

@juunnn : Could you please share the code you wrote with the PyTorch dependencies? I don't have experience with either PyTorch or TensorFlow, so it would be really helpful.

I am using the code below and facing dependency issues.

import glob
import scipy.io.wavfile
import librosa
import numpy as np
from specAugment import spec_augment_tensorflow

mozilla_augmented = '/mozilla_augmented/clips/*.wav'

for audio_path in glob.iglob(mozilla_augmented):
    print(audio_path)
    audio, sampling_rate = librosa.load(audio_path)
    mel_spectrogram = librosa.feature.melspectrogram(y=audio,
                                                     sr=sampling_rate,
                                                     n_mels=256,
                                                     hop_length=128,
                                                     fmax=8000)
    warped_masked_spectrogram = spec_augment_tensorflow.spec_augment(mel_spectrogram=mel_spectrogram)
    wav = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram, hop_length=128, sr=sampling_rate)
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
    # Write at the loaded sampling rate (librosa.load resamples to 22050 Hz by default)
    scipy.io.wavfile.write(audio_path, sampling_rate, wav.astype(np.int16))

ma7555 commented 4 years ago

It indeed takes a lot of time to convert from mel_spectrogram back to audio; if someone comes across a faster way than librosa's built-in function, please share.

For a 1 minute audio with 128 mels

CPU times: user 8min 32s, sys: 5min 11s, total: 13min 43s
Wall time: 7min 14s
neel04 commented 3 years ago

Any new updates for possibly faster implementations?