Open · kimchi88 opened this issue 5 years ago
I want to ask the same question too. In my case, using librosa.feature.melspectrogram and then computing librosa.feature.mfcc does not match Kaldi's process.
BTW, did you find a way to rebuild the audio?
Hi, nope, still nothing, but I've read some other posts and it doesn't seem trivial. There is a post in the Kaldi GitHub repository where the developers discuss their findings after applying SpecAugment to existing Kaldi recipes. Hope it helps!
Thank you for your kind reply. I will look for it.
I spent a few hours on this yesterday, and this is what I finally settled on, at least for now. Sorry for the delay in sharing it. The new version of librosa seems to include the functionality we need here, see #844. However, it is not released yet, so you have to install from source. Version 0.7.0rc1 is what I used. You could do
recov = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram,
                                             hop_length=128,
                                             sr=sampling_rate)
and use this function to save it
def save_wav(wav, path):
    # Scale to the 16-bit PCM range, guarding against silent or near-silent signals.
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
    scipy.io.wavfile.write(path, 16000, wav.astype(np.int16))
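For completeness, calling the two together is just the following (the output path here is only an example):

# Write the reconstructed audio (recov from mel_to_audio above) as 16-bit PCM at 16 kHz.
save_wav(recov, 'augmented_sample.wav')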
Hi Roxima, Thanks for sharing! I'll give it a try :)
@kimchi88 Great. Looking forward to your results.
Confirmed! It works perfectly. The next step will be to use the augmented audio to improve ASR. Thanks for the help!
Hi @roxima / @kimchi88,
Can you please confirm the time taken to convert from mel spectrogram to wav, and your hardware configuration? For me it takes 2 to 3 minutes on a CPU with 6 cores and 8 GB RAM.
@darisettysuneel As far as I can remember, it finishes very quickly. What took time was the augmentation, not saving the resulting audio. I'll try to report back to you with a simple benchmark.
Hi @roxima,
Are there any statistics I could look at yet?
@roxima Hi, converting the mel spectrogram back to wav takes me more time than augmenting the wav itself. Do you have any better solution? Thanks.
@darisettysuneel @Lomax314 So sorry for being late, I was as busy as a bee. I'm on Windows 10 x64, i3-6100U, 8 GB DDR4 RAM, 128 GB SSD storage. This is the result for the default sample audio in the repository:
Loaded audio in 0:00:00.509608
Tensorflow finished in 0:00:02.145270
librosa reconstructed audio in 0:00:25.873811
Audio saved in 0:00:00.005016
PyTorch finished in 0:00:00.050832
librosa reconstructed audio in 0:00:29.923980
Audio saved in 0:00:00.004015
As can be seen, reconstructing the audio takes much more time than the augmentations. However, I noticed that running this script uses more than 8 GB of free space on my OS drive, leaving me with only 141 MB free, so maybe there is an I/O bottleneck?! No, I have not found a better solution; maybe librosa isn't fully optimized for this stage yet.
The previous run used librosa 0.7.0rc1; this one is with the latest 0.7.0 release:
Loaded audio in 0:00:00.512629
Tensorflow finished in 0:00:02.180432
librosa reconstructed audio in 0:00:20.358577
Audio saved in 0:00:00.006011
PyTorch finished in 0:00:00.045847
librosa reconstructed audio in 0:00:43.839765
Audio saved in 0:00:00.004988
One more run:
Loaded audio in 0:00:00.505621
Tensorflow finished in 0:00:02.230296
librosa reconstructed audio in 0:00:32.860149
Audio saved in 0:00:00.006980
PyTorch finished in 0:00:00.052857
librosa reconstructed audio in 0:00:46.224405
Audio saved in 0:00:00.005985
@roxima Thanks for sharing the statistics! May I know the length of the audio files for the provided results?
@darisettysuneel You're welcome. Exactly 2 s 970 ms.
@roxima For me it takes ~1.5 minutes for 8-10 s of audio. I need to take a look at the input data to the reconstruction function. Once again, thanks.
@roxima Thanks a lot for your reply! The librosa function takes too much time for me, so I hope I can find another solution. Once again, thanks.
@darisettysuneel @Lomax314: Did you find any better method to achieve this?
@AASHISHAG I'm sorry, but the answer is no. However, this method seems to be implemented as a function in the Kaldi repository.
@Lomax314 : Thank you for the reply. I will have a look.
If you still have the setup running, could you please help me with the tensorflow, tensorflow_addons, and gcc versions? I am trying to run the test script as given in the README but am getting some errors on from specAugment import spec_augment_tensorflow
import glob

import scipy.io.wavfile
import librosa
import numpy as np
from specAugment import spec_augment_tensorflow

mozilla_augmented = '/mozilla_augmented/clips/*.wav'

for audio_path in glob.iglob(mozilla_augmented):
    print(audio_path)
    # Load the clip and compute its mel spectrogram.
    audio, sampling_rate = librosa.load(audio_path)
    mel_spectrogram = librosa.feature.melspectrogram(y=audio,
                                                     sr=sampling_rate,
                                                     n_mels=256,
                                                     hop_length=128,
                                                     fmax=8000)
    # Apply SpecAugment (time warping plus frequency/time masking).
    warped_masked_spectrogram = spec_augment_tensorflow.spec_augment(mel_spectrogram=mel_spectrogram)
    # Invert the augmented mel spectrogram back to a waveform.
    wav = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram,
                                               hop_length=128,
                                               sr=sampling_rate)
    # Scale to 16-bit PCM and overwrite the original file.
    wav *= 32767 / max(0.01, np.max(np.abs(wav)))
    scipy.io.wavfile.write(audio_path, 16000, wav.astype(np.int16))
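One caveat with this snippet (an observation about the code itself, not something confirmed in the thread): librosa.load resamples to 22050 Hz by default, while the file is written back at a hard-coded 16000 Hz, which would change the playback speed. If the clips are meant to stay at 16 kHz, forcing the load rate keeps both ends consistent:

# Load at a fixed 16 kHz so the write rate used above matches the audio.
audio, sampling_rate = librosa.load(audio_path, sr=16000)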
It takes me 10 minutes for 10 seconds of audio, and the machine has 88 cores with 500 GB of memory. I use the code above to convert back to audio. Do you have any better solution, maybe with torchaudio? Thanks.
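For the torchaudio route, something along these lines might work. This is a rough, untested sketch: InverseMelScale and GriffinLim are the relevant torchaudio transforms, and the n_fft, hop_length, n_mels, and sample-rate values below are assumptions that have to match however the mel spectrogram was produced.

import torch
import torchaudio

# Assumed analysis parameters; they must match the forward mel computation.
n_fft = 2048
hop_length = 128
n_mels = 256
sample_rate = 22050

# Map the mel spectrogram back to an approximate linear power spectrogram.
inverse_mel = torchaudio.transforms.InverseMelScale(
    n_stft=n_fft // 2 + 1, n_mels=n_mels, sample_rate=sample_rate, f_max=8000.0)
# Recover a waveform from the power spectrogram with Griffin-Lim.
griffin_lim = torchaudio.transforms.GriffinLim(
    n_fft=n_fft, hop_length=hop_length, power=2.0)

# warped_masked_spectrogram: (n_mels, time) array from the augmentation step.
mel = torch.as_tensor(warped_masked_spectrogram, dtype=torch.float32)
wav = griffin_lim(inverse_mel(mel))  # 1-D waveform tensor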
@juunnn: Could you please confirm your tensorflow and gcc versions? I am facing some dependency issues, and I think they have to do with tensorflow and gcc. It would be best if you could share the output of pip3 list, which lists all the installed versions.
I still have problems with the TF dependencies, which is why I use PyTorch instead. It works and doesn't take long to execute, but for some audio it says "output have no finite value everywhere" while converting back to audio. I don't know what to do.
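Not sure whether it applies here, but when spectrogram inversion complains about non-finite values, one crude workaround is to replace NaN/Inf entries in the augmented spectrogram before inverting. np.nan_to_num is a standard NumPy call; whether this fixes the underlying cause of that message is only an assumption.

import numpy as np

# Replace NaN and infinite entries with zeros before Griffin-Lim inversion.
clean_spec = np.nan_to_num(warped_masked_spectrogram, nan=0.0, posinf=0.0, neginf=0.0)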
@juunnn: Could you please share the code that you wrote with the PyTorch dependencies? I don't have exposure to either PyTorch or TensorFlow, so it would be really helpful.
I am using the same code I posted above and am facing dependency issues.
It indeed takes a lot of time to convert from the mel spectrogram back to audio. If someone comes across a faster way than the librosa built-in, please share.
For 1 minute of audio with 128 mels:
CPU times: user 8min 32s, sys: 5min 11s, total: 13min 43s
Wall time: 7min 14s
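One knob that may help with the speed, with a quality trade-off that depends on the data: librosa.feature.inverse.mel_to_audio exposes an n_iter argument for its Griffin-Lim solver, and lowering it from the default shortens reconstruction roughly proportionally. A sketch, reusing the parameters from the code above:

# Fewer Griffin-Lim iterations -> faster but rougher reconstruction.
wav = librosa.feature.inverse.mel_to_audio(M=warped_masked_spectrogram,
                                           sr=sampling_rate,
                                           hop_length=128,
                                           n_iter=8)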
Any new updates for possibly faster implementations?
Hi, do you have any suggestions on how to rebuild the audio file after augmentation?