NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to compute log filterbank energies with audio_preprocessing (ASR collection in NeMo) so that they match python_speech_features? #1787

Closed trangtv57 closed 3 years ago

trangtv57 commented 3 years ago

I want to reproduce the log-filterbank energies computed by the python_speech_features library, but using torchaudio. This is my code:

import librosa
import python_speech_features
import torchaudio

# load audio data with librosa
path_audio = "audio_a.wav"
y, sr = librosa.load(path_audio, sr=16000, offset=0.5, duration=0.4)

# load audio data with torchaudio
audio_ft, sr = torchaudio.load(path_audio)
audio_ft = audio_ft.squeeze(0)
y_torch = audio_ft[int(0.5*16000):int(0.9*16000)]

# both loaders return the same samples; next I compute the log filterbank energies
# log filterbank energies computed by the python_speech_features library
ft_f_bank = python_speech_features.logfbank(y, samplerate=16000, winlen=0.025, winstep=0.01, nfilt=64,nfft=512)
print(ft_f_bank.shape) # result: (39, 64)

# log filterbank energies computed by the FilterbankFeatures module in audio_preprocessing (ASR collection in NeMo)
self.featurizer = FilterbankFeatures(sample_rate=16000, n_window_size=int(0.025 * 16000), n_window_stride=int(0.01 * 16000), n_fft=64, log=True)
ft_by_f_bank_nemo = self.featurizer(y_torch)  # result shape: (41, 64)

# log filterbank energies computed by the torchaudio Kaldi compliance module
ft_f_bank_by_torch = torchaudio.compliance.kaldi.fbank(y_torch, sample_frequency=16000.0, frame_length=25.0, frame_shift=10.0, use_log_fbank=True, use_energy=True, num_mel_bins=64)
print(ft_f_bank_by_torch.shape) # result: (38, 65)

How can I make the output of the FilterbankFeatures module in NeMo, or of torchaudio, match python_speech_features? I don't have a deep understanding of speech features, so the question may seem odd, sorry. Thank you. My versions: torchaudio 0.6.0, pytorch 1.6.0.

nithinraok commented 3 years ago

In the FilterbankFeatures class, we pad the time dimension of the features to a multiple of pad_to (16 by default) for faster processing. Try setting it to -1 and run again.
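
For anyone comparing the shapes in the question, here is a minimal sketch of the frame-count arithmetic, not code from any of the three libraries. It assumes the 0.4 s clip at 16 kHz (6400 samples), a 400-sample window and a 160-sample hop. The function names are hypothetical. The three rules are my reading of how python_speech_features frames a signal (the last partial frame is zero-padded and kept), how Kaldi-style framing with snip_edges=True behaves (the last partial frame is dropped), and how a centered STFT with an even n_fft counts frames. The padding helper only illustrates rounding up to a multiple of pad_to and assumes that non-positive values disable padding. The extra 65th column in the torchaudio output comes from use_energy=True, which adds an energy column to the 64 mel bins.

```python
import math


def frames_zero_padded_tail(num_samples, win_length, hop_length):
    # python_speech_features style (assumed): the last partial frame is
    # zero-padded and kept, so the frame count is rounded up.
    if num_samples <= win_length:
        return 1
    return 1 + math.ceil((num_samples - win_length) / hop_length)


def frames_snip_edges(num_samples, win_length, hop_length):
    # Kaldi style with snip_edges=True (assumed): only complete frames are
    # kept, so the frame count is rounded down.
    if num_samples < win_length:
        return 0
    return 1 + (num_samples - win_length) // hop_length


def frames_centered_stft(num_samples, hop_length):
    # Centered STFT (assumed, even n_fft): the signal is padded by n_fft // 2
    # on both sides, so the count depends only on the hop length.
    return 1 + num_samples // hop_length


def round_up_to_multiple(num_frames, pad_to):
    # Illustrative padding rule: extend the time axis to the next multiple
    # of pad_to; non-positive values are assumed to disable padding.
    if pad_to <= 0:
        return num_frames
    return math.ceil(num_frames / pad_to) * pad_to


if __name__ == "__main__":
    n, win, hop = 6400, 400, 160  # 0.4 s at 16 kHz, 25 ms window, 10 ms hop
    print(frames_zero_padded_tail(n, win, hop))  # 39
    print(frames_snip_edges(n, win, hop))        # 38
    print(frames_centered_stft(n, hop))          # 41
    print(round_up_to_multiple(41, 16))          # 48
    print(round_up_to_multiple(41, -1))          # 41
```

Under these assumptions, the 39, 38 and 41 frames reported in the question follow from the framing conventions alone, so changing pad_to by itself would not be expected to make the frame counts equal across the three libraries.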