Closed · lunixbochs closed this 4 years ago
Hi, you can find the Python bindings, which you can plug in directly if you are linking with PyTorch, here: https://github.com/facebookresearch/wav2letter/tree/master/bindings/python
> Why is this happening? Is it due to the HTK-style mfsc filterbanks?
Yes. We tried to match the HTK features as closely as possible in our implementation. Here is a unit test where we test the MFCC features - https://github.com/facebookresearch/wav2letter/blob/master/src/feature/test/MfccTest.cpp
> (I tried reproducing the mfsc filterbanks with librosa and got completely different weights with e.g. `librosa.filters.mel(16000, 512, 40, htk=True)` and I still don't completely understand why)
To get mfsc features, one needs to take the log of the mel filterbank features. Here is pseudo code that should give you the closest result to wav2letter:
```python
import librosa
import numpy

SAMPLING_RATE = 16000
FRAME_LENGTH = 400  # 25 ms window at 16 kHz; adjust to your config
FRAME_SHIFT = 160   # 10 ms hop at 16 kHz
N_COEFFS = 40       # number of mel filters


def extract_logmel_librosa(filename):
    wav, _ = librosa.load(filename, sr=SAMPLING_RATE)
    # missing preemphasis
    spec = librosa.stft(
        wav, n_fft=FRAME_LENGTH, hop_length=FRAME_SHIFT,
        win_length=FRAME_LENGTH, window="hamming", center=False,
    )
    mel = librosa.feature.melspectrogram(
        S=numpy.abs(spec), sr=SAMPLING_RATE,
        n_mels=N_COEFFS, fmax=SAMPLING_RATE / 2,
    ).T
    logmel = librosa.power_to_db(mel)
    return logmel.astype(numpy.float32)
```
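The `# missing preemphasis` line above can be filled in with a standard first-order high-pass filter; a minimal sketch, assuming the common default coefficient of 0.97 (not confirmed for wav2letter in this thread, and note that HTK-style pipelines apply pre-emphasis per frame, so this whole-signal version is only an approximation):

```python
import numpy


def preemphasis(wav, coeff=0.97):
    # y[t] = x[t] - coeff * x[t-1]; the first sample is kept unchanged
    return numpy.append(wav[0], wav[1:] - coeff * wav[:-1])
```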
Thanks for the info!
I know about the python bindings - remember, I'm already shipping my own wav2letter fork to end users, with a custom decoder and python bindings through my C API.
I'm working with PyTorch specifically to add new targets for my desktop app (such as Windows, pre-Haswell CPUs, ROCm, and GPU/CPU support in the same distribution). I'm also trying to keep my app size down, so I'll probably be doing custom size-constrained builds of PyTorch, and I definitely don't want to ship both PyTorch and flashlight. So the goal right now is a pure PyTorch, feature- and bug-compatible forward pass using the same model weights and network arch file.
At this point it's mostly feature complete (framing, featurization, Viterbi, model export + arch file loader, and forward pass are all working); I'm just trying to track down and understand implementation details like this one. All that's left, major-feature-wise, is my decoder, and ASG loss if I feel like doing compatible training in PyTorch (which would be nice for retraining on end-user machines, or doing even more advanced data augmentation).
I've been writing a pytorch frontend, and I'm currently porting mfsc featurization.
I just noticed SpeechUtils scales the float samples to integer range: https://github.com/facebookresearch/wav2letter/blob/master/src/libraries/feature/SpeechUtils.cpp#L31-L34
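To match that in a librosa/numpy pipeline, multiplying the float samples before featurization should reproduce it; a sketch assuming the scale factor is the int16 range, 2**15 = 32768, which is what the linked lines appear to do:

```python
import numpy


def scale_to_int16_range(wav):
    # librosa/soundfile return floats in [-1.0, 1.0); wav2letter's
    # featurization expects values in the 16-bit integer sample range
    return wav * 32768.0
```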
Why is this happening? Is it due to the HTK-style mfsc filterbanks? (I tried reproducing the mfsc filterbanks with librosa and got completely different weights with e.g. `librosa.filters.mel(16000, 512, 40, htk=True)`, and I still don't completely understand why.)