castorini / honk

PyTorch implementations of neural network models for keyword spotting
http://honk.ai/
MIT License

Any chance Honk 2 might be in the wings? #109

Closed StuartIanNaylor closed 4 years ago

StuartIanNaylor commented 4 years ago

PyTorch now has torchaudio, so librosa and its install problems on certain architectures are no longer needed.
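For example, a minimal sketch (my assumed parameters and filename, not Honk code) of computing MFCC features with torchaudio alone:

```python
# Minimal sketch: MFCC features with torchaudio only, no librosa.
# Assumes torchaudio >= 0.5 and a 16 kHz mono clip ("yes.wav" is a placeholder).
import torchaudio

waveform, sample_rate = torchaudio.load("yes.wav")  # shape: (channels, samples)

mfcc = torchaudio.transforms.MFCC(
    sample_rate=sample_rate,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)

features = mfcc(waveform)  # shape: (channels, n_mfcc, frames)
print(features.shape)
```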

Also, newer models such as CRNN and DS-CNN look interesting; it would be great if some of these incremental additions could be integrated into a Honk 2 example.

Stuart

lintool commented 4 years ago

Check out https://github.com/castorini/howl - https://arxiv.org/abs/2008.09606

StuartIanNaylor commented 4 years ago

Oh, apologies, Firefox Voice threw me a curveball and I didn't look closely.

Just had a look at the requirements.txt:

pydantic==1.5.1
webrtcvad==2.0.10
numpy==1.18.3
torch>=1.0.0,<=1.5.0
tqdm
pandas==1.0.3
soundfile==0.10.3.post1
torchaudio==0.5.0
librosa==0.7.2
torchvision<=0.6.99,>=0.6.0
pyaudio
praat-textgrids==1.3.1

My Python skills are awful :) and, while trying to hack something together, librosa has caused a lot of trouble on AArch64 with a Raspberry Pi.

What I have been trying to do is build on the great Hotword Model Generator (HMG) from LinTO, since I think the Sonopy MFCC is not that well optimised and PyTorch seems to be gaining preference over Keras: https://github.com/linto-ai/linto-desktoptools-hmg. HMG has been an eye-opener for me: I had assumed the Google Speech Commands set was verified, but with HMG's simple test and verification tools it turns out that roughly 8% of the samples are just junk.

The requirements above are confusing because they include the problematic librosa (I think it's the numba JIT compiler that is the source of the problems), but why librosa when torchaudio is already in place? I am also asking out of interest, as I was presuming the torchaudio routines would be far better optimised than librosa's Python code run through the numba compiler.
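Out of curiosity, this is the kind of rough side-by-side I had in mind; the parameters and the synthetic clip are just my assumptions, not anything from howl:

```python
# Rough timing sketch: librosa.feature.mfcc vs torchaudio.transforms.MFCC
# on the same 1-second clip (random noise as a stand-in for speech).
import timeit

import librosa
import numpy as np
import torch
import torchaudio

sr = 16000
clip = np.random.randn(sr).astype(np.float32)
waveform = torch.from_numpy(clip).unsqueeze(0)  # (1, samples)

mfcc_torch = torchaudio.transforms.MFCC(
    sample_rate=sr,
    n_mfcc=40,
    melkwargs={"n_fft": 400, "hop_length": 160, "n_mels": 40},
)

t_librosa = timeit.timeit(
    lambda: librosa.feature.mfcc(
        y=clip, sr=sr, n_mfcc=40, n_fft=400, hop_length=160, n_mels=40
    ),
    number=100,
)
t_torch = timeit.timeit(lambda: mfcc_torch(waveform), number=100)

print(f"librosa:    {t_librosa:.3f} s / 100 runs")
print(f"torchaudio: {t_torch:.3f} s / 100 runs")
```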

Then again, there is webrtcvad in there, and my only disappointment with torchaudio is its sox-like VAD implementation rather than something more like webrtcvad (a quick sketch of the difference is below, after the quote). I do keep thinking, though, that all these FFT routines are looking at the same frames in multiple threads unnecessarily. It's like PyTorch-PCEN, which is brilliant: it could probably drop librosa and use torchaudio now, and as it says:

-- all the while being resource-efficient and easy to implement
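Roughly what I mean by a "more webrtcvad" style, as a sketch with assumed parameters rather than anything from howl:

```python
# Sketch: frame-level VAD with webrtcvad, in contrast to torchaudio's sox-style
# VAD, which (in recent torchaudio releases) trims leading silence from a clip,
# roughly:  trimmed = torchaudio.functional.vad(waveform, sample_rate)
import webrtcvad

SAMPLE_RATE = 16000          # webrtcvad supports 8000/16000/32000/48000 Hz
FRAME_MS = 30                # frames must be 10, 20, or 30 ms of 16-bit mono PCM
FRAME_BYTES = SAMPLE_RATE * FRAME_MS // 1000 * 2

vad = webrtcvad.Vad(2)       # aggressiveness from 0 (least) to 3 (most)

def speech_frames(pcm_bytes):
    """Yield (byte_offset, is_speech) for each 30 ms frame of 16-bit mono PCM."""
    for start in range(0, len(pcm_bytes) - FRAME_BYTES + 1, FRAME_BYTES):
        frame = pcm_bytes[start:start + FRAME_BYTES]
        yield start, vad.is_speech(frame, SAMPLE_RATE)
```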

Hasn't PCEN already done much of what a VAD needs anyway? It probably is resource-efficient, and so is webrtcvad, but when you add them all up together maybe not, because so much of the work in MFCC creation is just repetition. That's my torchaudio angst, and maybe more is to come, because if you are going to have such a thing as torchaudio you would think the aim would be for it to be an all-in-one for PyTorch audio ...?
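To show the kind of repetition I mean, a sketch (assumed parameters, not howl's pipeline) that computes the mel spectrogram once and reuses it for both MFCCs and a crude energy gate, instead of re-running the STFT in every component:

```python
# Sketch: one mel spectrogram, reused for MFCCs (via a DCT) and a simple
# energy-based gate, so the FFT is only run once per frame. Parameters assumed.
import torch
import torchaudio

sample_rate = 16000
waveform = torch.randn(1, sample_rate)  # 1 s placeholder clip

n_mels, n_mfcc = 40, 40
mel_spec = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate, n_fft=400, hop_length=160, n_mels=n_mels
)

mels = mel_spec(waveform)                     # (1, n_mels, frames) -- computed once
log_mels = torch.log(mels + 1e-6)

# MFCCs are just a DCT over the log-mel bins, so the same spectrogram is reused.
dct_mat = torchaudio.functional.create_dct(n_mfcc, n_mels, norm="ortho")  # (n_mels, n_mfcc)
mfccs = torch.matmul(log_mels.transpose(1, 2), dct_mat).transpose(1, 2)   # (1, n_mfcc, frames)

# A crude energy gate can also reuse the same frames instead of another FFT pass.
frame_energy = mels.sum(dim=1)                # (1, frames)
is_speech = frame_energy > frame_energy.mean()
```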

I am only sharing my frustration that, with limited skill, it is hard not being able to find what you would think would be a default part of Linux infrastructure by now :)

I have a hunch that librosa and a few of those requirements on AArch64 on a lowly Pi, with my skill set, might not see fruition, but thanks for the link and I will give it a go.

Cheers.