ARM-software / ML-KWS-for-MCU

Keyword spotting on Arm Cortex-M Microcontrollers
Apache License 2.0

CRNN question #12

Closed meixitu closed 6 years ago

meixitu commented 6 years ago

Dear @navsuda, sorry to bother you again.

  1. In Figure 3 of your paper, the CRNN feeds the outputs of all GRU timesteps to the fully-connected layer, but in the code only the last timestep's output of the GRU is fed to the fully-connected layer. I think the original CRNN paper ("Convolutional Recurrent Neural Networks for Small-Footprint Keyword Spotting") uses all the timesteps. What was your consideration here?
  2. The original CRNN paper also uses a bidirectional GRU and layer normalization, but your code does not seem to use either. What was your reasoning?
  3. The original CRNN paper uses DeepSpeech2 to align the keyword audio, but Google's Speech Commands dataset does not align the keyword; it only guarantees that the keyword is somewhere in the *.wav file, i.e. the format is [random length of filler, keyword, remaining filler]. Actually I can't tell how the original CRNN paper aligns the keyword: is it [zeros, keyword], [keyword, zeros], or [random filler, keyword, remaining filler]? Did you consider this? I think alignment should affect performance, because the fully-connected layer cannot handle time-shift invariance on its own.
  4. Do you know where I can get another large keyword spotting dataset? Google's Speech Commands dataset only has about 2,000 audio files per keyword, which may not be enough.

Thanks Jinhong

navsuda commented 6 years ago

Hi @zhangjinhong17 ,

  1. You are right, the original paper concatenates all the time steps and feeds them into the fully-connected layer, but on this dataset that did not give any higher accuracy than using just the last time step.
  2. Bi-directional GRU and layer normalization are additional hyperparameters that didn't seem to improve accuracy on this dataset. On a new dataset you may have to try these options too. Note that if you use a bi-directional GRU, you may have to concatenate all the timesteps before the fully-connected layer (see the sketch after this list).
  3. From my understanding, in the CRNN paper the alignments generated from DeepSpeech2 would be of the format [silence, silence, silence,...,T,T,A,A,A,L,K,K,T,T,I,I,I,M,M,E,silence,silence], which would be converted to [filler, keyword, residual filler]. For more details on the CTC loss used in DeepSpeech2, see this. If you use such a frame-level aligned dataset to train the model, it should ideally give a more accurate model. You are right that a fully-connected layer can't handle time-shift invariance unless you train with it in the dataset (random time-shift augmentation helps a bit here). If you get a chance to generate alignments for the Speech Commands dataset, please consider open-sourcing/sharing them.
  4. I have not heard of another keyword spotting dataset as large as this one.
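As a concrete illustration of points 1 and 2, here is a minimal TF 1.x sketch (not the repository's exact code; `gru_to_fc` and its arguments are made up for illustration) of feeding either the last GRU timestep or all concatenated timesteps to the fully-connected layer, optionally with a bidirectional GRU:

```python
import tensorflow as tf  # TensorFlow 1.x, matching the APIs used in this repo


def gru_to_fc(inputs, num_units, num_classes,
              use_all_timesteps=False, bidirectional=False):
    """Sketch of the two options: feed only the last GRU timestep to the
    fully-connected layer, or concatenate all timesteps as in the original
    CRNN paper. `inputs` is [batch, time, features] with a statically known
    time dimension (fixed 1-second clips)."""
    if bidirectional:
        cell_fw = tf.nn.rnn_cell.GRUCell(num_units)
        cell_bw = tf.nn.rnn_cell.GRUCell(num_units)
        (out_fw, out_bw), _ = tf.nn.bidirectional_dynamic_rnn(
            cell_fw, cell_bw, inputs, dtype=tf.float32)
        outputs = tf.concat([out_fw, out_bw], axis=-1)  # [batch, time, 2*units]
    else:
        cell = tf.nn.rnn_cell.GRUCell(num_units)
        outputs, _ = tf.nn.dynamic_rnn(cell, inputs, dtype=tf.float32)

    if use_all_timesteps:
        # Original CRNN paper: flatten (concatenate) every timestep.
        time_steps, feat = outputs.get_shape().as_list()[1:3]
        flow = tf.reshape(outputs, [-1, time_steps * feat])
    else:
        # This repository's choice: keep only the last timestep's output.
        flow = outputs[:, -1, :]

    return tf.layers.dense(flow, num_classes)
```

With a bidirectional GRU the last timestep of the backward pass has only seen one frame, which is why concatenating all timesteps usually makes more sense in that configuration.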
meixitu commented 6 years ago

Hi @navsuda , thanks for your help. The reason I am concerned about the bidirectional RNN and alignment is that my own training results are worse than what the original CRNN paper claims, and I want to figure out why. The bidirectional RNN helps a little, but still not enough. It is really tough.

  1. I found that layer normalization is really very slow, even though I have a GPU. Do you know why?
  2. I found that a bidirectional RNN using only the last timestep still works.
  3. The website you mentioned explains the principle of the CTC algorithm. But in other documents, the CTC output is sparse (spiky); it does not have so many repeated outputs. So I think the alignment may just be [filler, keyword]; a small sketch of collapsing a frame-level alignment is below.
    I don't know how to get the alignments for the Speech Commands dataset; it would take a lot of time to do manually. Thank you again.
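For example, assuming a frame-level alignment of the kind described above (one label per frame), it could be collapsed into the [filler, keyword, residual filler] split with a plain-Python sketch like this (the `keyword_segment` helper is hypothetical, just for illustration):

```python
def keyword_segment(frame_labels, filler="silence"):
    """Return (first_frame, last_frame) of the keyword region, i.e. the
    boundaries splitting [filler, keyword, residual filler]."""
    voiced = [i for i, label in enumerate(frame_labels) if label != filler]
    if not voiced:
        return None  # no keyword frames found in this clip
    return voiced[0], voiced[-1]


# Example using the frame-level labels from the reply above:
frames = ["silence"] * 3 + list("TTAAALKKTTIIIMME") + ["silence"] * 2
print(keyword_segment(frames))  # -> (3, 18)
```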

Jinhong

navsuda commented 6 years ago

Hi @zhangjinhong17, I'm not sure why layer normalization is slow. Can you check if your GPU is being utilized at all?
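Besides watching `nvidia-smi`, a minimal way to check from TensorFlow itself (TF 1.x; just a quick sanity-check sketch, not part of this repository):

```python
import tensorflow as tf  # TensorFlow 1.x

# Does TensorFlow see a GPU at all?
print("GPU available:", tf.test.is_gpu_available())

# Log where each op is placed; if everything lands on /cpu:0,
# the layer-normalized GRU will indeed run slowly.
with tf.Session(config=tf.ConfigProto(log_device_placement=True)) as sess:
    a = tf.random_normal([1024, 1024])
    b = tf.random_normal([1024, 1024])
    sess.run(tf.matmul(a, b))
```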

meixitu commented 6 years ago

Hi @navsuda , the GPU is working; I can check its status. I will also check the layer normalization speed in your original code. Thanks, Jinhong

navsuda commented 6 years ago

Closing the issue due to inactivity, please reopen it if you still face the issue.