buriburisuri / speech-to-text-wavenet

Speech-to-Text-WaveNet : End-to-end sentence level English speech recognition based on DeepMind's WaveNet and tensorflow
Apache License 2.0

Which features to implement now? #27

Open buriburisuri opened 7 years ago

buriburisuri commented 7 years ago

I'm thinking of adding these features now:

1) Docker images

2) Data augmentation

3) Quantitative analysis

Please reply with the features you think are important!

a00achild1 commented 7 years ago

I think data augmentation is important: e.g. shifting the speech frequency, adding background noise, altering the pitch, etc.

Or maybe the database could be expanded for robust online usage. When I test the model with the LibriSpeech dataset, the results are not very good. I was wondering whether it would be possible to add the LibriSpeech "train-other-500" set as training data; maybe performance would improve.
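The noise-mixing idea above can be sketched with plain NumPy. This is only an illustration, not code from this repo; the function names and the default SNR are my own choices:

```python
import numpy as np

def add_noise(wav, snr_db=20.0, rng=None):
    """Mix white noise into a waveform at a target signal-to-noise ratio (dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(wav ** 2)
    # Solve SNR_dB = 10 * log10(signal_power / noise_power) for noise_power.
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=wav.shape)
    return wav + noise
```

Lower `snr_db` values give noisier augmented copies; in practice one would draw the SNR at random per utterance.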

LCLL commented 7 years ago

I agree that data augmentation is the most important.

@a00achild1 Could you share your configs and results on the LibriSpeech dataset? I am going to run some tests on it to check whether the model works better with a larger dataset.

kaihuchen commented 7 years ago

I would love to see a version of this repo that allows me to use it for discovering and generating the characteristics of speech or music samples, similar to what's supported in the original WaveNet. While the original WaveNet's ability to learn from raw waveforms is really cool, I find the MFCC approach adopted in this repo much more practical.

unic0x commented 7 years ago

I also agree that data augmentation is important, but for testing, a Docker image would be a great addition.

migvel commented 7 years ago

Hello,

I tested the software: I trained it over a weekend and it runs.

Of all the data augmentation techniques for speech waveforms, which one do you think would be best to experiment with first?

I understand from http://speak.clsp.jhu.edu/uploads/publications/papers/1050_pdf.pdf that speed perturbation might be worth trying; what do you think?
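For what it's worth, the three-way speed perturbation described in that paper (resampling each utterance at factors 0.9, 1.0, and 1.1) can be approximated in a few lines of NumPy. This is a rough sketch using linear interpolation, not the paper's exact resampler:

```python
import numpy as np

def speed_perturb(wav, factors=(0.9, 1.0, 1.1)):
    """Return one resampled copy of `wav` per speed factor.

    A factor > 1 speeds the audio up (shorter signal); < 1 slows it down.
    Linear interpolation stands in for a proper band-limited resampler.
    """
    copies = []
    for f in factors:
        new_len = int(round(len(wav) / f))
        idx = np.linspace(0, len(wav) - 1, new_len)
        copies.append(np.interp(idx, np.arange(len(wav)), wav))
    return copies
```

Each training epoch would then see three variants of every utterance; note that speeding up also shifts the pitch, which the paper treats as part of the perturbation.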

buriburisuri commented 7 years ago

@migvel Thanks for your information.

I think DeepSpeech (https://arxiv.org/abs/1412.5567) would be a good starting point for augmentation (see Section 4 of the paper). In addition, I plan to apply pitch and speed variation. I think it'll be tough work. T.T

Thank you

cooledge commented 7 years ago

I trained for 20 epochs. The suggested test of recognizing the training data worked great, but when I tried a wav file from outside the dataset, the result was "een ererdi", which is not even close to what was said. I would suggest splitting your data into training and test sets: update the code to train on the training set only, then evaluate on data the network has never seen to measure its real effectiveness.
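The held-out split suggested above is a one-liner in spirit. A minimal sketch, assuming the corpus is just a list of wav paths (the function name and 10% default are illustrative):

```python
import random

def train_test_split(file_ids, test_frac=0.1, seed=0):
    """Hold out a fraction of utterances so evaluation uses unseen audio.

    Splitting is done per file (utterance), never per frame, so no audio
    from a test utterance leaks into training. Seeded for reproducibility.
    """
    ids = sorted(file_ids)          # sort first so the split is deterministic
    random.Random(seed).shuffle(ids)
    n_test = max(1, int(len(ids) * test_frac))
    return ids[n_test:], ids[:n_test]   # (train, test)
```

For speech it is even better to split by speaker rather than by utterance, so the test set contains voices the model has never heard.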

a00achild1 commented 7 years ago

@LCLL Sorry for the late reply. I used SoX to augment the LibriSpeech dataset and trained on it with another project, but I haven't gotten any good results so far; the loss always diverges after a few iterations.

And I haven't tried the LibriSpeech dataset with this project yet.

buriburisuri commented 7 years ago

@a00achild1 I've just downloaded the LibriSpeech dataset. I'll try VCTK + Libri + augmentation for better generalization.

a00achild1 commented 7 years ago

@buriburisuri Great! I'm going to try combining Libri, too.

pandeydivesh15 commented 7 years ago

Hello everyone. I am currently doing my GSoC project on speech recognition, based on Deep Speech. I wanted to ask you about data augmentation: what approach did you finally adopt? Was it the one given in the paper itself, or something else? Thank you.

MXGray commented 7 years ago

@buriburisuri A language model would definitely be a good feature: it could probably output correctly punctuated and cased text, aside from possibly increasing inference accuracy. :)

noetits commented 6 years ago

Does this implementation use the "fast wavenet" idea you mention? Basically, they cache previous values so they don't have to be computed again: https://github.com/tomlepaine/fast-wavenet

If not, it would probably be a great feature...
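For context, the fast-wavenet trick is just a per-layer queue: during sample-by-sample generation, a dilated causal conv with dilation d needs the activation from t - d, so you keep the last d inputs in a buffer instead of recomputing the whole receptive field. A toy single-channel sketch (not code from this repo or from tomlepaine/fast-wavenet):

```python
from collections import deque

class CachedDilatedLayer:
    """One generation step of a dilated causal conv with kernel size 2.

    `queue` holds the last `dilation` inputs, so each step reads the
    input from t - dilation in O(1) instead of re-running the conv
    over the whole history.
    """
    def __init__(self, dilation, w_past, w_cur):
        self.queue = deque([0.0] * dilation, maxlen=dilation)
        self.w_past, self.w_cur = w_past, w_cur

    def step(self, x):
        past = self.queue[0]   # cached input from t - dilation
        self.queue.append(x)   # maxlen deque drops the oldest entry
        return self.w_past * past + self.w_cur * x
```

In a real WaveNet each layer has vector-valued activations and gated units, but the queue-per-layer structure is the same, turning generation from O(2^L) work per sample into O(L).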