JohannesBuchner / spoken-command-recognition

A large, free audio sample database (10M words pronounced), a test bed for voice activity detection algorithms and for single-syllable word recognition

Can a synthesized command dataset work, or not? #3

Open awoniu opened 5 years ago

awoniu commented 5 years ago

Was this project finished successfully? Is there any conclusion about using a synthesized dataset to train a model? I am thinking about doing a similar experiment to this project and hope somebody can give some suggestions. Thanks!

JohannesBuchner commented 5 years ago

Some projects have started using this data set for preliminary work, and you are more than welcome to do so as well (it is on Kaggle too). I myself do not have the expertise to develop elaborate RNNs etc., and am now focusing on other projects.

awoniu commented 5 years ago

> Some projects have started using this data set for preliminary work, and you are more than welcome to do so as well (it is on Kaggle too). I myself do not have the expertise to develop elaborate RNNs etc., and am now focusing on other projects.

OK. I have tried using a synthesized dataset to train an RNN (GRU+DNN) model; I made the dataset with an open-source toolkit, SoundTouch (http://www.surina.net/soundtouch/). Here is a preliminary result of my work: I took two command-word recordings (one male, the other female), varied the pitch, speed, and tempo, and added noise at different SNR levels, which finally gave me about three thousand command-word audio samples. After training, the model (GRU+DNN) can easily recognize the synthesized command words, but it does not do well on audio from the real world.
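
For concreteness, a minimal sketch of such an augmentation pipeline might look like the following (a reconstruction, not the code actually used above; the filenames, parameter grids, and the soundfile dependency are illustrative assumptions). It drives SoundTouch's `soundstretch` command-line tool to vary pitch and tempo, then mixes in white noise at a target SNR:

```python
# Sketch of the augmentation pipeline described above (a reconstruction,
# not the original code): vary pitch/tempo with SoundTouch's command-line
# tool "soundstretch", then add white noise at a chosen SNR.
import subprocess
import numpy as np
import soundfile as sf  # assumed dependency for WAV I/O

def stretch(src, dst, pitch_semitones=0.0, tempo_pct=0.0):
    """Shift pitch (semitones) and tempo (%) using soundstretch."""
    subprocess.run(
        ["soundstretch", src, dst,
         f"-pitch={pitch_semitones}", f"-tempo={tempo_pct}"],
        check=True,
    )

def add_noise(samples, snr_db):
    """Mix in white noise so the result has the requested SNR (dB)."""
    signal_power = np.mean(samples ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = np.random.normal(0.0, np.sqrt(noise_power), samples.shape)
    return samples + noise

# Expand two seed recordings into many variants (hypothetical filenames).
for seed in ["cmd_male.wav", "cmd_female.wav"]:
    for pitch in (-2, -1, 0, 1, 2):          # semitones
        for tempo in (-10, 0, 10):            # percent
            tmp = f"tmp_p{pitch}_t{tempo}.wav"
            stretch(seed, tmp, pitch, tempo)
            audio, sr = sf.read(tmp)
            for snr in (20, 10, 5):           # dB
                out = add_noise(audio, snr)
                sf.write(f"{seed[:-4]}_p{pitch}_t{tempo}_snr{snr}.wav", out, sr)
```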

JohannesBuchner commented 5 years ago

That is not overly surprising. You probably want to use these synthetic datasets to extend real datasets. You can also try increasing the number of speakers, pronunciations, and emphases, as this project does.

diyism commented 1 year ago

@JohannesBuchner,

Dear sir, is there any progress on this project? I understand that you're busy with X-ray scanning of extraterrestrial planets, but this GitHub project is also very important for scanning the voices of living creatures on this planet.

I found that k2-fsa/sherpa-ncnn (with the model "sherpa-ncnn-streaming-zipformer-zh-14M-2023-02-23") is very good at two-syllable recognition of Mandarin, but there is currently no single-syllable recognition model covering the ~1300 Mandarin syllables (pinyin).

I think your project is promising and very useful in the LLM era, and I very much agree with the opinion you state in this project:

> I do not need to have my computer "translate" sounds into text, or "understand" a meaning.
> I just want to tell my computer a command and it does something. So I only need: soundwaves -> label

So I think that in the LLM era, ASR engines should focus more on recognizing syllables, and the analysis of vocabulary and sentences should be left to large language models (ChatGPT, Claude, etc.).

ref: https://github.com/k2-fsa/sherpa-ncnn/issues/177

JohannesBuchner commented 1 year ago

I think you can also extract recordings of simple words from https://commonvoice.mozilla.org/en/datasets, take an architecture like https://github.com/mozilla/DeepSpeech, and build a classifier mapping audio -> label={1,2,3,4,5,other}.
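
As one possible reading of that suggestion, here is a minimal sketch of an audio -> label classifier (not DeepSpeech itself, which is a full speech-to-text system; the MFCC frontend, layer sizes, and filenames are illustrative assumptions): MFCC features feed a small GRU whose final state is classified into the six labels.

```python
# Minimal sketch of the audio -> label={1,2,3,4,5,other} idea:
# MFCC features -> GRU -> softmax over six classes. Untrained here;
# the training loop over Common Voice clips is omitted.
import librosa
import torch
import torch.nn as nn

LABELS = ["one", "two", "three", "four", "five", "other"]

def mfcc_features(path, sr=16000, n_mfcc=13):
    """Load a clip and return a (time, n_mfcc) feature tensor."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    return torch.tensor(mfcc.T, dtype=torch.float32)

class WordClassifier(nn.Module):
    def __init__(self, n_mfcc=13, hidden=64, n_classes=len(LABELS)):
        super().__init__()
        self.gru = nn.GRU(n_mfcc, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_classes)

    def forward(self, x):        # x: (batch, time, n_mfcc)
        _, h = self.gru(x)       # h: (1, batch, hidden), final state
        return self.head(h[-1])  # logits over the six labels

model = WordClassifier()
feats = mfcc_features("clip.wav").unsqueeze(0)  # hypothetical file
pred = LABELS[model(feats).argmax(dim=1).item()]
print(pred)
```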