I have the same question. We can run `python evaluate.py --manifest val:/path/to/manifest.csv --model_file /path/to/saved_model.prm --inference_file /path/to/output_pickle_file` and list only one wav file in the manifest. However, the manifest requires a second data element: the transcript file, and you have to already know the wav's content in order to write that transcript.
If I don't know the content and just want to transcribe the wav file, how should I do that?
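For context, one workaround is to build a placeholder manifest programmatically. This is only a sketch: the two-column layout (audio path, transcript path) and the file names are assumptions, not something confirmed by this repo's docs.

```python
# Sketch: build a one-entry manifest with a dummy transcript so evaluate.py
# can be pointed at a single wav. The two-column layout (audio path,
# transcript path) is an assumed manifest format, not taken from the docs.
import csv

wav_path = "audio.wav"          # placeholder: your input recording
transcript_path = "dummy.txt"   # placeholder transcript, ignored for inference

with open(transcript_path, "w") as f:
    f.write("placeholder transcript\n")

with open("manifest.csv", "w", newline="") as f:
    csv.writer(f).writerow([wav_path, transcript_path])
```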
@p8778ter I am also waiting for that. I guess it is difficult to do it this way, because the dataloader and some of the functions are opaque enough that it's hard for me to even try modifying them and getting results.
I also tried to modify evaluate.py. I used a dummy transcript file as a placeholder in the manifest and used the pre-trained model just to predict, with no validation. I got a result, but it is terribly wrong. I need the Intel deep speech team to give us some insight.
Yes, the results are terrible and I don't know why.
I've created a branch `tyler/evaluate_single` with a script called `evaluate_single.py` in it. I've tested it briefly and it seems to work as expected for files from LibriSpeech's dev-clean dataset.
To use it, first check out my branch. Then, from within the speech directory you should be able to run `./evaluate_single.py --model_file <model_file> <sequence of audio files>`. For instance: `./evaluate_single.py --model_file /data/librispeech_16_epochs.prm /data/LibriSpeech/dev-clean/1272/128104/1272-128104-*.flac`. It should print out something like:

```
File: /data/LibriSpeech/dev-clean/1272/128104/1272-128104-0001.flac
Transcript: NOR IS MISTER QOLTERS MANNER LESS INTEESTING THUN HIS MATTER
```
If this works well for you, I'll make a PR and get it in shortly.
I tried this evaluate_single.py. It works.
However, the CER is still high if the audio was encoded at 32 kHz. Using 16 kHz is better; the CER at 16 kHz is about 60%.
One sentence that should be "that the principle concerns that our central bank has had for a number of years most visibly since" was predicted as "THATTE PRINCE VO ONCERN THAT THE PAERSENTRAL BANK AS HADRN UMBER BEAR ESAMOST VISIBLY SINSAKE".
Another issue is that the predicted sentence is truncated and the output is very short. I input a 60-second sound file and it only produced about 5 seconds' worth of transcription. Could we make the prediction output longer, or do we have to split the 60 seconds into 12 small pieces?
Two other important questions:
1. What are the best audio encoding parameters?
2. If I use my own data to continue training librispeech_16_epochs.prm, could it improve the CER?
Thanks
Thanks for the feedback! I forgot to mention that the sample rate is hardcoded to 16 kHz, as it was in LibriSpeech. At the moment, Aeon (the dataloader) doesn't support variable sample rates, so the rate must be provided. There are a few other encoding restrictions that you can find here: http://aeon.nervanasys.com/index.html/provider_audio.html.
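Since the model expects 16 kHz input, audio at other rates needs to be resampled first. A minimal sketch using scipy (an assumption; sox or ffmpeg work just as well, and the file names are placeholders):

```python
# Sketch: resample a 32 kHz wav down to the 16 kHz the pre-trained model
# expects. Uses scipy (an assumption); file names are placeholders.
import numpy as np
from scipy.io import wavfile
from scipy.signal import resample_poly

rate, samples = wavfile.read("input_32k.wav")     # e.g. rate == 32000
resampled = resample_poly(samples, 16000, rate)   # up/down sampling ratio
resampled = np.asarray(resampled, dtype=samples.dtype)
wavfile.write("input_16k.wav", 16000, resampled)
```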
Trimming the sound should produce better results. We found that longer duration clips are generally much harder for the network.
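Along those lines, one way to handle a long recording is to split it into short chunks before transcription. A minimal sketch with Python's standard `wave` module (the 5-second chunk length and file names are assumptions, not values recommended in this thread):

```python
# Sketch: split a long recording into short fixed-length chunks before
# transcription. Chunk length and file names are placeholders.
import wave

CHUNK_SECONDS = 5

with wave.open("long_input.wav", "rb") as src:
    params = src.getparams()
    frames_per_chunk = params.framerate * CHUNK_SECONDS
    index = 0
    while True:
        frames = src.readframes(frames_per_chunk)
        if not frames:
            break
        with wave.open(f"chunk_{index:03d}.wav", "wb") as dst:
            dst.setparams(params)   # frame count is rewritten on close
            dst.writeframes(frames)
        index += 1
```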
Adding your own data for training should help things. Librispeech is a very specific style of read-speech, which can make it difficult to generalize to different types of speech.
Hello, where can I find evaluate_single.py? There is no such file now. @tyler-nervana Thank you!
Hi, you can find the file in a branch here: https://github.com/NervanaSystems/deepspeech/blob/tyler/evaluate_single/speech/evaluate_single.py.
I would like to know: is there any method to directly parse a wav file and get the output of the model as a text file, without using any manifest file?