Hi @n1243645679976!
The table that you have mentioned presents the WER scores on libri-test-clean, so you do not need to add any SNR-related arguments. You can simply use `test.py`/`test_enhanced.py`. The command seems to have `--model-path` missing; you should probably add that unless you have hard-coded it at your end. I have made minor edits to `test.py` and added `utils_orig.py`, which should eliminate some of the errors. Note that greedy decoding (Viterbi decoding) is the default setting.
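For reference, here is a minimal sketch of what greedy (best-path) CTC decoding does (the standard collapse-repeats-then-drop-blanks rule); this is illustrative only, not the exact decoder in this repo:

```python
# Minimal sketch of greedy (best-path) CTC decoding: pick the argmax label at each
# frame, collapse consecutive repeats, then drop blanks. Illustrative only.
import numpy as np

def ctc_greedy_decode(log_probs, labels, blank_index=0):
    """log_probs: (time, num_labels) array; labels: list mapping index -> character."""
    best_path = np.argmax(log_probs, axis=1)            # most likely label per frame
    collapsed = [p for i, p in enumerate(best_path)     # collapse consecutive repeats
                 if i == 0 or p != best_path[i - 1]]
    return "".join(labels[p] for p in collapsed if p != blank_index)  # drop blanks
```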
Hope this helps! Best, Archiki
Thanks for your quick response and explanation!
I am trying `test_enhanced.py` for testing now, but the noise-related code (L36, L40-41, L90-91, L106-114, L154) still gives me error messages, so I commented those lines out and forced it to evaluate. If there is any problem, I'll comment in this issue again.
By the way, can I ask for the log of the training/development loss and the checkpoint after fine-tuning? I want to use it to compare with my experiment results.
Best, Cheng-Hung Hu.
I am not sure which lines you are referring to (L36 is the parser code for epochs), but one quick fix is to supply the SNR arguments while keeping `--noise-dir` empty or None and setting `--noise-prob 0`. This will not add any noise and the evaluation can proceed.
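For intuition, here is a rough, hypothetical sketch of how such a `noise_prob` setting typically gates noise injection in a data loader (not this repo's actual implementation); with `noise_prob = 0` or no noise directory, the audio passes through unchanged:

```python
# Rough, hypothetical sketch of how noise_prob typically gates noise injection in a
# data loader; not this repo's actual implementation.
import random

def maybe_add_noise(waveform, noise_injector, noise_prob):
    # With noise_prob = 0 (or noise_injector is None because no --noise-dir was given),
    # the waveform passes through untouched and evaluation runs on the audio as-is.
    if noise_injector is not None and random.random() < noise_prob:
        return noise_injector.inject(waveform)
    return waveform
```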
Hi @archiki,
I'm testing with `test_enhanced.py`, so for example, L36 means here. I didn't change the parameters related to audio_conf_noise, such as `--noise-prob` and `--noise-dir`, since `noise-prob` is set to 0 in your code and the target test set is already noisy.
Here's the process I used to test the noisy dataset:
I modified `test_enhanced.py` as I commented two days ago, where L154 is changed to `half=args.half, wer_dict=wer_dict)` (removing the `args.SNR` argument).
Then, after grouping the .wav files in the custom dataset (test_noisy_speech) by SNR and noise type, I tested them with the pretrained model, and I found that the WER and CER differ from the results shown in Table 2 of your paper. For instance, the table shows a WER of 35.0 for Car at 0 dB, but the WER I got is 45.683.
My questions are:
Best, Cheng-Hung Hu
Hey @n1243645679976,
I have applied the fix you suggested for L154 of `test_enhanced.py`. However, at my end, I am able to reproduce the results mentioned in the table. I am attaching an image of the command as well as the results generated. I hope this will give you some clarity.
So the answer to your question 1 is yes, it's the same model. The answer to your question 2 is difficult for me to give since I don't have access to your setup. However, I have provided the test set used, as you mentioned. I would recommend you double-check your manifest files to ensure that you have matched each audio file with the correct transcript text. You should also check whether you can re-create the clean WER of 10.3 using the standard libri-test-clean set.
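In case it helps, here is a small sanity-check sketch assuming the usual deepspeech.pytorch-style manifest format (one `wav_path,txt_path` line per utterance); the manifest file name below is just a placeholder, so adjust it to your setup:

```python
# Quick sanity check of a deepspeech.pytorch-style manifest: each line should be
# "wav_path,txt_path" and both files should exist. The format is an assumption;
# adjust if your manifest layout differs.
import os

def check_manifest(manifest_path):
    with open(manifest_path) as f:
        for line_no, line in enumerate(f, 1):
            wav_path, txt_path = line.strip().split(",")[:2]
            for path in (wav_path, txt_path):
                if not os.path.isfile(path):
                    print(f"line {line_no}: missing file {path}")

check_manifest("libri_test_clean_manifest.csv")  # placeholder file name
```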
> Then, after grouping the .wav files in the custom dataset (test_noisy_speech) by SNR and noise type, I tested them with
I am not sure what you mean by this; there is no need for you to group anything. As long as you add all the files in `test_noisy_speech` to the manifest appropriately, the testing code can group the files by noise type and SNR. Hope this helps.
Best, Archiki
Hi @archiki,
Thank you for running the experiment and the image is very helpful!
I can reproduce the experiment result now!
I found that the WER in your table and the WER in the row starting with `Test Summary` are derived differently, and this was the source of my confusion...
This is what happened:
First, I mistakenly assumed that the WER in your table and the WER in the row starting with `Test Summary` are derived in the same way.
I commented out the `print_summary` part in your code (L106-L114) because I thought that even if I grouped the wav files by SNR and noise type and tested only those, it would still give me the same result as in the table.
So, when I first grouped the wav files by Car and 0 dB and tested them, I could only get the WER in the row starting with `Test Summary`, which gave me a WER (32.56) different from yours (35.0), so I was confused and could not see where the difference came from.
Thanks for your help!
Best, Cheng-Hung Hu
Yes, the difference between the two is that under `Test Summary`, WER is calculated as [sum of all edit distances in the test set] / [sum of the lengths of all transcripts], instead of averaging the per-utterance ratios [edit distance / transcript length]. Hope that makes sense to you.
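For concreteness, here is a rough sketch (not the repo's exact code) of the two aggregations, using the `editdistance` package as a stand-in for whatever edit-distance routine the code uses:

```python
# Rough sketch of the two WER aggregations discussed above; not the repo's exact code.
# Uses the `editdistance` package (pip install editdistance) as a stand-in.
import editdistance

def corpus_wer(hyps, refs):
    """Test Summary style: sum of edit distances / sum of reference lengths."""
    total_edits = sum(editdistance.eval(h.split(), r.split()) for h, r in zip(hyps, refs))
    total_words = sum(len(r.split()) for r in refs)
    return 100.0 * total_edits / total_words

def mean_utterance_wer(hyps, refs):
    """Per-utterance style: average of (edit distance / reference length) over utterances."""
    ratios = [editdistance.eval(h.split(), r.split()) / len(r.split())
              for h, r in zip(hyps, refs)]
    return 100.0 * sum(ratios) / len(ratios)

# The two numbers generally differ: every utterance counts equally in the averaged
# version, regardless of its length.
hyps = ["the cat", "hello there friend"]
refs = ["the cat sat on the mat", "hello there friend"]
print(corpus_wer(hyps, refs), mean_utterance_wer(hyps, refs))
```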
Hi @archiki, I am trying to evaluate the model checkpoint you provided here by running the commands below.
The commands are based on the ones here, and the only difference between them is the testing script. All of them throw exceptions, and the error messages are:
Though I can modify the files to get past these exceptions, it would take time to find a way to reproduce the experiment results provided in the table...
So, my question is: is `test.py` the script to reproduce the experiment results in the table?
Best, Cheng-Hung Hu