Hi @n1243645679976!
The table that you have mentioned presents the WER scores on libri-test-clean, so you do not need to add any SNR-related arguments. You can simply use `test.py`/`test_enhanced.py`. The command seems to have `--model-path` missing; you should probably add that unless you have hard-coded it at your end. I have made minor edits to `test.py` and added `utils_orig.py`, which should eliminate some of the errors. Note that greedy decoding (Viterbi decoding) is the default setting.
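For reference, here is a minimal sketch of what greedy (best-path) CTC decoding does (the standard collapse-repeats-then-drop-blanks rule); this is illustrative only, not the exact decoder in this repo:

```python
# Minimal sketch of greedy (best-path) CTC decoding: pick the argmax label at each
# frame, collapse consecutive repeats, then drop blanks. Illustrative only.
import numpy as np

def ctc_greedy_decode(log_probs, labels, blank_index=0):
    """log_probs: (time, num_labels) array; labels: list mapping index -> character."""
    best_path = np.argmax(log_probs, axis=1)            # most likely label per frame
    collapsed = [p for i, p in enumerate(best_path)     # collapse consecutive repeats
                 if i == 0 or p != best_path[i - 1]]
    return "".join(labels[p] for p in collapsed if p != blank_index)  # drop blanks
```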
Hope this helps! Best, Archiki
Thanks for your quick response and explanation!
I am trying `test_enhanced.py` for testing now, but the noise-related code (L36, L40-41, L90-91, L106-114, L154) still gives me error messages, so I commented those lines out and forced it to evaluate. If there is any problem, I'll comment in this issue again.
By the way, can I ask for the log of the training/development loss and the checkpoint after fine-tuning? I want to use it to compare with my experiment results.
Best, Cheng-Hung Hu.
I am not sure which lines you are referring to (L36 is the parser code for epochs), but one quick fix is to supply the SNR arguments while keeping `--noise-dir` empty or None and setting `--noise-prob 0`. This will not add any noise and the evaluation can proceed.
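For intuition, here is a rough, hypothetical sketch of how such a `noise_prob` setting typically gates noise injection in a data loader (not this repo's actual implementation); with `noise_prob = 0` or no noise directory, the audio passes through unchanged:

```python
# Rough, hypothetical sketch of how noise_prob typically gates noise injection in a
# data loader; not this repo's actual implementation.
import random

def maybe_add_noise(waveform, noise_injector, noise_prob):
    # With noise_prob = 0 (or noise_injector is None because no --noise-dir was given),
    # the waveform passes through untouched and evaluation runs on the audio as-is.
    if noise_injector is not None and random.random() < noise_prob:
        return noise_injector.inject(waveform)
    return waveform
```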
Hi @archiki,
I'm testing with `test_enhanced.py`, so for example, L36 means here. I didn't change the parameters related to audio_conf_noise, such as `--noise-prob` and `--noise-dir`, since `noise-prob` is set to 0 in your code and the target test set is already noisy.
Here's the process I used to test the noisy dataset:
I modified `test_enhanced.py` as I commented two days ago, where L154 is changed to `half=args.half, wer_dict=wer_dict)` (removing the `args.SNR` argument).
Then, after grouping the .wav files in the custom dataset (test_noisy_speech) by SNR and noise type, I tested them with the pretrained model, and I found that the WER and CER differ from the results shown in Table 2 of your paper. For instance, the table shows a WER of 35.0 for Car at 0 dB, but the WER I got is 45.683.
My questions are:
Best, Cheng-Hung Hu
Hey @n1243645679976,
I have applied the fix you suggested for L154 of `test_enhanced.py`. However, at my end, I am able to reproduce the results mentioned in the table. I am attaching an image of the command as well as the results generated. I hope this will give you some clarity.
So the answer to your question 1 is yes, it's the same model. The answer to your question 2 is difficult for me to give since I don't have access to your setup. However, I have provided the test set used, as you mentioned. I would recommend you double-check your manifest files to ensure that you have matched each audio file with the correct transcript text. You should also check whether you can re-create the clean WER of 10.3 using the standard libri-test-clean set.
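In case it helps, here is a small sanity-check sketch assuming the usual deepspeech.pytorch-style manifest format (one `wav_path,txt_path` line per utterance); the manifest file name below is just a placeholder, so adjust it to your setup:

```python
# Quick sanity check of a deepspeech.pytorch-style manifest: each line should be
# "wav_path,txt_path" and both files should exist. The format is an assumption;
# adjust if your manifest layout differs.
import os

def check_manifest(manifest_path):
    with open(manifest_path) as f:
        for line_no, line in enumerate(f, 1):
            wav_path, txt_path = line.strip().split(",")[:2]
            for path in (wav_path, txt_path):
                if not os.path.isfile(path):
                    print(f"line {line_no}: missing file {path}")

check_manifest("libri_test_clean_manifest.csv")  # placeholder file name
```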
> Then, after grouping the .wav files in the custom dataset (test_noisy_speech) by SNR and noise type, I tested them with
I am not sure what you mean by this; there is no need for you to group anything. As long as you add all the files in `test_noisy_speech` to the manifest appropriately, the testing code can group the files by noise type and SNR. Hope this helps.
Best, Archiki
Hi @archiki,
Thank you for running the experiment and the image is very helpful!
I can reproduce the experiment result now!
I found that the WER in your table and the WER in the row starting with `Test Summary` are derived differently, and this was the source of my confusion...
This is what happened:
First, I mistakenly assumed that the WER in your table and the WER in the row starting with `Test Summary` are derived in the same way.
I commented out the `print_summary` part in your code (L106-L114) because I thought that even if I grouped the wav files by SNR and noise type and tested only those, it would still give me the same result as in the table.
So, when I first grouped the wav files by Car and 0 dB and tested them, I could only get the WER in the row starting with `Test Summary`, which gave me a WER (32.56) different from yours (35.0), so I was confused and could not see where the difference came from.
Thanks for your help!
Best, Cheng-Hung Hu
Yes, the difference between the two is that under `Test Summary`, WER is calculated as [sum of all edit distances in the test set] / [sum of the lengths of all transcripts], instead of averaging the per-utterance ratios [edit distance / transcript length]. Hope that makes sense to you.
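For concreteness, here is a rough sketch (not the repo's exact code) of the two aggregations, using the `editdistance` package as a stand-in for whatever edit-distance routine the code uses:

```python
# Rough sketch of the two WER aggregations discussed above; not the repo's exact code.
# Uses the `editdistance` package (pip install editdistance) as a stand-in.
import editdistance

def corpus_wer(hyps, refs):
    """Test Summary style: sum of edit distances / sum of reference lengths."""
    total_edits = sum(editdistance.eval(h.split(), r.split()) for h, r in zip(hyps, refs))
    total_words = sum(len(r.split()) for r in refs)
    return 100.0 * total_edits / total_words

def mean_utterance_wer(hyps, refs):
    """Per-utterance style: average of (edit distance / reference length) over utterances."""
    ratios = [editdistance.eval(h.split(), r.split()) / len(r.split())
              for h, r in zip(hyps, refs)]
    return 100.0 * sum(ratios) / len(ratios)

# The two numbers generally differ: every utterance counts equally in the averaged
# version, regardless of its length.
hyps = ["the cat", "hello there friend"]
refs = ["the cat sat on the mat", "hello there friend"]
print(corpus_wer(hyps, refs), mean_utterance_wer(hyps, refs))
```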
Hi @archiki, I am trying to evaluate the model checkpoint you provided here by running the commands below.
The commands are based on the ones here, and the only difference between them is the testing script. All of them throw exceptions, and the error messages are:
Though I can modify the files to get past these exceptions, it would take time to find a way to reproduce the experiment results provided in the table...
So, my question is: is `test.py` the script to reproduce the experiment results in the table?
Best, Cheng-Hung Hu