Problem with decode result on SPGISpeech dataset

espnet / espnet

End-to-End Speech Processing Toolkit

https://espnet.github.io/espnet/

Apache License 2.0

Problem with decode result on SPGISpeech dataset #5897

Open Swagger-z opened 2 months ago

Swagger-z commented 2 months ago

Hi, I downloaded the pretrained model from https://zenodo.org/record/4585546 and ran inference with two different configs.

config1 (corresponds to decode_baseline):

lm_weight: 0.0
beam_size: 20
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.3

config2 (corresponds to decode_baseline_wi_elm):

lm_weight: 0.3
beam_size: 20
penalty: 0.0
maxlenratio: 0.0
minlenratio: 0.0
ctc_weight: 0.3

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_baseline/val | 39341 | 946469 | 95.5 | 2.7 | 1.8 | 1.1 | 5.7 | 58.1 |
| decode_baseline_wi_elm/val | 39341 | 946469 | 94.7 | 2.2 | 3.1 | 0.9 | 6.2 | 57.5 |

The performance gets worse after the external language model is integrated, and it is much worse than the results reported in the official recipe, which are:

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_asr_lm_lm_train_lm_en_bpe5000_valid.loss.ave_asr_model_valid.acc.ave/dev_4k | 4000 | 95401 | 98.2 | 1.3 | 0.5 | 0.4 | 2.2 | 32.5 |

Is there anything wrong?

sw005320 commented 2 months ago

LM shallow fusion does not always improve performance. You could still tune lm_weight and might find a better operating point.
Also, you seem to be comparing different test sets (val vs. dev_4k).
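For context, here is a minimal sketch of how the weights in the two configs typically enter the beam-search score. It assumes the usual joint CTC/attention formulation with shallow fusion, where the attention decoder is weighted by 1 - ctc_weight. The function name `fused_score` and the log-probabilities are hypothetical toy values, not ESPnet code or output from this model.

```python
def fused_score(att_logp, ctc_logp, lm_logp, ctc_weight=0.3, lm_weight=0.0):
    """Weighted sum of log-probabilities (assumed scoring rule, for illustration)."""
    return (1.0 - ctc_weight) * att_logp + ctc_weight * ctc_logp + lm_weight * lm_logp


# Made-up scores for two competing hypotheses of the same utterance.
# The shorter one has a better LM score simply because it has fewer tokens.
hyps = {
    "long": {"att": -4.0, "ctc": -4.2, "lm": -8.0},
    "short": {"att": -4.3, "ctc": -4.5, "lm": -6.0},
}


def best(lm_weight):
    """Return the name of the hypothesis with the highest fused score."""
    return max(
        hyps,
        key=lambda h: fused_score(
            hyps[h]["att"], hyps[h]["ctc"], hyps[h]["lm"], lm_weight=lm_weight
        ),
    )


# Sweep lm_weight to see where the preferred hypothesis changes.
for w in (0.0, 0.1, 0.3):
    print(f"lm_weight={w}: best hypothesis = {best(w)}")
```

In this toy case the preferred hypothesis flips to the shorter one once lm_weight is large enough. That might be related to the rise in deletions (1.8 to 3.1) in the decode_baseline_wi_elm row, but it is only a guess until lm_weight and penalty are actually swept on the val set.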

Swagger-z commented 2 months ago

Sorry, I made a mistake. The reported result for the val set in the official recipe is:

| dataset | Snt | Wrd | Corr | Sub | Del | Ins | Err | S.Err |
|---|---|---|---|---|---|---|---|---|
| decode_asr_lm_lm_train_lm_en_bpe5000_valid.loss.ave_asr_model_valid.acc.ave/val | 39341 | 946469 | 98.1 | 1.3 | 0.5 | 0.4 | 2.3 | 33.6 |

This is a bit confusing: I just reproduced the decoding process using the pretrained model without any changes, yet the WER is much higher.
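As a quick sanity check on the numbers quoted in this thread, the sketch below recomputes Err as Sub + Del + Ins for the three val rows and prints the gap between the reproduced baseline and the reported result. The row labels are shortened for readability; the values are copied from the tables above, and small 0.1 differences are expected because each column is rounded separately.

```python
# (Sub, Del, Ins, Err) percentages copied from the val rows in this thread.
rows = {
    "decode_baseline/val (reproduced)": (2.7, 1.8, 1.1, 5.7),
    "decode_baseline_wi_elm/val (reproduced)": (2.2, 3.1, 0.9, 6.2),
    "official recipe/val (reported)": (1.3, 0.5, 0.4, 2.3),
}

# Err should equal Sub + Del + Ins up to per-column rounding.
totals = {}
for name, (sub, dele, ins, err) in rows.items():
    totals[name] = round(sub + dele + ins, 1)
    print(f"{name}: Sub+Del+Ins = {totals[name]:.1f}, listed Err = {err:.1f}")

# Absolute WER gap between the reproduced baseline and the reported result.
gap = round(5.7 - 2.3, 1)
print(f"absolute WER gap (baseline vs. reported): {gap:.1f} points")
```

The listed Err values are consistent with their components, so the gap is not a scoring-table typo; all three error types are higher in the reproduced runs.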

sw005320 commented 2 months ago

I see. Thanks for pointing it out. Did you run it with exactly the same decoding configuration, especially regarding the normalization?

Swagger-z commented 2 months ago

Yes, that's it. I ran the normalized-text model (normalized text, bpe 5000 (asr_train_asr_conformer6_n_fft512_hop_length256_raw_en_bpe5000)) with the decoding configuration provided in the official spgispeech recipe.

sw005320 commented 2 months ago

Can you attach the results (results.txt generated by score.sh)? I want to check whether the normalization and so on are handled correctly.
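To illustrate why normalization matters for scoring, here is a small self-contained sketch. The `normalize` rule (lowercase, strip punctuation) and the example sentences are hypothetical and are not the recipe's actual normalization or SPGISpeech data; `word_errors` is a plain word-level edit distance, not the scoring tool that score.sh invokes.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation (hypothetical rule, for illustration only)."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9' ]+", " ", text)
    return text.split()


def word_errors(ref, hyp):
    """Word-level Levenshtein distance, i.e. Sub + Del + Ins."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]


# Made-up example: an unnormalized reference against a normalized hypothesis.
ref_raw = "Thanks, John. Revenue grew."
hyp = "thanks john revenue grew"

print("errors without normalization:", word_errors(ref_raw.split(), hyp.split()))
print("errors with normalization:   ", word_errors(normalize(ref_raw), normalize(hyp)))
```

If the reference and hypothesis texts were normalized differently, every affected word would count as an error even when the recognition itself is correct, which is why the results.txt alignment is worth inspecting.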

Swagger-z commented 2 months ago

Sure, I have just uploaded the results.txt (attachment: result.txt). Please check it.

sw005320 commented 2 months ago

Thanks a lot! I could not find a specific pattern in the recognition results, and it may take time to debug... One possible reason is that some compatibility has been lost between versions. Could you tell me which versions of espnet and pytorch you used?
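One way to collect the version information asked for here is to query the installed package metadata, as in the sketch below. It uses only the standard library; the helper name `installed_version` is hypothetical, and "espnet" and "torch" are assumed to be the distribution names installed in the environment.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string, or a placeholder if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return "not installed"


# Report the packages relevant to this issue.
for name in ("espnet", "torch"):
    print(f"{name}: {installed_version(name)}")
```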

Swagger-z commented 2 months ago

Of course, sir. I use espnet-v.202310 and torch 1.13.1. I have actually tried comparing the source code of espnet-v.0.9.8 (which I guess is the version used for the pretrained model) with my version, but found nothing...

sw005320 commented 2 months ago

I see. This is what I wanted to ask. One of my concerns is the positional embedding part. This is tricky and has caused some confusion in the past.

@pengchengguo, do you have any idea what might cause this degradation?

sw005320 commented 2 months ago

@Swagger-z, I know it is a lot to ask, but by any chance, can you run it with some older versions of espnet? I'm not sure that we can use espnet-v.0.9.8 anymore (it might be possible only for inference).

pengchengguo commented 2 months ago

Understood, I will attempt to replicate the error first.