OpenMOSS / AnyGPT

Code for "AnyGPT: Unified Multimodal LLM with Discrete Sequence Modeling"

Regarding ASR testing #40

Open Simplesss opened 3 months ago

Simplesss commented 3 months ago

Hello, thank you very much for your work. I would like to reproduce the ASR performance of the AnyGPT base model on the LibriSpeech test-clean set. Your paper reports a WER of 8.5, but my test result was 14.5 (using the command format `speech|text|{speech file path}`). I am therefore wondering whether the gap is caused by a prompt being randomly selected for each ASR inference. If possible, could you share the code you used to calculate WER (I used a jiwer Compose of 7 transformations for the calculation), as well as the model's ASR transcripts? Looking forward to your reply.
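For context, here is a minimal sketch of how such a jiwer-based WER computation might look, assuming references and hypotheses are plain-text files with one utterance per line; the file names and the exact transform list are placeholders, not the authors' script:

```python
import jiwer

# Chain of text normalizations applied to both sides before scoring.
# The exact transforms chosen here are an assumption; each one ships with jiwer.
normalize = jiwer.Compose([
    jiwer.ToLowerCase(),
    jiwer.ExpandCommonEnglishContractions(),  # "you're" -> "you are"
    jiwer.RemovePunctuation(),
    jiwer.RemoveMultipleSpaces(),
    jiwer.Strip(),
])

# Hypothetical file names: one transcript per line, aligned by index.
with open("references.txt") as f:
    refs = [normalize(line) for line in f]
with open("hypotheses.txt") as f:
    hyps = [normalize(line) for line in f]

print(f"WER: {jiwer.wer(refs, hyps):.3f}")
```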

JunZhan2000 commented 2 months ago

Hello, I don't think it's an issue with the prompt; each prompt was seen many times during training. I would like to confirm two things. First, are you using beam search as your decoding strategy? It generally produces the best results. Second, you need to post-process the transcripts to standardize them, because the LLM's output format differs a lot from the ground truth, including punctuation and contractions such as "you're", which appears as "you are" in the ground truth. I also used jiwer to calculate WER. As for the test code, unfortunately it was lost during an environment migration, but if you ask GPT to write some standardization code, you should be able to reproduce the results in the paper (I didn't handle every standardization case).
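In case it helps later readers, a rough sketch of the two suggestions above; the contraction table is illustrative and the `num_beams` value is an assumption, not the setting used for the paper:

```python
import re
import jiwer

# Point 1: beam search decoding, assuming a Hugging Face-style generate API.
# `model`, `tokenizer`, and `num_beams=5` are placeholders for your own setup.
# output_ids = model.generate(input_ids, num_beams=5, max_new_tokens=256)

# Point 2: standardize LLM output toward LibriSpeech-style ground truth.
# This table covers only a few contractions, mirroring the caveat above
# that not every standardization case was handled.
CONTRACTIONS = {
    "you're": "you are",
    "i'm": "i am",
    "it's": "it is",
    "don't": "do not",
    "can't": "cannot",
}

def standardize(text: str) -> str:
    text = text.lower()
    for short, full in CONTRACTIONS.items():
        text = text.replace(short, full)
    text = re.sub(r"[^\w\s']", "", text)  # drop punctuation, keep apostrophes
    return re.sub(r"\s+", " ", text).strip()

ref = "YOU ARE WELCOME SIR"
hyp = "You're welcome, sir."
print(jiwer.wer(standardize(ref), standardize(hyp)))  # 0.0 after standardization
```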