NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

WER.update doesn't work #8585

Closed yuntang closed 8 months ago

yuntang commented 8 months ago

Describe the bug

As shown in https://github.com/NVIDIA/NeMo/blob/main/nemo/collections/asr/metrics/wer.py#L349-L350, the new scores and words are assigned directly to the object, so the previous scores and words are dropped. We could either:

1. rename this function to `WER.set`, or
2. update the code to accumulate:

```python
self.scores += torch.tensor(scores, device=self.scores.device, dtype=self.scores.dtype)
self.words += torch.tensor(words, device=self.words.device, dtype=self.words.dtype)
```

The current code can lead to inconsistent WER reports between training and inference when `fuse_loss_wer` is enabled in Transducer model training, i.e., `model.joint.fuse_loss_wer=True` and `model.joint.fused_batch_size > 1`. In this setting, only the WER of the last sub-mini-batch is accumulated during the validation stage.

titu1994 commented 8 months ago

Thank you very much for raising this! We have fixed it in this PR: https://github.com/NVIDIA/NeMo/pull/8587. It occurred due to a large refactor and unification of metrics in ASR, done to make them simpler to extend in the long run.

The patch will be in the next NeMo release, and we have added a release note on the 1.23 release page (https://github.com/NVIDIA/NeMo/releases/tag/v1.23.0) so that users are aware and can obtain correct metrics during evaluation by using the speech-to-text eval script (or by disabling fused batching explicitly).

titu1994 commented 8 months ago

Fixed via #8587