Closed huks0 closed 2 months ago
Just saw this. Citrinet is a CTC model — did you check if your audio, after 8x downsampling, is shorter than the transcript in subword tokens? That is often the reason for dropped words, especially because German transcripts are usually verbose and contain long words.
I'd suggest trying a Fast Conformer transducer instead; the 105M model should match the performance and memory of Citrinet 512 quite easily.
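The CTC constraint mentioned above can be checked per sample. A minimal sketch — the 0.01 s window stride and 8x downsampling are assumptions based on typical Citrinet configs, and the function name is hypothetical:

```python
import math

def ctc_length_ok(duration_s: float, n_transcript_tokens: int,
                  window_stride_s: float = 0.01, downsampling: int = 8) -> bool:
    """CTC loss requires at least as many encoder output frames as
    target tokens; otherwise the alignment is infeasible and words drop.

    Assumes a 0.01 s feature window stride and 8x encoder downsampling
    (typical for Citrinet); adjust to match your actual config.
    """
    # number of feature frames, then encoder frames after downsampling
    feature_frames = math.floor(duration_s / window_stride_s)
    encoder_frames = feature_frames // downsampling
    return encoder_frames >= n_transcript_tokens

# A 2 s clip yields 200 feature frames -> 25 encoder frames,
# so a transcript of more than 25 subword tokens cannot be aligned.
```

Count the transcript length with the same subword tokenizer the model uses, not whitespace words, since German compounds often split into many subwords.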
Describe the bug
Currently I am training a Citrinet-512 model. I copied the config from the example configs here without changing it (https://github.com/NVIDIA/NeMo/tree/main/examples/asr/conf/citrinet). After detecting the issue, I also used another config from Hugging Face where someone fine-tuned a Citrinet model (https://huggingface.co/neongeckocom/stt_de_citrinet_512_gamma_0_25). Both configs lead to the same problem: a lot of sentences are cut off randomly during prediction, e.g.
This happens after just a few epochs (2) and does not vanish even after 90 epochs. It is not a dataset-specific issue but occurs randomly across several datasets. It does not happen for every sample, but for a relevant share of the data, during both training and evaluation. I tried to figure out what the problem relates to and whether any parameter could solve it, but could not detect where this issue comes from.
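To check whether the affected samples are the ones whose transcripts are too long for the audio, the training manifests can be scanned. A sketch assuming the standard NeMo JSON-lines manifest format (`audio_filepath`, `duration`, `text` keys); the 12.5 tokens/s limit derives from an assumed 0.01 s window stride and 8x downsampling, and whitespace words are only a crude stand-in for subword tokens:

```python
import json

def find_suspect_samples(manifest_path: str,
                         tokens_per_second_limit: float = 12.5) -> list:
    """Return audio files whose transcripts may exceed the CTC
    alignment capacity: 12.5 = (1 / 0.01 s stride) / 8x downsampling.
    """
    suspects = []
    with open(manifest_path) as f:
        for line in f:
            entry = json.loads(line)
            # crude proxy: whitespace words; for a real check, tokenize
            # with the model's subword tokenizer instead
            n_tokens = len(entry["text"].split())
            if n_tokens > entry["duration"] * tokens_per_second_limit:
                suspects.append(entry["audio_filepath"])
    return suspects
```

If the cut predictions cluster on the files this flags, the length constraint rather than the training setup is the likely culprit.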
Steps/Code to reproduce bug
Here is the config used:
Expected behavior
I expect training to predict the sentences correctly. For many sentences it works; for some, the predictions are simply cut off even though the rest was well recognized. I believe this affects the loss and the WER, and hence it is hard to judge how good the model actually is.
Environment overview (please complete the following information)
On Azure I set up an environment and trained multi-GPU. NeMo is pip-installed via nemo_toolkit==1.21.0.
Environment details
tensorflow-2.8-cuda11, python=3.8, torch=2.3.0
Additional context