NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

How to use a pre-trained model for a cache-aware FastConformer-Hybrid model? #9502

Open sangeet2020 opened 1 week ago

sangeet2020 commented 1 week ago

Hi @titu1994,

Following our discussion in this thread, I’m training a cache-aware FastConformer hybrid CTC-RNNT model for German using 1.2K hours of audio data. Despite training for 150 epochs, my validation WER is still around 0.28.

I suspect the dataset quality might be an issue. I reviewed the paper "Stateful Conformer with Cache-Based Inference for Streaming ASR" and noted the significant performance achieved even with training from scratch on LibriSpeech.

Since you recommended using a pre-trained model, I tried using this model from Hugging Face, but it's not a streaming model. Is it still viable as a pre-trained model for my use case, or are there other German models available that you would recommend?

Thank you for your guidance!

titu1994 commented 1 week ago

You can use this model, which is a chunk-aware model - https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi
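
For reference, loading that checkpoint usually looks like the following. This is a minimal sketch, not something from this thread: it assumes a recent NeMo release (`pip install "nemo_toolkit[asr]"`), and the audio path is a placeholder.

```python
# Minimal sketch: load the chunk-aware streaming checkpoint with NeMo.
import nemo.collections.asr as nemo_asr

# ASRModel.from_pretrained resolves the model-card name to the right class
# (a hybrid CTC-RNNT BPE model for this checkpoint).
asr_model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Quick offline sanity check; "sample.wav" is a placeholder path.
hypotheses = asr_model.transcribe(["sample.wav"])
print(hypotheses)
```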

sangeet2020 commented 1 week ago

Thank you @titu1994, I will try it. But this model has been trained on a labeled English dataset. I want to understand the logic: how would it adapt to another language?

titu1994 commented 1 week ago

It's a practical limitation. You can either use an ordinary FastConformer trained on German or a chunk-aware FastConformer trained on English. It depends on your priority - streaming or transcript accuracy. We have a tutorial showing cross-language transfer learning.
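
For the cross-language transfer itself, the usual NeMo fine-tuning recipe is to load the English streaming checkpoint, swap in a target-language tokenizer, point the data loaders at the new manifests, and fine-tune. A rough sketch is below; the German tokenizer directory, manifest paths, batch size, and epoch count are all placeholder assumptions, not values from this thread.

```python
# Rough sketch of cross-language fine-tuning: English streaming checkpoint -> German.
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="nvidia/stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Replace the English BPE tokenizer with a German one (directory is a placeholder;
# a tokenizer can be built with scripts/tokenizers/process_asr_text_tokenizer.py).
model.change_vocabulary(
    new_tokenizer_dir="tokenizers/de_bpe_1024",
    new_tokenizer_type="bpe",
)

# Point the data loaders at the German manifests (paths and batch size are placeholders).
model.setup_training_data(
    train_data_config={
        "manifest_filepath": "manifests/de_train.json",
        "sample_rate": 16000,
        "batch_size": 16,
        "shuffle": True,
    }
)
model.setup_validation_data(
    val_data_config={
        "manifest_filepath": "manifests/de_dev.json",
        "sample_rate": 16000,
        "batch_size": 16,
        "shuffle": False,
    }
)

# Fine-tune; epoch count and device settings are illustrative only.
trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)
trainer.fit(model)
```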