sangeet2020 opened 1 week ago
You can use this model, which is a chunk-aware model: https://huggingface.co/nvidia/stt_en_fastconformer_hybrid_large_streaming_multi
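For context on what "chunk-aware" means here: the streaming encoder consumes audio in fixed-size chunks plus a bounded cache of left context, never the future of the signal. A minimal, library-free sketch of that chunking pattern (pure Python; the chunk and context sizes are illustrative, not the model's actual values):

```python
def chunk_stream(samples, chunk_size, left_context):
    """Yield (cache, chunk) pairs the way a cache-aware streaming
    encoder consumes audio: each new chunk together with a bounded
    window of cached left context, and no lookahead into the future."""
    cache = []
    for start in range(0, len(samples), chunk_size):
        chunk = samples[start:start + chunk_size]
        yield list(cache), chunk
        # keep only the most recent `left_context` samples as cache
        cache = (cache + chunk)[-left_context:]

# toy "audio" of 10 samples
audio = list(range(10))
steps = list(chunk_stream(audio, chunk_size=4, left_context=3))
# steps[1] is (cache=[1, 2, 3], chunk=[4, 5, 6, 7])
```

The trade-off titu1994 mentions falls out of this structure: a model trained to see only a chunk and a short cache loses some accuracy versus a full-context (offline) encoder, but can emit transcripts with bounded latency.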
Thank you, @titu1994. I will try it. But this model has been trained on a labeled English dataset. I want to understand the logic: how would it adapt to any other language?
It's a practical limitation. You can either take an ordinary FastConformer in German or a chunk-aware Conformer in English; it depends on whether your priority is streaming or transcript accuracy. We have a tutorial showing cross-language transfer.
Hi @titu1994,
Following our discussion in this thread, I'm training a cache-aware FastConformer hybrid CTC/RNNT model for German on 1.2K hours of audio. Despite training for 150 epochs, my validation WER is still around 0.28.
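For reference on the metric: WER = (substitutions + deletions + insertions) / number of reference words, i.e. a word-level edit distance normalized by the reference length, so 0.28 means roughly 28 errors per 100 reference words. A minimal self-contained implementation (the example sentences are hypothetical, not from the training data):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level Levenshtein distance between
    reference and hypothesis, divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / len(ref)

# one dropped word out of four reference words -> WER 0.25
print(wer("das ist ein test", "das ist test"))  # → 0.25
```

In practice NeMo reports this metric during validation; the sketch is just to make the 0.28 figure concrete.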
I suspect dataset quality might be an issue. I reviewed the paper "Stateful Conformer with Cache-Based Inference for Streaming ASR" and noted the strong results reported even when training from scratch on LibriSpeech.
Since you recommended using a pre-trained model, I tried this model from Hugging Face, but it's not a streaming model. Is it still viable as an initialization for my use case, or are there other German models you would recommend?
Thank you for your guidance!