NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Canary-1b on long audio file. #10487

Open RobinM30 opened 1 month ago

RobinM30 commented 1 month ago

Hi, I am trying to get canary-1b working on long audio files, taking inspiration from this script: https://github.com/NVIDIA/NeMo/blob/b5798ded9f27168db9d7d77cbe4f9da80bf49268/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py#L19. But the results are not convincing compared with what Whisper can do. It seems that the script just chunks the audio file and does not share context between chunks to help the model transcribe. Moreover, it does not include a VAD model, and timestamping the segments seems difficult. Has anyone succeeded in getting something close to the Whisper output, with better quality? Thanks
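
To make the chunking and timestamp concern concrete, here is a minimal sketch in plain Python of splitting a long recording into overlapping windows and mapping a chunk-local time back to the full recording. It is not NeMo code and does not reflect what the linked script does: the function names `chunk_windows` and `to_global_time`, and the 30 s chunk / 5 s overlap values, are assumptions chosen only for illustration.

```python
from typing import List, Tuple


def chunk_windows(
    total_sec: float,
    chunk_sec: float = 30.0,   # assumed chunk length, illustrative only
    overlap_sec: float = 5.0,  # assumed overlap between neighbours
) -> List[Tuple[float, float]]:
    """Return (start, end) windows in seconds that cover [0, total_sec]."""
    if chunk_sec <= 0 or not 0 <= overlap_sec < chunk_sec:
        raise ValueError("need chunk_sec > 0 and 0 <= overlap_sec < chunk_sec")
    step = chunk_sec - overlap_sec
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        windows.append((start, end))
        if end >= total_sec:
            break  # the last window reached the end of the audio
        start += step
    return windows


def to_global_time(window_start: float, local_sec: float) -> float:
    """Convert a time measured inside a chunk to a time in the full file."""
    return window_start + local_sec


if __name__ == "__main__":
    for start, end in chunk_windows(70.0):
        print(f"{start:.1f}s to {end:.1f}s")
```

Each window would be transcribed separately, and the overlapping regions would then need to be merged or de-duplicated. That merge step is not shown here.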

xdevfaheem commented 3 weeks ago

Any leads?

nithinraok commented 3 weeks ago

Hi, thanks for the question and interest. We are working on improving the performance of Canary on long-form audio. We also plan to release the model with timestamp generation soon.