Canary-1b on long audio file.

RobinM30 commented 1 month ago

Hi, I am trying to get canary-1b working on long audio files, taking inspiration from this script: https://github.com/NVIDIA/NeMo/blob/b5798ded9f27168db9d7d77cbe4f9da80bf49268/examples/asr/asr_chunked_inference/aed/speech_to_text_aed_chunked_infer.py#L19. But the results are not convincing compared with what Whisper can do. It seems that the script just chunks the audio file and does not share context between chunks to help the model transcribe. Moreover, it does not include a VAD model, and it seems difficult to timestamp the segments. Has anyone succeeded in getting something close to the Whisper output, with better quality? Thanks
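To make the chunking and timestamp problem concrete, here is a minimal sketch in plain Python. It is not the NeMo script's code and calls no NeMo API: `chunk_spans` and `to_global` are hypothetical helper names, and the 30 s chunk length and 5 s overlap are arbitrary assumptions. It only shows how overlapping chunk boundaries could be computed and how a chunk-relative time could be mapped back to the full recording.

```python
from typing import List, Tuple


def chunk_spans(total_sec: float,
                chunk_sec: float = 30.0,    # assumed chunk length, not a NeMo default
                overlap_sec: float = 5.0    # assumed overlap between neighbours
                ) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds of overlapping chunks covering the audio."""
    if chunk_sec <= 0 or not 0 <= overlap_sec < chunk_sec:
        raise ValueError("need chunk_sec > 0 and 0 <= overlap_sec < chunk_sec")
    step = chunk_sec - overlap_sec          # how far each new chunk advances
    spans: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_sec:
        end = min(start + chunk_sec, total_sec)
        spans.append((start, end))
        if end >= total_sec:                # last chunk reached the end of the file
            break
        start += step
    return spans


def to_global(chunk_start: float, local_start: float, local_end: float) -> Tuple[float, float]:
    """Shift a segment's chunk-relative times to times in the full recording."""
    return (chunk_start + local_start, chunk_start + local_end)


if __name__ == "__main__":
    for span in chunk_spans(70.0):
        print(span)
```

The overlapping region is where a stitching step would have to reconcile the two transcripts. The sketch does not attempt that, nor does it pass any text context from one chunk to the next.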

xdevfaheem commented 3 weeks ago

Any leads?

nithinraok commented 3 weeks ago

Hi, thanks for the question and interest. We are working on improving the performance of Canary on long-form audio. We also plan to release the model with timestamp generation soon.

Canary-1b on long audio file. #10487