NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

conformer transducer timestamp extraction #5896

Closed: Khimer closed this issue 1 year ago

Khimer commented 1 year ago

Good day! Thanks for your hard work! I am trying to extract timestamps from the stt_en_conformer_transducer_large model, following https://colab.research.google.com/github/NVIDIA/NeMo/blob/stable/tutorials/asr/ASR_with_Transducers.ipynb#scrollTo=xkv_x8NAfpX3, but I'm having difficulty converting timesteps to timestamps. The tutorial says: "Note that each timestep here is (roughly) timestep∗total_stride_of_model∗preprocessor.window_stride seconds timestamp". I can extract "timestep" from the hypotheses, but model.preprocessor doesn't have a window_stride field, and I'm not sure which value to use here. I also couldn't figure out where the "total_stride_of_model" value comes from. Do I understand correctly that total_stride_of_model == len(audio) / window_stride? Thank you!

titu1994 commented 1 year ago

model.cfg.preprocessor has those fields, including window_stride.

titu1994 commented 1 year ago

Total stride is an inherent property of the model architecture; no field in the config mentions it. Conformer and Squeezeformer have 4x stride, Citrinet has 8x stride, and QuartzNet and Jasper have 2x stride.
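
For anyone landing here later, the conversion discussed above can be sketched roughly as follows. This is only an illustration of the formula quoted from the tutorial (timestep * total_stride_of_model * window_stride), using the stride values listed in this thread. The helper names and the TOTAL_STRIDE table are hypothetical and not part of NeMo, and the window_stride value of 0.01 is an assumption; read the real one from model.cfg.preprocessor.window_stride.

```python
# Total stride per encoder family, as listed in this thread.
# This lookup table is a hypothetical helper, not a NeMo API.
TOTAL_STRIDE = {
    "conformer": 4,
    "squeezeformer": 4,
    "citrinet": 8,
    "quartznet": 2,
    "jasper": 2,
}


def timestep_to_seconds(timestep, total_stride, window_stride):
    """Approximate time in seconds for one encoder timestep."""
    return timestep * total_stride * window_stride


def timesteps_to_seconds(timesteps, encoder, window_stride):
    """Convert a list of timesteps for the given encoder family."""
    stride = TOTAL_STRIDE[encoder.lower()]
    return [timestep_to_seconds(t, stride, window_stride) for t in timesteps]


# Assumed value for illustration only; with a loaded model, use
# model.cfg.preprocessor.window_stride instead.
window_stride = 0.01

# Example timesteps as they might appear in a hypothesis (made-up values).
print(timesteps_to_seconds([0, 25, 50], "conformer", window_stride))
```

The result is approximate, as the tutorial itself notes ("roughly"), so treat the output as a coarse alignment rather than an exact boundary.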

Khimer commented 1 year ago

This helped, thanks a lot!