NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

[Question] Citrinet and LM for online ASR #2290

Closed ValeryNikiforov closed 3 years ago

ValeryNikiforov commented 3 years ago

Hello, thank you for the great toolkit, tutorials, and models. I have some questions:

  1. I want to use a pretrained Citrinet model in the Online_ASR_Microphone_Demo notebook instead of QuartzNet. I changed the normalization to 'per_feature' and initialized `EncDecCTCModelBPE`. What else do I need to change for the model to work correctly?

  2. Do you plan to release a tutorial for online ASR with LM for Citrinet or QuartzNet?

titu1994 commented 3 years ago

Per-feature normalization will not be sufficient over small buffer sizes; that's why fixed mean and std values were used for QuartzNet. You can try it, but I would not expect great results.
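The difference between the two normalization modes can be illustrated with a toy sketch (this is not NeMo code; the feature values and helper names are invented for illustration):

```python
# Toy illustration: per-buffer ("per_feature") statistics vs. fixed
# dataset-level statistics. All numbers here are made up.

def per_feature_normalize(feats):
    """Normalize each feature row by mean/std computed from the buffer itself."""
    out = []
    for row in feats:
        mean = sum(row) / len(row)
        var = sum((x - mean) ** 2 for x in row) / len(row)
        std = var ** 0.5 or 1.0  # guard against zero std
        out.append([(x - mean) / std for x in row])
    return out

def fixed_normalize(feats, means, stds):
    """Normalize with precomputed, dataset-level constants (what the demo
    notebook does for QuartzNet with its fixed mean/std tensors)."""
    return [[(x - m) / s for x in row] for row, m, s in zip(feats, means, stds)]

# With only a few frames in a streaming buffer, the buffer statistics are
# dominated by whatever happens to be in the window, so per-feature output
# drifts from what the model saw at training time; fixed stats are stable.
buffer = [[0.1, 0.1, 0.1, 0.2]]  # one feature channel, 4 frames
print(per_feature_normalize(buffer))
print(fixed_normalize(buffer, [0.0], [1.0]))
```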

There are scripts for LM usage with both those types of models - https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html

Neural rescoring will also be added in the near future.

ValeryNikiforov commented 3 years ago

@titu1994 Thank you. I think I can't use the original mean and std constants because of the shape mismatch ([1, 64] in the notebook, while Citrinet requires [1, 80]). Can you share fixed mean and std constants for the Citrinet models (in case you have them)?

Also, I ran into a problem when modifying the notebook for Citrinet. After changing `asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('QuartzNet15x5Base-En')` (which works fine with per-feature normalization) to `asr_model = nemo_asr.models.ctc_bpe_models.EncDecCTCModelBPE.from_pretrained('stt_en_citrinet_256')` (still using per-feature normalization), the output text is always empty (and `logits.shape[0]` is 0). Can you please give me some advice? What could I have missed?

titu1994 commented 3 years ago

We haven't precomputed these values for the larger datasets that we currently train on.

I don't know the timeline, but @jbalam-nv is working on a streaming ASR notebook, which mostly doesn't require this precalculated normalization tensor.

`logits.shape[0]` corresponds to the batch dimension; how is the batch dim 0?

ValeryNikiforov commented 3 years ago

@titu1994 Ok, thank you. I think I found the problem: after loading Citrinet instead of QuartzNet, `timestep_duration` is very large (327.68, which is strange), and `n_timesteps_overlap = int(frame_overlap / timestep_duration) - 2` equals -2. As a result, `logits[self.n_timesteps_overlap:-self.n_timesteps_overlap]` is empty when I call `_greedy_decoder`.

The `n_timesteps_overlap` calculation takes its values from the Citrinet config. I will try to fix that problem. Everything works fine when I switch back to QuartzNet.
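The empty-output symptom can be reproduced with plain slice arithmetic: once the overlap count goes negative, Python's slicing silently returns an empty list. A small sketch (the constants are illustrative, not the notebook's actual config values):

```python
# Reproduce the empty-slice bug with stand-in numbers.
frame_overlap = 2.0  # seconds of context on each side (illustrative)

def n_timesteps_overlap(frame_overlap, timestep_duration):
    # Same formula as in the demo notebook.
    return int(frame_overlap / timestep_duration) - 2

logits = list(range(200))  # stand-in for a sequence of logit frames

# With a QuartzNet-like timestep duration, the count is positive and the
# central slice is non-empty.
n_q = n_timesteps_overlap(frame_overlap, 0.02)
assert n_q > 0 and len(logits[n_q:-n_q]) > 0

# With the mis-read Citrinet value of 327.68 s, int(2.0 / 327.68) == 0,
# so the count becomes -2 ...
n_c = n_timesteps_overlap(frame_overlap, 327.68)
assert n_c == -2
# ... and logits[-2:-(-2)] == logits[-2:2], which is an empty slice.
assert logits[n_c:-n_c] == []
```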

titu1994 commented 3 years ago

Citrinet performs 8x downsampling, whereas QuartzNet does 2x. That would be the source of the problem: the notebook was designed for the 2x stride and would need modifications to work with 8x.
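The effect of the downsampling factor on the per-timestep duration is simple arithmetic (the 10 ms feature hop below is a common NeMo preprocessor default; treat it as an assumption):

```python
# Assumed preprocessor hop: 10 ms of audio per feature frame.
window_stride = 0.01  # seconds

def timestep_duration(window_stride, downsampling):
    """Seconds of audio covered by one encoder output timestep."""
    return window_stride * downsampling

quartznet = timestep_duration(window_stride, 2)  # 2x downsampling -> 0.02 s
citrinet = timestep_duration(window_stride, 8)   # 8x downsampling -> 0.08 s

# Any buffer offset computed as int(frame_overlap / timestep_duration)
# shrinks 4x when moving from QuartzNet to Citrinet, so constants tuned
# for the 2x stride need to be revisited.
print(quartznet, citrinet)
```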

aayush6897 commented 3 years ago

@ValeryNikiforov did you succeed in running Citrinet over streaming?

ValeryNikiforov commented 3 years ago

@aayush6897 Yes, I decode small chunks of new audio data (plus 2-4 seconds of left-side context) and run beamsearch_ngram afterwards. For me, merging the text across chunks was the main problem. I switched to the ctcdecode decoder, and its output token timestamps helped me a lot.
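The timestamp-based merging described above can be sketched with plain data: given (token, timestamp) pairs from overlapping chunks, commit only tokens whose timestamps lie past the end of the previously committed region. This is a toy illustration with invented data, not ctcdecode's actual API:

```python
# Toy merge of overlapping chunk transcripts via token timestamps.
# Each chunk is a list of (token, time_in_seconds) pairs, as a
# timestamp-producing decoder can provide.

def merge_chunks(chunks):
    merged = []
    committed_until = -1.0
    for tokens in chunks:
        for token, t in tokens:
            if t > committed_until:  # skip tokens already covered by a prior chunk
                merged.append(token)
                committed_until = t
    return merged

chunk_a = [("he", 0.1), ("llo", 0.3), ("wor", 0.9)]
# chunk_b re-decodes the left-context overlap, then continues.
chunk_b = [("wor", 0.9), ("ld", 1.2)]
print("".join(merge_chunks([chunk_a, chunk_b])))  # helloworld
```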

arunvenkatesan-nv commented 3 years ago

A streaming tutorial that works with both Conformer-CTC and Citrinet is now in NeMo: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Streaming_ASR.ipynb

This is also available as a script for offline long-form audio decoding: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text_buffered_infer.py