Closed · ValeryNikiforov closed this issue 3 years ago
Per-feature normalization will not be sufficient over small buffer sizes, which is why a fixed mean and std were used for QuartzNet. You can try it, but I would not expect great results.
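As a rough illustration of why small buffers are a problem (a sketch with made-up numbers, not NeMo code): per-feature statistics estimated from a short streaming buffer are much noisier than statistics precomputed over a large amount of data, which is what the fixed mean/std constants stand in for.

```python
import numpy as np

# Illustrative sketch, not NeMo code. All numbers here are made up.
rng = np.random.default_rng(0)

true_mean, true_std = -5.0, 2.0  # pretend dataset-level log-mel statistics

# Simulated log-mel features, shape (n_mel_bins, n_frames)
utterance = rng.normal(true_mean, true_std, size=(64, 1000))
small_buffer = utterance[:, :20]  # a short streaming buffer

# Per-feature mean estimated from the small buffer vs. many frames
err_buffer = np.abs(small_buffer.mean(axis=1) - true_mean).mean()
err_full = np.abs(utterance.mean(axis=1) - true_mean).mean()

print(f"mean-estimate error, 20-frame buffer: {err_buffer:.3f}")
print(f"mean-estimate error, 1000 frames:     {err_full:.3f}")
# The small-buffer estimate is far noisier, which is why fixed
# precomputed mean/std values were used for streaming QuartzNet.
```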
There are scripts for LM usage with both of those model types: https://docs.nvidia.com/deeplearning/nemo/user-guide/docs/en/main/asr/asr_language_modeling.html
Neural rescoring will also be added in the near future.
@titu1994 Thank you. I think I can't use the original std and mean constants because of the shape mismatch ([1, 64] in the notebook, while Citrinet requires [1, 80]). Can you share fixed mean and std constants for Citrinet models (in case you have them)?
Also, I ran into a problem when modifying the notebook for Citrinet. After changing
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained('QuartzNet15x5Base-En')
(works ok with per feature normalization)
to
asr_model = nemo_asr.models.ctc_bpe_models.EncDecCTCModelBPE.from_pretrained('stt_en_citrinet_256')
(still using per feature)
the output text is always empty (and logits.shape[0] is 0).
Can you please give me some advice? What could I have missed?
We haven't precomputed these values for the larger datasets that we currently train on.
I don't know the timeline, but @jbalam-nv is working on a streaming ASR notebook, which mostly does not require this precalculated normalization tensor.
logits.shape[0] corresponds to the batch dimension; how is the batch dim 0?
@titu1994 Ok, thank you.
I think I found the problem: after loading Citrinet instead of QuartzNet, timestep_duration is very large (327.68, which is strange) and n_timesteps_overlap = int(frame_overlap / timestep_duration) - 2 equals -2.
As a result, logits[self.n_timesteps_overlap:-self.n_timesteps_overlap] is empty when I call _greedy_decoder.
The n_timesteps_overlap calculation takes its values from the Citrinet config. I will try to fix that problem.
Everything works fine when I switch back to QuartzNet.
Citrinet performs 8x downsampling, whereas QuartzNet does 2x. That would be the source of the problem: the notebook was designed to work with 2x downsampling and would need modifications to work with 8x.
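A quick sketch of how the downsampling factor drives the overlap arithmetic (illustrative numbers only; window_stride and the helper below are assumptions, not the notebook's exact code): each output timestep spans window_stride × downsampling seconds, so 8x downsampling shrinks the overlap timestep count fourfold compared to 2x, and a miscomputed timestep_duration like the 327.68 reported above pushes the count negative, which makes the logits slice empty.

```python
import numpy as np

# Illustrative sketch, not the notebook's exact code. Assumes each
# output timestep spans window_stride * downsampling seconds of audio.
window_stride = 0.01  # seconds per feature frame (a typical config value)

def n_timesteps_overlap(frame_overlap, downsampling):
    timestep_duration = window_stride * downsampling
    return int(frame_overlap / timestep_duration) - 2

print(n_timesteps_overlap(2.0, downsampling=2))  # QuartzNet-style: large, positive
print(n_timesteps_overlap(2.0, downsampling=8))  # Citrinet-style: roughly 4x fewer

# With the miscomputed timestep_duration of 327.68 s reported above,
# the overlap becomes negative and the logits slice is empty:
n = int(2.0 / 327.68) - 2
logits = np.zeros((10, 128))
print(n, logits[n:-n].shape[0])  # -2, and the slice has 0 rows
```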
@ValeryNikiforov did you succeed in running Citrinet over streaming?
@aayush6897 Yes, I decode small chunks of new audio data (plus 2-4 seconds of left-side context) and run beamsearch_ngram afterwards. For me, text merging was the main problem. I switched to the ctcdecode decoder, and its output token timestamps helped me a lot.
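The merging approach described here might be sketched roughly as follows (function and variable names are invented for illustration; this is not ctcdecode's API): decode each new chunk together with a couple of seconds of left context, then use the token timestamps to drop tokens that the previous chunk has already committed.

```python
# Hypothetical sketch of timestamp-based chunk merging; names are
# invented for illustration and do not come from ctcdecode or NeMo.

def merge_chunks(prev_tokens, new_tokens, boundary_time):
    """prev_tokens / new_tokens: lists of (token, start_time) pairs.
    boundary_time: absolute time where the previous chunk's committed
    text ends; new-chunk tokens before it are left context only."""
    kept = [tok for tok, t in new_tokens if t >= boundary_time]
    return [tok for tok, _ in prev_tokens] + kept

prev = [("hello", 0.0), ("world", 0.5)]
# The new chunk re-decodes its left context, so it re-emits "world";
# the timestamp check prevents duplicating it in the merged output.
new = [("world", 0.5), ("again", 1.2)]
merged = merge_chunks(prev, new, boundary_time=1.0)
print(merged)  # ['hello', 'world', 'again']
```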
A streaming tutorial that should work with both Conformer-CTC and Citrinet is now in NeMo: https://github.com/NVIDIA/NeMo/blob/main/tutorials/asr/Streaming_ASR.ipynb
It is also available as a script for offline long-form audio decoding: https://github.com/NVIDIA/NeMo/blob/main/examples/asr/speech_to_text_buffered_infer.py
Hello, thank you for the great toolkit, tutorials, and models. I have some questions:
I want to use a pretrained Citrinet in the Online_ASR_Microphone_Demo instead of QuartzNet. I changed normalization to 'per_feature' and initialized EncDecCTCModelBPE. What else do I need to change for the model to work correctly?
Do you plan to release a tutorial for online ASR with an LM for Citrinet or QuartzNet?