NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Working with Citrinet-1024 #2289

Closed: rose768 closed this issue 3 years ago

rose768 commented 3 years ago

Hello, thanks a lot for the powerful ASR toolkit. I am new to ASR. I recently started working with QuartzNet15x5 from this link: but I have a long-duration acoustic dataset, so I read the Citrinet paper at this link and tried to use Citrinet for speech recognition. I call the Citrinet-1024 model instead of QuartzNet like below: citrinet_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_citrinet_1024") but I could not find the config file for stt_en_citrinet_1024 in the NeMo GitHub repository; it only has configs for 384 and 512, according to this link.
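(For reference, the call above in a minimal runnable form; "sample.wav" is a placeholder audio path, and the transcribe argument name shown is the NeMo 1.0.x one.)

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained Citrinet-1024 checkpoint mentioned above
# (downloaded from NGC on first use).
citrinet_model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(
    model_name="stt_en_citrinet_1024"
)

# Inference on (long) audio files; "sample.wav" is a placeholder path.
transcripts = citrinet_model.transcribe(paths2audio_files=["sample.wav"], batch_size=1)
print(transcripts)
```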

Would you please explain to me: 1) What do the filters in each convolution block of Citrinet do, for instance in Citrinet-1024? What is the benefit of more filters? 2) How does Citrinet work, and why does it use tokenization? 3) How should I use Citrinet-1024 (implementation), and how should I set the parameters in its config file?

Environment overview

Environment location: Google Colab

Method of NeMo install: !pip install nemo_toolkit[asr]

NeMo version: 1.0.0

Learning Rate: 1e-3

Environment details

OS version: "Ubuntu 20.04.3 LTS", PyTorch version: "1.7.1"

titu1994 commented 3 years ago

Do you wish to train Citrinet 1024 from scratch? Or simply fine-tune?

During training, samples cannot be longer than roughly 16-20 seconds each, no matter the model. It is only during inference that you can pass much longer audio samples.

The config for 1024 is not shared since it's described in the paper; the only things that need to change in the provided 512 config are the number of filters and the dropout (set it to 0.1). I can share one in this thread for clarity if needed.

To answer your questions -

1) Filters in a CNN scale the width of the model, effectively increasing its capacity through a larger number of parameters. We see significantly improved WER (roughly 4-6%) when moving from 256 filters to 1024.

2) This tutorial explains how tokenization helps ASR models - https://colab.research.google.com/github/NVIDIA/NeMo/blob/v1.0.0/tutorials/asr/08_ASR_with_Subword_Tokenization.ipynb

3) The model config is provided inside the instantiated model - once you initialize the model, you can do cfg = model.cfg to extract the final model config after training. You can print it out, modify it, and then create a new model with the modified config if needed (see the sketch below).

The training and other config info are available in the paper and in the configs provided under the examples/asr/conf/citrinet path.
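For clarity, here is a rough sketch of point 3, combined with the 512-to-1024 changes described above. It assumes the standard Citrinet YAML layout in which each encoder block under encoder.jasper carries filters and dropout fields; verify the exact field names against the printed config before relying on them.

```python
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Instantiate a pretrained model and pull out its final config.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_citrinet_512")
cfg = model.cfg
print(OmegaConf.to_yaml(cfg))  # inspect or copy the full model config

# Example modification along the lines described above: set dropout to 0.1
# on every encoder block (assumption: blocks live under cfg.encoder.jasper
# and expose `dropout`/`filters` keys, as in the released 512 config).
for block in cfg.encoder.jasper:
    block.dropout = 0.1
    # block.filters would likewise be scaled from 512 to 1024 per the paper;
    # note that the first and last blocks may use different widths, so check
    # the paper before changing them blindly.

# A new model can then be built from the modified config
# (trainer is a pytorch_lightning.Trainer you create separately):
# new_model = nemo_asr.models.EncDecCTCModelBPE(cfg=cfg, trainer=trainer)
```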

rose768 commented 3 years ago

Thanks a lot. What should I put in the tokenizer section of the config file below?

tokenizer:
  dir: ???   # path to directory which contains either tokenizer.model (bpe) or vocab.txt (for wpe)
  type: ???  # Can be either bpe or wpe

In other words, I do not know what should be used as the tokens here.

titu1994 commented 3 years ago

Please review the ASR with Subword Tokenization tutorial on how to build your own tokenizer - https://colab.research.google.com/github/NVIDIA/NeMo/blob/v1.0.0/tutorials/asr/08_ASR_with_Subword_Tokenization.ipynb
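To make that concrete, here is a hedged sketch of how the tutorial's tokenizer-building step feeds the two config fields asked about above. The script flags, manifest path, vocab size, and output directory name are illustrative placeholders patterned on the tutorial, so check them against your NeMo version.

```python
# 1) Build a tokenizer with NeMo's process_asr_text_tokenizer.py script
#    (shipped in the NeMo repo); from a shell this looks roughly like:
#
#      python process_asr_text_tokenizer.py \
#          --manifest=train_manifest.json \
#          --data_root=tokenizers/ \
#          --vocab_size=1024 \
#          --tokenizer=spe \
#          --spe_type=bpe \
#          --log
#
# 2) Point the model config at the resulting directory (it contains
#    tokenizer.model):
#
#      tokenizer:
#        dir: tokenizers/tokenizer_spe_bpe_v1024   # placeholder output path
#        type: bpe
#
# 3) When fine-tuning a pretrained checkpoint instead, the same tokenizer
#    can be attached in code:
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_citrinet_1024")
model.change_vocabulary(
    new_tokenizer_dir="tokenizers/tokenizer_spe_bpe_v1024",  # placeholder path
    new_tokenizer_type="bpe",
)
```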

Enescigdem commented 3 years ago

Hello, thanks for this amazing Citrinet model. I want to fine-tune the Citrinet model on my data. How can I do the speech-recognition fine-tuning? Is there any end-to-end instruction set or example? Thanks in advance @titu1994

titu1994 commented 3 years ago

I'm preparing a tutorial for finetuning Char and subword models on other languages, it should be released in the coming week or two.

titu1994 commented 3 years ago

Both the ASR tutorials also have segments for fine-tuning on the same language (English) in them.

Enescigdem commented 3 years ago

So you mean the current tutorials are enough to fine-tune the model on my data? Thanks a lot for the quick answer.

titu1994 commented 3 years ago

Yes, Tutorial 1 and Tutorial 8 (char and subword, respectively) show how to take a pretrained model and fine-tune it on AN4 (same language - EN).

Most of the steps remain the same for fine-tuning on other languages too.
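For reference, a minimal sketch of the fine-tuning flow those tutorials walk through, assuming NeMo 1.0-era APIs; the manifest paths, batch size, epoch count, and trainer flags are placeholders, and the required data-config fields can differ between versions.

```python
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr
from omegaconf import OmegaConf

# Start from the pretrained checkpoint discussed in this thread.
model = nemo_asr.models.EncDecCTCModelBPE.from_pretrained(model_name="stt_en_citrinet_1024")

# Point the model at your own manifests (placeholder paths and hyperparameters).
model.setup_training_data(train_data_config=OmegaConf.create({
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))
model.setup_validation_data(val_data_config=OmegaConf.create({
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

# Fine-tune with a standard Lightning trainer
# (newer Lightning versions use accelerator="gpu", devices=1 instead of gpus=1).
trainer = pl.Trainer(gpus=1, max_epochs=5)
trainer.fit(model)

# Save the fine-tuned checkpoint (placeholder filename).
model.save_to("citrinet_finetuned.nemo")
```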