NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech).
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Tokenizer suggestion for fine tuning cache aware streaming model #9124

Closed: rkchamp25 closed this issue 4 months ago

rkchamp25 commented 6 months ago

Hi, I want to fine-tune "stt_en_fastconformer_hybrid_large_streaming_multi" on custom data. My dataset contains terms like "Vitamin B12", "Code: c12r5", "hb1ac", etc. For these alphanumeric words:

  1. Should I convert these to "vitamin b twelve", "code c one two r five", "h b one a c", etc., so that I can use the default tokenizer?
  2. Or should I create a custom/new tokenizer for this?

If there is any other suggestion, please let me know. Thank you!

titu1994 commented 6 months ago

If you want to fine-tune using the original tokenizer, then yes, you'll need to normalize all numbers to spoken words.

Changing the tokenizer means you'll need a large amount of data to retrain the model; that is not recommended unless you have several thousand hours of speech to reach the best results.
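As a rough illustration of the normalization described above, the sketch below spells out mixed alphanumeric terms in the transcripts before fine-tuning. It is not part of NeMo; the num2words dependency and the choice between grouping digits ("twelve") and reading them individually ("one two") are assumptions and should match how the terms are actually spoken in the recordings.

```python
import re

from num2words import num2words  # assumption: third-party package, pip install num2words

DIGIT_WORDS = "zero one two three four five six seven eight nine".split()


def spell_out(token: str, group_digits: bool = False) -> str:
    """Expand one alphanumeric token, e.g. "c12r5" -> "c one two r five"."""
    parts = []
    for chunk in re.findall(r"\d+|[A-Za-z]+", token):
        if chunk.isdigit():
            if group_digits:
                parts.append(num2words(int(chunk)))                # "12" -> "twelve"
            else:
                parts.extend(DIGIT_WORDS[int(d)] for d in chunk)   # "12" -> "one two"
        elif len(chunk) <= 3:
            parts.extend(chunk.lower())                            # "hb" -> "h b"
        else:
            parts.append(chunk.lower())                            # keep real words intact
    return " ".join(parts)


def normalize_transcript(text: str) -> str:
    """Lowercase the transcript and spell out any token containing a digit."""
    out = []
    for tok in text.split():
        tok = re.sub(r"[^A-Za-z0-9]", "", tok)  # drop punctuation such as ':'
        if not tok:
            continue
        out.append(spell_out(tok) if re.search(r"\d", tok) else tok.lower())
    return " ".join(out)


print(normalize_transcript("Vitamin B12"))   # vitamin b one two
print(normalize_transcript("Code: c12r5"))   # code c one two r five
print(normalize_transcript("hb1ac"))         # h b one a c
```

The same function would be applied to the "text" field of every entry in the training and validation manifests so that the transcripts only contain words the original tokenizer was trained on.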

bfss commented 5 months ago

> If you want to fine-tune using the original tokenizer, then yes, you'll need to normalize all numbers to spoken words.
>
> Changing the tokenizer means you'll need a large amount of data to retrain the model; that is not recommended unless you have several thousand hours of speech to reach the best results.

How do I use the original tokenizer? I also created a discussion for this.
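For reference, the pretrained .nemo checkpoint already bundles its SentencePiece tokenizer, so fine-tuning with the original tokenizer amounts to restoring the model and simply not calling change_vocabulary. The sketch below is a rough outline under that assumption; the manifest paths, batch sizes, and trainer settings are placeholders, not a recipe confirmed in this thread.

```python
from omegaconf import OmegaConf
import pytorch_lightning as pl
import nemo.collections.asr as nemo_asr

# Restoring from the pretrained checkpoint keeps the bundled tokenizer.
# Skipping model.change_vocabulary(...) means the original vocabulary is reused.
model = nemo_asr.models.ASRModel.from_pretrained(
    model_name="stt_en_fastconformer_hybrid_large_streaming_multi"
)

# Point the existing data loaders at the custom, already text-normalized manifests.
model.setup_training_data(OmegaConf.create({
    "manifest_filepath": "train_manifest.json",   # placeholder path
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": True,
}))
model.setup_validation_data(OmegaConf.create({
    "manifest_filepath": "dev_manifest.json",     # placeholder path
    "sample_rate": 16000,
    "batch_size": 16,
    "shuffle": False,
}))

trainer = pl.Trainer(devices=1, accelerator="gpu", max_epochs=50)  # illustrative settings
model.set_trainer(trainer)
trainer.fit(model)
```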

github-actions[bot] commented 4 months ago

This issue is stale because it has been open for 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

github-actions[bot] commented 4 months ago

This issue was closed because it has been inactive for 7 days since being marked as stale.