NVIDIA / NeMo

A scalable generative AI framework built for researchers and developers working on Large Language Models, Multimodal, and Speech AI (Automatic Speech Recognition and Text-to-Speech)
https://docs.nvidia.com/nemo-framework/user-guide/latest/overview.html
Apache License 2.0

Problem faced during Training on Different Language. #1330

Closed: gopesh97 closed this issue 3 years ago

gopesh97 commented 3 years ago

I am using the pre-trained QuartzNet 15x5 model for transfer learning on Hindi, with a different vocabulary (Devanagari characters).

While training, I am facing two main issues:

  1. The reference string printed during training is blank.
  2. When saving the model with quartznet.save_to('path/to/save'), I get: 'ascii' codec can't encode character '\u091b' in position 4407: ordinal not in range(128)

Please suggest how to resolve these issues.

Environment overview

Additional context: GPU model - 4 x RTX 2080 Ti, 12 GB VRAM.

rbracco commented 3 years ago

I am facing similar issues with Unicode. Can you share some example lines from your manifest? One possibility is that you need to pass ensure_ascii=False to json.dump when generating the manifests: json.dump(metadata, f, ensure_ascii=False)
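
For reference, here is a minimal sketch of writing a manifest this way. It assumes the usual one-JSON-object-per-line layout with audio_filepath, duration, and text fields. The sample entries, file names, and the write_manifest / read_manifest helpers are hypothetical and only for illustration.

```python
import json
import os
import tempfile

# Hypothetical sample entries; paths and durations are made up for illustration.
entries = [
    {"audio_filepath": "audio/sample_0001.wav", "duration": 2.5, "text": "नमस्ते"},
    {"audio_filepath": "audio/sample_0002.wav", "duration": 3.1, "text": "छात्र"},
]


def write_manifest(path, entries):
    """Write one JSON object per line, keeping non-ASCII characters unescaped."""
    # An explicit encoding avoids depending on the locale's default encoding.
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            json.dump(entry, f, ensure_ascii=False)
            f.write("\n")


def read_manifest(path):
    """Read the manifest back as a list of dicts, skipping empty lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


manifest_path = os.path.join(tempfile.mkdtemp(), "train_manifest.json")
write_manifest(manifest_path, entries)
loaded = read_manifest(manifest_path)
```

Opening the file with encoding="utf-8" matters as much as ensure_ascii=False here: without it, Python falls back to the locale's default encoding, which may be ASCII in some containers.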

samabdullah commented 3 years ago

@gopesh97 have you had any success with Hindi? I'm working on Urdu and am curious whether this NeMo model works for right-to-left languages.

marco-radic commented 3 years ago

Ensure that your environment (native or in a container) has the correct language settings. Most ASCII encoding issues can be solved by setting the LANG and LC_ALL environment variables to a UTF-8 locale (for example, en_US.UTF-8).
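
A quick way to check what Python actually sees is a small diagnostic like the sketch below. The encoding_report and looks_utf8 helpers are hypothetical names, and the suggested locale values are examples that may not be installed on every system.

```python
import locale
import os
import sys


def encoding_report():
    """Collect the locale variables and the encodings Python derived from them."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        "preferred_encoding": locale.getpreferredencoding(False),
        "filesystem_encoding": sys.getfilesystemencoding(),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


def looks_utf8(name):
    """Return True if an encoding name is some spelling of UTF-8."""
    if name is None:
        return False
    return name.lower().replace("-", "").replace("_", "") == "utf8"


report = encoding_report()
for key, value in report.items():
    print(f"{key}: {value}")

if not looks_utf8(report["preferred_encoding"]):
    # open() without an explicit encoding uses this default for text files.
    print("Default text encoding is not UTF-8; consider setting LANG and "
          "LC_ALL to a UTF-8 locale such as C.UTF-8 or en_US.UTF-8.")
```

If preferred_encoding shows something like ANSI_X3.4-1968 (ASCII), any file opened without an explicit encoding will fail on Devanagari text with the same kind of 'ascii' codec error.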