Problem faced during Training on Different Language.

gopesh97 commented 3 years ago

I am using the pre-trained QuartzNet 15x5 model for transfer learning on Hindi, with a different vocabulary (Devanagari characters).

While training, I am facing two main issues:

1. The reference string that appears during training is blank.
2. While saving the model using quartznet.save_to('path/to/save'), I get: 'ascii' codec can't encode character '\u091b' in position 4407: ordinal not in range(128)

Please suggest how to resolve these two issues.

Environment overview

Environment location: PyTorch GPU docker
Method of NeMo install: Installed from source

Additional context: GPU model: 4 x RTX 2080 Ti, 12 GB VRAM.

rbracco commented 3 years ago

I am facing similar issues with Unicode. Can you share some example lines from your manifest? One possibility is that you need to pass ensure_ascii=False to json.dump when generating the manifests: json.dump(metadata, f, ensure_ascii=False)
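The suggestion above can be sketched as follows. This is a minimal example, not the original poster's script: the file paths, durations, and sample text are made up, and the field names (audio_filepath, duration, text) assume the usual one-JSON-object-per-line NeMo ASR manifest layout. Note that with the default ensure_ascii=True, json.dump writes \uXXXX escapes, which json.loads still decodes back correctly, so this change mainly makes the manifest human-readable and removes one variable while debugging.

```python
import json

# Hypothetical sample entries; the field names assume the common NeMo ASR
# manifest layout (one JSON object per line).
entries = [
    {"audio_filepath": "audio/sample_0001.wav", "duration": 2.4, "text": "नमस्ते"},
    {"audio_filepath": "audio/sample_0002.wav", "duration": 3.1, "text": "छात्र"},
]


def write_manifest(path, entries):
    """Write one JSON object per line, keeping non-ASCII text unescaped."""
    # An explicit encoding makes the result independent of the system locale.
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps Devanagari characters as-is
            # instead of writing \uXXXX escape sequences.
            json.dump(entry, f, ensure_ascii=False)
            f.write("\n")


def read_manifest(path):
    """Read the manifest back, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]
```

Reading the file back with read_manifest and printing the text fields is a quick way to confirm that the transcripts are not empty before training starts.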

samabdullah commented 3 years ago

@gopesh97 have you had any success with Hindi? I'm working on Urdu and am curious whether this NeMo model works for right-to-left languages.

marco-radic commented 3 years ago

Ensure that you have the correct language settings in your environment (native or in a container). Most ASCII encoding issues can be solved by setting the LANG and LC_ALL environment variables to a UTF-8 locale.
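A small check along these lines can show whether the Python process inside the container is defaulting to ASCII. It is only a diagnostic sketch; the helper names are made up, and whether a locale such as C.UTF-8 or en_US.UTF-8 is available depends on the image, so treat those values as assumptions. The thread does not confirm that the locale is the cause of the save_to error.

```python
import locale
import os
import sys


def report_encoding():
    """Collect the settings that decide how Python encodes text by default."""
    return {
        "LANG": os.environ.get("LANG"),
        "LC_ALL": os.environ.get("LC_ALL"),
        # Encoding used by open() when no encoding argument is given.
        "preferred_encoding": locale.getpreferredencoding(False),
        "stdout_encoding": getattr(sys.stdout, "encoding", None),
    }


def can_encode(text, encoding):
    """Return True if text can be encoded with the given codec."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for key, value in report_encoding().items():
        print(f"{key}: {value}")
    # '\u091b' is the Devanagari character from the error message in this issue.
    preferred = locale.getpreferredencoding(False)
    print("can encode with preferred encoding:", can_encode("\u091b", preferred))
```

If the preferred encoding is reported as an ASCII variant, exporting LANG and LC_ALL with a UTF-8 value before starting Python (for example in the Dockerfile or the shell that launches training) is the change suggested above.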

NVIDIA / NeMo, issue #1330