Closed: annamine closed this issue 4 years ago
Does temporarily unsetting LC_ALL, i.e. running LC_ALL= python3 ../../scripts/spm_encode.py, help?
Still the same error, unfortunately.
What if you set LC_ALL= around the snippet:

LC_ALL=
cut -f 2- -d" " $text | \
  python3 ../../scripts/spm_encode.py --model=${sentencepiece_model}.model --output_format=piece | \
  paste -d" " <(cut -f 1 -d" " $text) - > $token_text
cut -f 2- -d" " $token_text > $lmdatadir/$dataset.tokens
LC_ALL=C
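For background on why the locale prefix matters here (a diagnostic sketch, not part of the recipe): Python derives its default text encoding for stdin/stdout from the locale environment (LC_ALL / LC_CTYPE / LANG) unless PYTHONIOENCODING overrides it, so spm_encode.py can hit encoding errors under an ASCII locale. You can check what a given environment yields with:

```python
import locale
import sys

# Python picks its default I/O encoding from the locale environment
# (LC_ALL / LC_CTYPE / LANG) unless PYTHONIOENCODING overrides it.
# Under an ASCII locale such as LC_ALL=C, writing SentencePiece output
# containing non-ASCII characters can fail.
print("preferred encoding:", locale.getpreferredencoding(False))
print("stdout encoding:", sys.stdout.encoding)
```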
I had the same issue for LibriSpeech. I ran

LANG="" cut -f 2- -d" " $text | \
  python3 ../../scripts/spm_encode.py ....

and it worked for me.
Hmm... $LANG in my environment is en_US.UTF-8, and I don't have this problem. Maybe you can check your default $LANG value.
My $LANG is en_GB.UTF-8. I also tried to set LANG="" cut -f 2- -d" " $text | \ python3 ../../scripts/spm_encode.py ...., but it still returns the same error message.
I have the same error for Librispeech recipe too
Sorry, I am not in your environment, so it's not easy for me to debug. I just googled the error message, and all I could find is export LANG=en_US.UTF-8, export LC_ALL=en_US.UTF-8, or export PYTHONIOENCODING=utf-8.
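One way to confirm that a setting like export PYTHONIOENCODING=utf-8 actually reaches the Python process is to ask a child interpreter what encoding its stdout ended up with (a small illustrative check, not part of the recipe):

```python
import os
import subprocess
import sys

# Run a child interpreter with PYTHONIOENCODING forced to utf-8 and
# print the encoding its stdout was configured with. This mirrors what
# `export PYTHONIOENCODING=utf-8` does for spm_encode.py.
env = {**os.environ, "PYTHONIOENCODING": "utf-8"}
result = subprocess.run(
    [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
    env=env, capture_output=True, text=True,
)
print(result.stdout.strip())  # typically: utf-8
```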
Hi, thanks a lot for the wonderful tool. I tried to build a model for an African language using the WSJ recipe. Language and acoustic model training finished without error after setting LANG=en_US.UTF-8, LC_ALL=en_US.UTF-8, and PYTHONIOENCODING=utf-8.
I now get an error during decoding; the error message is UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)
I tried various online recommendations for this problem in speech_recognize.py, but could not solve it.
Could you help?
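For reference, this class of error can be reproduced in isolation (an illustrative sketch; the Ge'ez sample text is just an example of non-ASCII input): the ascii codec simply cannot represent characters outside the 0-127 range.

```python
# Minimal reproduction of the error class reported above: the 'ascii'
# codec cannot encode characters outside the 0-127 range, so encoding
# (or printing to an ASCII-configured stream) raises UnicodeEncodeError.
text = "ASCII text + ትግርኛ"  # illustrative non-ASCII sample
try:
    text.encode("ascii")
except UnicodeEncodeError as exc:
    print(type(exc).__name__, "-", exc.reason)
```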
which line does it happen at?
Hi, thanks for your prompt reply. The error occurs at lines 297, 293, 39, and 191, as shown in the following message.
loading model(s) from exp/lstm/checkpoint_best.pt:exp/lm_lstm/checkpoint_best.pt
LM fusion with Subword LM
using LM fusion with lm-weight=0.70
0%| | 0/26 [00:00<?, ?it/s]/pytorch/aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
Traceback (most recent call last):
File "/home/myt_002/espresso/examples/Tigrigna_E2E_ASR/../../espresso/speech_recognize.py", line 297, in
Best regards,
I would first print 'T-{}\t{}'.format(utt_id, detok_target_str) to the screen to see if the string is displayed normally. If yes, then the problem may occur when it gets written out to output_file; in that case I would try adding an encoding argument at line 38: open(output_path, 'w', buffering=1, encoding='utf-8')
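A minimal sketch of that suggested fix (the path, variable names, and sample data here are illustrative placeholders, not Espresso's actual code): opening the output file with an explicit UTF-8 encoding lets non-ASCII text be written regardless of the system locale.

```python
# Sketch of the suggested fix; names and data are placeholders.
output_path = "decode_results.txt"
utt_id, detok_target_str = "utt001", "ሰላም ዓለም"  # hypothetical utterance

# Passing encoding='utf-8' makes the writer independent of the locale,
# so non-ASCII reference/hypothesis text no longer raises
# UnicodeEncodeError when written out.
with open(output_path, "w", buffering=1, encoding="utf-8") as output_file:
    print("T-{}\t{}".format(utt_id, detok_target_str), file=output_file)
```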
Hi just to give an update, I managed to run the section of code without error now by changing the global paths. Thanks for your help & advice!
Thanks a lot. Adding the encoding argument at line 38 solved the problem; I can now decode without issues.
Do you have a recipe for multilingual training?
Best regards.
Cool. What do you mean by "global paths"?
No, I don't have a recipe for multilingual training yet.
Regarding the "global paths": I just needed to modify the path script for my environment to use the same en_US.UTF-8 locale, which stopped the sorting error.
Hi, I am trying to run the SWBD recipe on my local machine. I am getting errors at Stage 2 of the run script (building the dictionary and text tokenization). The error seems to come from the "tokenizing text for train/valid/test sets..." stage, which runs spm_encode.py.
This is the full shell output:
What have you tried?
My setup should be OK, as I have been running the WSJ recipe without issue, but I notice that a different script is used here for tokenizing. Any help or advice would be great!