freewym / espresso

Espresso: A Fast End-to-End Neural Speech Recognition Toolkit

SWBD Recipe Error #32

Closed annamine closed 4 years ago

annamine commented 4 years ago

Hi, I am trying to run the SWBD recipe on my local machine. I am getting errors at Stage 2 of the run script (building the dictionary and text tokenization). The error seems to come from the "tokenizing text for train/valid/test sets..." step, which runs spm_encode.py.

Code

This is the full shell output:

sentencepiece_trainer.cc(116) LOG(INFO) Running command: --bos_id=-1 --pad_id=0 --eos_id=1 --unk_id=2 --input=data/lang/input --vocab_size=1003 --character_coverage=1.0 --model_type=unigram --model_prefix=data/lang/train_nodup_unigram1000 --input_sentence_size=10000000 --user_defined_symbols=[laughter],[noise],[vocalized-noise]
sentencepiece_trainer.cc(49) LOG(INFO) Starts training with :
TrainerSpec {
  input: data/lang/input
  input_format:
  model_prefix: data/lang/train_nodup_unigram1000
  model_type: UNIGRAM
  vocab_size: 1003
  self_test_sample_size: 0
  character_coverage: 1
  input_sentence_size: 10000000
  shuffle_input_sentence: 1
  seed_sentencepiece_size: 1000000
  shrinking_factor: 0.75
  max_sentence_length: 4192
  num_threads: 16
  num_sub_iterations: 2
  max_sentencepiece_length: 16
  split_by_unicode_script: 1
  split_by_number: 1
  split_by_whitespace: 1
  treat_whitespace_as_suffix: 0
  user_defined_symbols: [laughter]
  user_defined_symbols: [noise]
  user_defined_symbols: [vocalized-noise]
  hard_vocab_limit: 1
  use_all_vocab: 0
  unk_id: 2
  bos_id: -1
  eos_id: 1
  pad_id: 0
  unk_piece: <unk>
  bos_piece: <s>
  eos_piece: </s>
  pad_piece: <pad>
  unk_surface:  ⁇
}
NormalizerSpec {
  name: nmt_nfkc
  add_dummy_prefix: 1
  remove_extra_whitespaces: 1
  escape_whitespaces: 1
  normalization_rule_tsv:
}

trainer_interface.cc(267) LOG(INFO) Loading corpus: data/lang/input
trainer_interface.cc(139) LOG(INFO) Loaded 1000000 lines
trainer_interface.cc(139) LOG(INFO) Loaded 2000000 lines
trainer_interface.cc(114) LOG(WARNING) Too many sentences are loaded! (2416025), which may slow down training.
trainer_interface.cc(116) LOG(WARNING) Consider using --input_sentence_size=<size> and --shuffle_input_sentence=true.
trainer_interface.cc(119) LOG(WARNING) They allow to randomly sample <size> sentences from the entire corpus.
trainer_interface.cc(315) LOG(INFO) Loaded all 2416025 sentences
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <pad>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: </s>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: <unk>
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [laughter]
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [noise]
trainer_interface.cc(330) LOG(INFO) Adding meta_piece: [vocalized-noise]
trainer_interface.cc(335) LOG(INFO) Normalizing sentences...
trainer_interface.cc(384) LOG(INFO) all chars count=120465092
trainer_interface.cc(392) LOG(INFO) Done: 100% characters are covered.
trainer_interface.cc(402) LOG(INFO) Alphabet size=43
trainer_interface.cc(403) LOG(INFO) Final character coverage=1
trainer_interface.cc(435) LOG(INFO) Done! preprocessed 2416025 sentences.
unigram_model_trainer.cc(129) LOG(INFO) Making suffix array...
unigram_model_trainer.cc(133) LOG(INFO) Extracting frequent sub strings...
unigram_model_trainer.cc(184) LOG(INFO) Initialized 166028 seed sentencepieces
trainer_interface.cc(441) LOG(INFO) Tokenizing input sentences with whitespace: 2416025
trainer_interface.cc(451) LOG(INFO) Done! 69957
unigram_model_trainer.cc(470) LOG(INFO) Using 69957 sentences for EM training
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=59852 obj=9.23769 num_tokens=130093 num_tokens/piece=2.17358
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=44412 obj=7.29956 num_tokens=132354 num_tokens/piece=2.98014
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=33308 obj=7.24442 num_tokens=141637 num_tokens/piece=4.25234
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=33303 obj=7.23651 num_tokens=141660 num_tokens/piece=4.25367
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=24977 obj=7.21871 num_tokens=158375 num_tokens/piece=6.34083
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=24977 obj=7.21644 num_tokens=158399 num_tokens/piece=6.34179
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=18732 obj=7.21162 num_tokens=175442 num_tokens/piece=9.3659
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=18732 obj=7.20821 num_tokens=175404 num_tokens/piece=9.36387
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=14049 obj=7.21798 num_tokens=192101 num_tokens/piece=13.6736
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=14049 obj=7.21295 num_tokens=192059 num_tokens/piece=13.6707
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=10536 obj=7.23918 num_tokens=207654 num_tokens/piece=19.709
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=10536 obj=7.23244 num_tokens=207609 num_tokens/piece=19.7047
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=7902 obj=7.27241 num_tokens=221580 num_tokens/piece=28.041
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=7902 obj=7.26387 num_tokens=221484 num_tokens/piece=28.0289
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=5926 obj=7.32839 num_tokens=234743 num_tokens/piece=39.6124
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=5926 obj=7.31716 num_tokens=234693 num_tokens/piece=39.6039
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=4444 obj=7.40817 num_tokens=248571 num_tokens/piece=55.9341
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=4444 obj=7.39317 num_tokens=248418 num_tokens/piece=55.8996
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=3333 obj=7.50897 num_tokens=262750 num_tokens/piece=78.8329
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=3333 obj=7.49001 num_tokens=262534 num_tokens/piece=78.7681
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=2499 obj=7.64161 num_tokens=276859 num_tokens/piece=110.788
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=2499 obj=7.61733 num_tokens=276640 num_tokens/piece=110.7
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1874 obj=7.80273 num_tokens=292799 num_tokens/piece=156.243
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1874 obj=7.77333 num_tokens=292543 num_tokens/piece=156.106
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1405 obj=7.99379 num_tokens=309225 num_tokens/piece=220.089
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1405 obj=7.95503 num_tokens=308821 num_tokens/piece=219.801
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=0 size=1103 obj=8.15973 num_tokens=321388 num_tokens/piece=291.376
unigram_model_trainer.cc(486) LOG(INFO) EM sub_iter=1 size=1103 obj=8.12422 num_tokens=321274 num_tokens/piece=291.273
trainer_interface.cc(507) LOG(INFO) Saving model: data/lang/train_nodup_unigram1000.model
trainer_interface.cc(531) LOG(INFO) Saving vocabs: data/lang/train_nodup_unigram1000.vocab
Traceback (most recent call last):
  File "../../scripts/spm_encode.py", line 99, in <module>
    main()
  File "../../scripts/spm_encode.py", line 90, in main
    print(" ".join(enc_line), file=output_h)
UnicodeEncodeError: 'ascii' codec can't encode character '\u2581' in position 0: ordinal not in range(128)

What have you tried?

My setup should be fine, as I have been running the WSJ recipe without issue, but I notice that a different script is used here for tokenization. Any help or advice would be great!
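
For reference, U+2581 is "▁", the word-boundary marker that SentencePiece prepends to pieces, so every encoded line starts with it. Below is a minimal, hypothetical reproduction of the failure outside the recipe, using only the standard library:

import io

# Hypothetical stand-in for a stdout that picked up the 'ascii' codec from a
# C/POSIX locale; printing the SentencePiece boundary marker U+2581 ("▁")
# through it raises the same error as in the traceback above.
ascii_stream = io.TextIOWrapper(io.BytesIO(), encoding="ascii")
print("\u2581hello", file=ascii_stream)
# UnicodeEncodeError: 'ascii' codec can't encode character '\u2581' in position 0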

freewym commented 4 years ago

Does temporarily unsetting LC_ALL, i.e. running LC_ALL= python3 ../../scripts/spm_encode.py, help?

annamine commented 4 years ago

Still the same error, unfortunately.

freewym commented 4 years ago

what if you set LC_ALL= around the snippet:

LC_ALL= cut -f 2- -d" " $text | \
  python3 ../../scripts/spm_encode.py --model=${sentencepiece_model}.model --output_format=piece | \
  paste -d" " <(cut -f 1 -d" " $text) - > $token_text
cut -f 2- -d" " $token_text > $lmdatadir/$dataset.tokens
LC_ALL=C

jinpoon commented 4 years ago

I had the same issue for librispeech. I did

LANG="" cut -f 2- -d" " $text | \
  python3 ../../scripts/spm_encode.py ...

and it worked for me.

freewym commented 4 years ago

Hmm... $LANG in my environment is en_US.UTF-8, and I don't have this problem. Maybe you can check your default $LANG value.

annamine commented 4 years ago

My $LANG environment variable is en_GB.UTF-8. I also tried

LANG="" cut -f 2- -d" " $text | \
  python3 ../../scripts/spm_encode.py ...

but both are still returning the same error message.

annamine commented 4 years ago

I have the same error for the Librispeech recipe too.

freewym commented 4 years ago

Sorry, I am not in your environment, so it's not easy for me to debug. I just googled the error message, and all I could find is export LANG=en_US.UTF-8, export LC_ALL=en_US.UTF-8, or export PYTHONIOENCODING=utf-8.
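
In case it helps narrow this down, here is a quick, hypothetical diagnostic (not part of the recipe) showing which encoding Python actually picked up, which is what those environment variables control:

import locale
import sys

# Under a C/POSIX locale, sys.stdout.encoding is typically ASCII
# ('ANSI_X3.4-1968'), which is exactly what the traceback complains about.
# With LANG/LC_ALL set to a UTF-8 locale, or with PYTHONIOENCODING=utf-8,
# it reports 'utf-8' and the print in spm_encode.py succeeds.
print("stdout encoding:   ", sys.stdout.encoding)
print("preferred encoding:", locale.getpreferredencoding())

Running this with and without the exports also shows whether something later in the run script resets the locale before spm_encode.py is invoked.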

marthayifiru commented 4 years ago

Hi, thanks a lot for the wonderful tool. I tried to build a model for an African language using the WSJ recipe. Language model and acoustic model training finished without error after setting LANG=en_US.UTF-8, LC_ALL=en_US.UTF-8, and PYTHONIOENCODING=utf-8.

I now have an error during decoding; the error message is UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128).

I tried various online recommendations for this problem in speech_recognize.py, but could not solve it.

Could you help?

freewym commented 4 years ago

> Hi, thanks a lot for the wonderful tool. I tried to build a model for an African language using the WSJ recipe. Language model and acoustic model training finished without error after setting LANG=en_US.UTF-8, LC_ALL=en_US.UTF-8, and PYTHONIOENCODING=utf-8.
>
> I now have an error during decoding; the error message is UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128).
>
> I tried various online recommendations for this problem in speech_recognize.py, but could not solve it.
>
> Could you help?

which line does it happen at?

marthayifiru commented 4 years ago

Hi, thanks for your prompt reply.

> which line does it happen at?

The lines are 297, 293, 39, and 191, as shown in the following traceback.

loading model(s) from exp/lstm/checkpoint_best.pt:exp/lm_lstm/checkpoint_best.pt
LM fusion with Subword LM
using LM fusion with lm-weight=0.70
0%| | 0/26 [00:00<?, ?it/s]
/pytorch/aten/src/ATen/native/BinaryOps.cpp:81: UserWarning: Integer division of tensors using div or / is deprecated, and in a future release div will perform true division as in Python 3. Use true_divide or floor_divide (// in Python) instead.
Traceback (most recent call last):
  File "/home/myt_002/espresso/examples/Tigrigna_E2E_ASR/../../espresso/speech_recognize.py", line 297, in <module>
    cli_main()
  File "/home/myt_002/espresso/examples/Tigrigna_E2E_ASR/../../espresso/speech_recognize.py", line 293, in cli_main
    main(args)
  File "/home/myt_002/espresso/examples/Tigrigna_E2E_ASR/../../espresso/speech_recognize.py", line 39, in main
    return _main(args, h)
  File "/home/myt_002/espresso/examples/Tigrigna_E2E_ASR/../../espresso/speech_recognize.py", line 191, in _main
    print('T-{}\t{}'.format(utt_id, detok_target_str), file=output_file)
UnicodeEncodeError: 'ascii' codec can't encode characters in position 11-14: ordinal not in range(128)

Best regards,

freewym commented 4 years ago

I would first print 'T-{}\t{}'.format(utt_id, detok_target_str) to the screen to see if the string is displayed normally. If yes, then the problem is probably when it gets written out to output_file, and I would try adding an encoding argument at line 38: open(output_path, 'w', buffering=1, encoding='utf-8').
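
For reference, a minimal sketch of that change with placeholder values (not the actual speech_recognize.py code): opening the output file with an explicit UTF-8 encoding makes the write independent of the locale-derived default codec.

utt_id = "utt_0001"                 # dummy utterance id
detok_target_str = "ሰላም"            # dummy non-ASCII target string
output_path = "decode_results.txt"  # placeholder path

# encoding='utf-8' means non-ASCII characters in the hypotheses/references
# no longer trigger UnicodeEncodeError, regardless of LANG/LC_ALL.
with open(output_path, "w", buffering=1, encoding="utf-8") as output_file:
    print("T-{}\t{}".format(utt_id, detok_target_str), file=output_file)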

annamine commented 4 years ago

Hi, just to give an update: I managed to run this section of code without error by changing the global paths. Thanks for your help & advice!

marthayifiru commented 4 years ago

Thanks a lot. Adding the encoding argument at line 38 solved the problem; I can now decode without issues.

Do you have a recipe for multilingual training?

Best regards.

freewym commented 4 years ago

> Hi, just to give an update: I managed to run this section of code without error by changing the global paths. Thanks for your help & advice!

Cool. What do you mean by "global paths"?

freewym commented 4 years ago

> Thanks a lot. Adding the encoding argument at line 38 solved the problem; I can now decode without issues.
>
> Do you have a recipe for multilingual training?

No. I don't have one yet.

annamine commented 4 years ago

> > Hi, just to give an update: I managed to run this section of code without error by changing the global paths. Thanks for your help & advice!
>
> Cool. What do you mean by "global paths"?

I just needed to modify the path script for my environment to use the same en_US.UTF-8 locale, which stopped the sorting error.