danpovey / pocolm

Small language toolkit for creation, interpolation and pruning of ARPA language models

UnicodeDecodeError in GetNumWords of prepare_int_data.py #110

Closed ma08 closed 2 years ago

ma08 commented 2 years ago

The following error occurs when running prepare_int_data.py on Telugu and Tamil data, and presumably on other non-ASCII languages, because the output of a subprocess is decoded as ASCII:

Traceback (most recent call last):
  File "/home/sourya4/kaldi/egs/tamil_telugu_proj/s5_r3/../../../tools/pocolm/scripts/prepare_int_data.py", line 168, in <module>
    num_words = GetNumWords(args.vocab)
  File "/home/sourya4/kaldi/egs/tamil_telugu_proj/s5_r3/../../../tools/pocolm/scripts/prepare_int_data.py", line 75, in GetNumWords
    universal_newlines=True)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 850, in communicate
    stdout = self.stdout.read()
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)
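
To illustrate the last line of the traceback: Telugu and Tamil characters encode to three UTF-8 bytes whose first byte is 0xe0, which the ASCII codec rejects. The snippet below is a minimal sketch of that failure in isolation; `decode_or_error` is a hypothetical helper for the demonstration and is not part of pocolm.

```python
# The Telugu letter "త" (U+0C24) as UTF-8 bytes; its first byte is 0xe0.
data = "త".encode("utf-8")


def decode_or_error(raw, encoding):
    """Decode raw bytes, returning the error text instead of raising."""
    try:
        return raw.decode(encoding)
    except UnicodeDecodeError as err:
        # Same exception type as the one in the traceback above.
        return "{}: {}".format(type(err).__name__, err)


if __name__ == "__main__":
    print(hex(data[0]))                     # first byte of the encoded letter
    print(decode_or_error(data, "utf-8"))   # decodes cleanly
    print(decode_or_error(data, "ascii"))   # reports the decode failure
```

Decoding succeeds with "utf-8" and fails with "ascii", which suggests the subprocess output in the traceback was being read with an ASCII default.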

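Fixed by adding encoding='utf-8' to the subprocess.check_output call in GetNumWords. With universal_newlines=True alone, the output is decoded using the locale's default encoding, which was ASCII here.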
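
A minimal sketch of that kind of change follows. It is not the actual body of GetNumWords: the function names, the child command, and the "word id" line format are assumptions made for illustration. Only the `encoding='utf-8'` argument to `subprocess.check_output` (available since Python 3.6) comes from the fix described above.

```python
import subprocess
import sys

# Hypothetical child command: prints the last line of the file given as
# argv[1] as raw bytes, standing in for whatever GetNumWords actually runs.
_CHILD_CODE = (
    "import sys\n"
    "with open(sys.argv[1], 'rb') as f:\n"
    "    sys.stdout.buffer.write(f.read().splitlines()[-1])\n"
)


def last_line_utf8(path):
    """Return the last line of a file, decoding the child's output as UTF-8."""
    return subprocess.check_output(
        [sys.executable, "-c", _CHILD_CODE, path],
        encoding="utf-8",  # decode explicitly instead of using the locale default
    )


def get_num_words_sketch(vocab_path):
    """Assumes each vocab line is '<word> <integer-id>' and returns the last id."""
    fields = last_line_utf8(vocab_path).split()
    return int(fields[1])
```

Passing the encoding explicitly makes the result independent of the LANG and LC_ALL settings of the shell that launches the script, which is usually where an ASCII default comes from.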

I have submitted Pull Request #109 with the fix, though I am not sure it is comprehensive.

ma08 commented 2 years ago

Closing this issue as the PR is merged.