UnicodeDecodeError in GetNumWords of prepare_int_data.py

The following error occurs when working with Telugu and Tamil data, and presumably with other languages whose text is not ASCII, because of an encoding issue:

Traceback (most recent call last):
  File "/home/sourya4/kaldi/egs/tamil_telugu_proj/s5_r3/../../../tools/pocolm/scripts/prepare_int_data.py", line 168, in <module>
    num_words = GetNumWords(args.vocab)
  File "/home/sourya4/kaldi/egs/tamil_telugu_proj/s5_r3/../../../tools/pocolm/scripts/prepare_int_data.py", line 75, in GetNumWords
    universal_newlines=True)
  File "/usr/lib/python3.6/subprocess.py", line 356, in check_output
    **kwargs).stdout
  File "/usr/lib/python3.6/subprocess.py", line 425, in run
    stdout, stderr = process.communicate(input, timeout=timeout)
  File "/usr/lib/python3.6/subprocess.py", line 850, in communicate
    stdout = self.stdout.read()
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe0 in position 0: ordinal not in range(128)

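Fixed by adding encoding='utf-8' to the subprocess.check_output call in GetNumWords. With only universal_newlines=True, Python decodes the command's output using the locale's preferred encoding, which is ASCII here (see the ascii.py frame in the traceback), so the first non-ASCII byte raises the error.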
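For reference, a minimal sketch of the same kind of change, not the actual code from prepare_int_data.py. The helper name read_utf8_output and the child command are hypothetical and only serve to produce non-ASCII output; the one real ingredient is the encoding argument of subprocess.check_output, available since Python 3.6.

```python
import subprocess
import sys


def read_utf8_output(command):
    """Run command and return its stdout decoded as UTF-8.

    Hypothetical helper for illustration. Passing encoding='utf-8'
    makes subprocess open stdout in text mode with that codec instead
    of the locale's preferred encoding (ASCII under LANG=C).
    """
    return subprocess.check_output(command, encoding='utf-8')


# Hypothetical child process: writes the UTF-8 bytes of the Telugu letter
# U+0C24 (0xe0 0xb0 0xa4) followed by a newline. The first byte, 0xe0, is
# the same one the ASCII codec rejects in the traceback above.
child = [sys.executable, '-c',
         "import sys; sys.stdout.buffer.write(b'\\xe0\\xb0\\xa4\\n')"]

text = read_utf8_output(child)
```

If the input might contain bytes that are not valid UTF-8, check_output also accepts an errors argument (for example errors='replace'); whether that is appropriate for this script is a separate question.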

Pull Request #109 has been submitted with the fix. I am not sure whether it is comprehensive, since other calls may have the same problem.
