UTF-8 support in validate_vocab.py

pzelasko commented 5 years ago

The issue is when dealing with a text resource containing non-ASCII symbols - because some environment variables (e.g. LC_ALL=C) are being set somewhere in the middle of running train_lm.py, Python's open() does not correctly detect input encoding.

IMO it's safe to assume that UTF-8 is the standard way of encoding text data today, which is why I'm suggesting to just hard-code it.

Would you be so kind as to merge this or suggest a different approach to supporting non-ASCII characters?

danpovey commented 5 years ago

Does rest of the toolkit deal correctly with UTF-8?

pzelasko commented 5 years ago

It seems to have been the only place needing a fix in order to train an LM on text containing Spanish characters.

danpovey commented 5 years ago

For your change to work correctly on typical systems it would be necessary to change all the shebangs from #!/usr/bin/env python to #!/usr/bin/env python3. But I think there is more to it than that; I had a look at other scripts and I'm not convinced they would work correctly in all cases (actually I'm not sure). What error did you get before you made this change?

pzelasko commented 5 years ago

The error was:

Traceback (most recent call last):
File "/path/to/validate_vocab.py", line 45, in <module>
  for line in f:
File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode
  return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 284: ordinal not in range(128)
validate_lm_dir.py: failed to validate /path/to/words.txt
format_arpa_lm.py: failed to validate input LM directory

I think the character which broke it was ñ:

>>> u'ñ'.encode('utf-8')
'\xc3\xb1'

A workaround that worked for me in the past to keep compatibility between python2 and python3 was from io import open - this is python3's open in python2 standard library. Again, my best guess is that some environment variables like LC_ALL, LANG or PYTHONIOENCODING were set somewhere in the scripts calling validate_vocab.py not to support utf-8 by default, but I didn't delve much deeper (quick grep -R LC_ALL showed that it is modified in a few places).

danpovey commented 5 years ago

Usually we set export LC_ALL=C to avoid locale-dependent problems, in Kaldi scripts. I think the best fix would be to change all the shebangs in the setup to invoke python3 not python, and to modify all the 'open' calls to specify utf-8 explicitly. That would make the scripts independent of the locale, which is always a good idea IMO. Do you have time to do that? If not I can ask someone else.

Dan

On Thu, Jan 24, 2019 at 5:42 AM Piotr Żelasko notifications@github.com wrote:

The error was:

Traceback (most recent call last):

File "/path/to/validate_vocab.py", line 45, in

for line in f:

File "/usr/lib64/python3.6/encodings/ascii.py", line 26, in decode

return codecs.ascii_decode(input, self.errors)[0]

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 284: ordinal not in range(128)

validate_lm_dir.py: failed to validate /path/to/words.txt

format_arpa_lm.py: failed to validate input LM directory

I think the character which broke it was ñ:

u'ñ'.encode('utf-8')

'\xc3\xb1'

A workaround that worked for me in the past to keep compatibility between python2 and python3 was from io import open - this is python3's open in python2 standard library. Again, my best guess is that some environment variables like LC_ALL, LANG or PYTHONIOENCODING were set somewhere in the scripts calling validate_vocab.py not to support utf-8 by default, but I didn't delve much deeper (quick grep -R LC_ALL showed that it is modified in a few places).

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/pull/95#issuecomment-457151646, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJVu5SeIUBu72fifl-I6kv9S8jT_nJ3ks5vGY4XgaJpZM4aH_i2 .

pzelasko commented 5 years ago

Sure, I'll make the changes as soon as I find a spare moment.

danpovey commented 5 years ago

Thanks!

danpovey commented 5 years ago

@hangruizhe, can you please find some examples in Kaldi of using pocolm, and try to run them with this version? You may have to manually get this branch-- ask someone with more git experience if you don't know, and double check that you got the right branch and are not just testing the master.

My main concern is that it doesn't break our existing setups. It should probably also give the same outputs-- that might be a good thing to check-- although small differences may be acceptable (e.g. in case different python versions have different roundoff behavior).

pzelasko commented 5 years ago

You might want to wait 1-2 days so that I can test it myself first on our own setups (will most likely do it tomorrow) :)

pzelasko commented 5 years ago

It works for me :)

huangruizhe commented 5 years ago

Thanks, I will do the test.

pzelasko commented 5 years ago

Hey @danpovey @huangruizhe, any luck with testing this PR? :)

huangruizhe commented 5 years ago

Thanks for the reminder. I have tested egs/tedlium/s5_r3 and egs/iam/v1, there is no problem, but I realized they are both in English. I think I should run egs/yomdle_tamil/v1 and egs/yomdle_zh/v1 and let you know.

danpovey commented 5 years ago

I think the more compatible one is better, i.e.:

import codecs sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

On Mon, Mar 4, 2019 at 12:59 PM huangruizhe notifications@github.com wrote:

@huangruizhe commented on this pull request.

In scripts/wordlist_to_vocab.py https://github.com/danpovey/pocolm/pull/95#discussion_r262175015:

@@ -37,7 +37,7 @@ args = parser.parse_args()

read in the weights.

-words = open(args.wordlist, "r").readlines() +words = open(args.wordlist, "r", encoding="utf-8").readlines()

Thanks for the suggestions. Yes, this problem seems to happen only when wordlist_to_vocab.py is invoked by Kaldi's scripts on the clsp grid. We can check in this situation:

sys.stdout.encoding = ANSI_X3.4-1968

Otherwise, if I run wordlist_to_vocab.py alone from the console of the grid, or on my laptop, it runs correctly, with:

sys.stdout.encoding = UTF-8

Thus, I guess Kaldi must have setup something before hand, so the Python3's behavior is affected.

To set sys.stdout.encoding, we might not set the environment variables, like:

export PYTHONIOENCODING=utf-8

as suggested here https://docs.python.org/3/library/sys.html#sys.stdout or here https://stackoverflow.com/a/6362647, because this might cause other problems in the downstream. Instead, we hope to solve this within Python in runtime. Thus, a more suitable way can be:

import codecs sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

as suggested here https://stackoverflow.com/a/4374457 or here https://stackoverflow.com/a/1169209. I've test this solution and it runs correctly for the moment. Is this what you want? @danpovey https://github.com/danpovey

Another way is like:

sys.stdout.reconfigure(encoding='utf-8')

as suggested here https://stackoverflow.com/a/52372390, but it seems this is new since Python 3.7 and may not be compatible for earlier versions.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/pull/95#discussion_r262175015, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJVu8I5OlELoofHJSPSXDff2skPittnks5vTV7-gaJpZM4aH_i2 .

huangruizhe commented 5 years ago

Ok, I will update and re-test the codes.

pzelasko commented 5 years ago

One more idea - maybe setting LC_ALL=C.UTF-8 would help (and then maybe we wouldn't need to fix the encoding manually)?

danpovey commented 5 years ago

I don't like python scripts' behavior to be locale-dependent; also it's a long-standing tradition that all kaldi programs are run with LC_ALL=C.

On Mon, Mar 4, 2019 at 3:35 PM Piotr Żelasko notifications@github.com wrote:

One more idea - maybe setting LC_ALL=C.UTF-8 would help (and then maybe we wouldn't need to fix the encoding manually)?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/pull/95#issuecomment-469411574, or mute the thread https://github.com/notifications/unsubscribe-auth/ADJVuwtWOiqOmdfiigmbDccC3zIdHxR6ks5vTYOWgaJpZM4aH_i2 .

danpovey commented 2 years ago

@hangruizhe after looking at another newer PR on this I notice we never merged this! It seems we somehow lost this thread. It would be a shame to lose all this work.

Were were going to add to the scripts:

import codecs
sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())

and then re-test, but the thread got lost. Another option is to find a UTF-8 locale and set it in the scripts, but I'm not aware of any specific UTF-8 locale that is going to be consistently available, e.g. there is no C.UTF-8 on my mac if I do "locale -a".

huangruizhe commented 2 years ago

I will look into this.

huangruizhe commented 2 years ago

I've added the following snippet to the very front of every *.py file (in this PR). This should fix the error when the default sys.stdout.encoding (as well as stdin and stderr) or PYTHONIOENCODING is not utf-8.

# If the encoding of the default sys.stdout is not utf-8,
# force it to be utf-8. See PR #95.
if hasattr(sys.stdout, 'encoding') and sys.stdout.encoding.lower() != "utf-8":
    import codecs
    sys.stdout = codecs.getwriter("utf-8")(sys.stdout.detach())
    sys.stderr = codecs.getwriter("utf-8")(sys.stderr.detach())
    sys.stdin = codecs.getreader("utf-8")(sys.stdin.detach())

I happened to see other Kaldi scripts also do the same way, e.g., learn_bpe.py

I've tested the updated codes in the environment where I intentionally set: export PYTHONIOENCODING=ANSI_X3.4-1968 and there is no error anymore when running the LM part in the egs/yomdle_tamil/v1 and egs/yomdle_zh/v1 recipes.

huangruizhe commented 2 years ago

@danpovey Do you know how I could update Piotr's PR? It seems I could not push to his branch

danpovey commented 2 years ago

create copy of his branch

On Thursday, January 27, 2022, huangruizhe @.***> wrote:

@danpovey https://github.com/danpovey Do you know how I could update Piotr's PR? It seems I could not push to his branch

— Reply to this email directly, view it on GitHub https://github.com/danpovey/pocolm/pull/95#issuecomment-1022988362, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAZFLO6VGBMGUM354ZESSTTUYECZJANCNFSM4GQ77C3A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you were mentioned.Message ID: @.***>

danpovey / pocolm

UTF-8 support in validate_vocab.py #95

@huangruizhe commented on this pull request.

read in the weights.