UTF - Githubissues

ruvsv commented 4 years ago

Unfortunately, I have a problem with non-ASCII character sets. Could you add UTF-8 support?

antihutka commented 4 years ago

I fixed vocabulary loading and implemented string decoding for >1 byte tokens. Unicode mode now works fine, and every reasonable tokenization scheme should work too. Please let me know if you hit any other problems, or close this issue if everything is OK.
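To illustrate why tokens longer than one byte need care when decoding, here is a minimal sketch. It is not the repository's actual code. It assumes the vocabulary maps token indices to raw byte strings, and the names `decode_tokens` and `idx_to_token` are hypothetical. Because a single UTF-8 character can be split across tokens, the sketch joins all the bytes first and decodes once instead of decoding token by token.

```python
def decode_tokens(indices, idx_to_token):
    """Join the raw bytes of every token, then decode once as UTF-8."""
    raw = b"".join(idx_to_token[i] for i in indices)
    # errors="replace" avoids an exception if the sequence ends mid-character
    return raw.decode("utf-8", errors="replace")


# Hypothetical vocabulary: the Cyrillic letter "п" (bytes D0 BF) split over two tokens
idx_to_token = {1: b"\xd0", 2: b"\xbf", 3: b"r"}
text = decode_tokens([1, 2, 3], idx_to_token)
print(text)
```

Decoding each token on its own would fail here, since neither `b"\xd0"` nor `b"\xbf"` is valid UTF-8 by itself.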

ruvsv commented 4 years ago

Sorry for the late reply. The input is UTF-8 text. With torch-rnn everything works fine.

python3 train.py --input-h5 /root/data/e_kolt.h5 --input-json /root/data/e_kolt.json --device cuda
2019-08-28 03:37:52,564 - train - INFO - Creating model
longest token 4
0-Embedding 1-GRIDGRU 2-GRIDGRU 3-Linear
[Embedding(210, 128), GRIDGRU(), GRIDGRU(), Linear(in_features=128, out_features=210, bias=True)]
2019-08-28 03:37:52,572 - train - INFO - Created model with 448722 parameters
2019-08-28 03:37:52,572 - train - INFO - Loading data
2019-08-28 03:37:52,574 - dataloader - INFO - Loaded 76009 items from test
2019-08-28 03:37:52,574 - dataloader - INFO - Loaded 76009 items from val
2019-08-28 03:37:52,575 - dataloader - INFO - Loaded 608075 items from train
2019-08-28 03:37:52,577 - dataloader - INFO - No zeroes found in data, assuming one-based indexes
Traceback (most recent call last):
  File "train.py", line 70, in <module>
    double_seq_on = [int(x) for x in args.double_seq_on.split(',')]
  File "train.py", line 70, in <listcomp>
    double_seq_on = [int(x) for x in args.double_seq_on.split(',')]
ValueError: invalid literal for int() with base 10: ''

antihutka commented 4 years ago

This was caused by incorrect parsing of the default --double-seq-on value that I introduced in the last commit. It should be fixed now.
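The failure in the traceback can be reproduced in isolation: splitting an empty string on commas yields `['']`, and `int('')` raises `ValueError`. The sketch below shows one way to parse such an option safely. The helper name `parse_int_list` is hypothetical, and this is not necessarily the fix that was committed.

```python
def parse_int_list(value):
    """Parse a comma-separated string of integers; an empty string gives []."""
    # "".split(",") returns [""], so empty pieces are skipped before int()
    return [int(x) for x in value.split(",") if x.strip()]


# The root cause: an empty default still produces one (empty) element
assert "".split(",") == [""]

print(parse_int_list(""))     # []
print(parse_int_list("1,3"))  # [1, 3]
```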

antihutka / pytorch-rnn

UTF #2