Maluuba / nlg-eval

Evaluation code for various unsupervised automated metrics for Natural Language Generation.
http://arxiv.org/abs/1706.09799

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined> #55

Closed: aaxuluyao closed this issue 5 years ago

aaxuluyao commented 5 years ago

Hi, I've got a UnicodeDecodeError when running 'nlg-eval --setup':

Downloading http://nlp.stanford.edu/data/glove.6B.zip to nlgeval/data.
Downloading https://raw.githubusercontent.com/robmsmt/glove-gensim/dea5e55f449794567f12c79dc12b7f75339b18ba/glove2word2vec.py to nlgeval/word2vec.
Downloading http://www.cs.toronto.edu/~rkiros/models/dictionary.txt to nlgeval/data.
Downloading http://www.cs.toronto.edu/~rkiros/models/utable.npy to nlgeval/data.
glove2word2vec.py: 100%|██████████| 1.00/1.00 [00:00<?, ? chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/btable.npy to nlgeval/data.
dictionary.txt: 550 chunks [00:02, 208 chunks/s] | 0.00/823 [00:00<?, ? chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz to nlgeval/data.
glove.6B.zip: 100%|██████████| 823/823 [01:18<00:00, 10.5 chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/uni_skip.npz.pkl to nlgeval/data.
uni_skip.npz.pkl: 100%|██████████| 1.00/1.00 [00:00<?, ? chunks/s]
Downloading http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz to nlgeval/data.
bi_skip.npz: 100%|██████████| 276/276 [07:11<00:00, 1.56s/ chunks]
btable.npy:  17%|█▋        | 369/2.23k [08:30<50:15, 1.62s/ chunks]
Downloading http://www.cs.toronto.edu/~rkiros/models/bi_skip.npz.pkl to nlgeval/data.
bi_skip.npz.pkl: 100%|██████████| 1.00/1.00 [00:00<?, ? chunks/s]
Downloading https://raw.githubusercontent.com/moses-smt/mosesdecoder/b199e654df2a26ea58f234cbb642e89d9c1f269d/scripts/generic/multi-bleu.perl to nlgeval/multibleu.
multi-bleu.perl: 100%|██████████| 1.00/1.00 [00:00<00:00, 32.0 chunks/s]
uni_skip.npz: 100%|██████████| 634/634 [12:14<00:00, 1.16s/ chunks]
btable.npy: 100%|██████████| 2.23k/2.23k [37:41<00:00, 1.16 chunks/s]
utable.npy: 100%|██████████| 2.23k/2.23k [39:10<00:00, 1.05s/ chunks]
C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2019-01-21 18:06:09,764 : MainThread : INFO : 400000 lines with 300 dimensions
Traceback (most recent call last):
  File "nlg-eval.py", line 169, in <module>
    compute_metrics()
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 696, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 621, in make_context
    self.parse_args(ctx, args)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 880, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 1404, in handle_parse_result
    self.callback, ctx, self, value)
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\site-packages\click\core.py", line 78, in invoke_param_callback
    return callback(ctx, param, value)
  File "nlg-eval.py", line 141, in setup
    generate()
  File "C:\Users\XuluY\nlg-eval-master\nlgeval\word2vec\generate_w2v_files.py", line 26, in generate
    txt2bin(glove2word2vec(glove_vector_file, output_model_file))
  File "C:\Users\XuluY\nlg-eval-master\nlgeval\word2vec\glove2word2vec.py", line 57, in glove2word2vec
    model_file = prepend_line(glove_vector_file, output_model_file, gensim_first_line)
  File "C:\Users\XuluY\nlg-eval-master\nlgeval\word2vec\glove2word2vec.py", line 48, in prepend_line
    for line in old:
  File "C:\Users\XuluY\AppData\Local\Continuum\anaconda3\envs\tf-gpu\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>

Do you know how to fix this issue?

Thanks and best regards!

juharris commented 5 years ago

Looks like you're running this on Windows using cmd? I think it's a problem with the default encoding. We could look into changing that, but I think I have an easier solution.

I assume you're using Git so you should have Git Bash installed? (Search for Git Bash after pressing the Windows key or clicking on the Windows icon in your Taskbar). Try running the setup steps in Git Bash and it should work. LMK otherwise and I'd be happy to help troubleshoot more.
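
If you want to confirm the default-encoding theory, here is a quick check you can run from any Python prompt (nothing nlg-eval-specific): `locale.getpreferredencoding(False)` is what `open()` falls back to when no `encoding=` is passed.

```python
# Shows the codec open() uses for text files when encoding= is omitted.
# On Windows this is usually cp1252, which has no mapping for byte 0x9d,
# hence the UnicodeDecodeError while iterating over the GloVe file.
import locale

print(locale.getpreferredencoding(False))
```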

aaxuluyao commented 5 years ago

Thanks, Justin. I tried running it on both my laptop and my desktop with Git Bash but still got the same issue:

15002064@ASN-ACA6871 MINGW64 /c/Users/15002064/nlg-eval-master
$ python nlg-eval.py --setup
Downloading https://raw.githubusercontent.com/robmsmt/glove-gensim/dea5e55f449794567f12c79dc12b7f75339b18ba/glove2word2vec.py to nlgeval/word2vec.
glove2word2vec.py: 100%|██████████| 1.00/1.00 [00:00<?, ? chunks/s]
C:\Python36\lib\site-packages\gensim\utils.py:1197: UserWarning: detected Windows; aliasing chunkize to chunkize_serial
  warnings.warn("detected Windows; aliasing chunkize to chunkize_serial")
2019-01-22 16:40:46,792 : MainThread : INFO : 400000 lines with 300 dimensions
Traceback (most recent call last):
  File "nlg-eval.py", line 169, in <module>
    compute_metrics()
  File "C:\Python36\lib\site-packages\click\core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "C:\Python36\lib\site-packages\click\core.py", line 696, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "C:\Python36\lib\site-packages\click\core.py", line 621, in make_context
    self.parse_args(ctx, args)
  File "C:\Python36\lib\site-packages\click\core.py", line 880, in parse_args
    value, args = param.handle_parse_result(ctx, opts, args)
  File "C:\Python36\lib\site-packages\click\core.py", line 1404, in handle_parse_result
    self.callback, ctx, self, value)
  File "C:\Python36\lib\site-packages\click\core.py", line 78, in invoke_param_callback
    return callback(ctx, param, value)
  File "nlg-eval.py", line 141, in setup
    generate()
  File "C:\Users\15002064\nlg-eval-master\nlgeval\word2vec\generate_w2v_files.py", line 26, in generate
    txt2bin(glove2word2vec(glove_vector_file, output_model_file))
  File "C:\Users\15002064\nlg-eval-master\nlgeval\word2vec\glove2word2vec.py", line 57, in glove2word2vec
    model_file = prepend_line(glove_vector_file, output_model_file, gensim_first_line)
  File "C:\Users\15002064\nlg-eval-master\nlgeval\word2vec\glove2word2vec.py", line 48, in prepend_line
    for line in old:
  File "C:\Python36\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 6148: character maps to <undefined>

I was using:

python 3.6.5
click 3.6
nltk 3.3
numpy 1.14.5
scikit-learn 0.19.1
gensim 3.4.0
theano 1.0.2
scipy 1.1.0
six 1.12

Best regards

juharris commented 5 years ago

Sorry about that. I can confirm that I reproduce this in Git Bash. I swear I had it working before. Just for the record, can you share the output of locale?
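
(If locale isn't convenient, roughly the same information as seen from inside Python would also help; a small sketch, assuming the usual LANG/LC_* variables:)

```python
# Print the locale-related environment variables plus Python's view of the default locale.
import locale
import os

for name in ("LANG", "LC_ALL", "LC_CTYPE"):
    print(name, "=", os.environ.get(name))
print("default locale =", locale.getdefaultlocale())
```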

juharris commented 5 years ago

I just made a PR that works for me in Git Bash: #56. Feel free to try it out. Tests passed for me, with the usual warnings.
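
For anyone who wants to patch locally in the meantime, the general shape of the fix for this traceback is to read and write the GloVe text file with an explicit UTF-8 encoding instead of the platform default. A sketch of a prepend_line along those lines (not necessarily the exact change in the PR):

```python
import io


def prepend_line(infile, outfile, line):
    """Copy infile to outfile with `line` prepended, forcing UTF-8 so the
    Windows default codec (cp1252) never sees the GloVe file's raw bytes."""
    with io.open(infile, "r", encoding="utf-8") as old, \
            io.open(outfile, "w", encoding="utf-8") as new:
        new.write(line + "\n")
        for row in old:
            new.write(row)
    return outfile
```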