facebookresearch / KILT

Library for Knowledge Intensive Language Tasks
MIT License

unicode error when running get_triviaqa_input.py #47

Open kevglynn opened 3 years ago

kevglynn commented 3 years ago

I'm running into the following error when running get_triviaqa_input.py:

$ python get_triviaqa_input.py

  1. download TriviaQA original tar.gz file
     100%|##########| 2.67G/2.67G [06:28<00:00, 6.87MiB/s]
  2. extract tar.gz file
     Extracting qa/wikipedia-test-without-answers.json: 100%|##########| 3/3 [03:34<00:00, 71.58s/it]
  3. remove tar.gz file
  4. getting original questions
     qa/wikipedia-train.json
     Traceback (most recent call last):
       File "...KILT\scripts\get_triviaqa_input.py", line 88, in <module>
         data = json.load(fin)
       File "...\miniconda3\lib\json\__init__.py", line 293, in load
         return loads(fp.read(),
       File "...\miniconda3\lib\encodings\cp1252.py", line 23, in decode
         return codecs.charmap_decode(input,self.errors,decoding_table)[0]
     UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 2904: character maps to <undefined>

erip commented 3 years ago

On Windows, Python's default text encoding is typically cp1252, so the file is not being decoded as UTF-8. You'll need to either change your system encoding or update your local copy of KILT to open the file with encoding="utf-8".
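
For anyone hitting the same traceback, here is a minimal sketch of the second option. It is not the actual KILT code: the helper name `load_json_utf8` and the sample file are made up for illustration. The only change that matters is passing `encoding="utf-8"` to `open()` before calling `json.load`, so the platform default (cp1252 here) is never used.

```python
import json
import os
import tempfile


def load_json_utf8(path):
    """Load a JSON file, always decoding it as UTF-8.

    Hypothetical helper, not part of KILT. Without the explicit
    encoding argument, open() falls back to the locale encoding.
    """
    with open(path, "r", encoding="utf-8") as fin:
        return json.load(fin)


if __name__ == "__main__":
    # Build a small stand-in file. "\u00c1" encodes to the UTF-8 bytes
    # 0xC3 0x81, and 0x81 is a byte that cp1252 cannot decode.
    with tempfile.TemporaryDirectory() as tmp_dir:
        sample_path = os.path.join(tmp_dir, "sample.json")
        with open(sample_path, "wb") as fout:
            fout.write('{"Answer": "\u00c1vila"}'.encode("utf-8"))

        data = load_json_utf8(sample_path)
        print(sorted(data.keys()))
```

In the script itself, the equivalent change would presumably be on the `open(...)` call that produces `fin` before line 88, though I have not checked how that call is written.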