allenai / ir_datasets

Provides a common interface to many IR ranking datasets.

https://ir-datasets.com/

Apache License 2.0

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128) #151

Closed catqaq closed 2 years ago

catqaq commented 2 years ago

Describe the bug UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 936: ordinal not in range(128)

Affected dataset(s): 'msmarco-passage/train'

To Reproduce: Just run the official demo code:

import ir_datasets

if __name__ == "__main__":
    dataset = ir_datasets.load('msmarco-passage/train')

    # Documents
    for doc in dataset.docs_iter():
        print(doc)

Expected behavior: Normal output, with each document printed.

seanmacavaney commented 2 years ago

Thanks for reporting @catqaq. The code works for me, so I suspect it's a platform-specific problem. Can you confirm your operating system?

If you set the PYTHONUTF8=1 environment variable, do you no longer get the error?

If the above works, I think I would be able to fix it by specifying utf8 encoding in places where it's not currently specified (e.g., the TextIOWrappers here).
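The failure mode and the proposed fix can be reproduced in isolation. The sketch below is not code from ir_datasets: the `read_text` helper and the sample string are made up for illustration. It uses U+2019 (right single quotation mark) only because its UTF-8 encoding starts with byte 0xe2, the byte named in the traceback; the thread does not say which character actually triggered the error.

```python
import io

# Hypothetical sample text. U+2019 encodes to the bytes E2 80 99 in UTF-8,
# so the first non-ASCII byte is 0xe2, as in the reported traceback.
raw = "it\u2019s a passage".encode("utf-8")


def read_text(data: bytes, encoding: str) -> str:
    """Decode a byte stream through TextIOWrapper with an explicit encoding."""
    with io.TextIOWrapper(io.BytesIO(data), encoding=encoding) as stream:
        return stream.read()


# With an ASCII codec (what an ASCII locale would select by default),
# decoding fails on the 0xe2 byte.
try:
    read_text(raw, "ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

# With encoding='utf-8' passed explicitly, the locale no longer matters.
utf8_text = read_text(raw, "utf-8")

print(ascii_error)
print(ascii(utf8_text))  # ascii() keeps the output printable on any terminal
```

When `encoding` is omitted, `io.TextIOWrapper` falls back to the locale's preferred encoding, which is why the same code can work on one machine and fail on another.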

catqaq commented 2 years ago

Thanks! Here is my environment info (from "transformers-cli env"):

transformers version: 4.15.0
Platform: Linux-3.10.0-957.el7.x86_64-x86_64-with-debian-buster-sid
Python version: 3.6.13
PyTorch version (GPU?): 1.10.1+cu102 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

I tried setting the PYTHONUTF8=1 environment variable, following https://stackoverflow.com/questions/50933194/how-do-i-set-the-pythonutf8-environment-variable-to-enable-utf-8-encoding-by-def. But I still got the same error.

catqaq commented 2 years ago

I don't know why setting the PYTHONUTF8=1 environment variable did not work, but setting utf8 in TextIOWrapper works for me.

if self.stream is None:
    if isinstance(self.dlc, list):
        self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc[self.stream_idx].stream()), encoding='utf-8')
    else:
        self.stream = io.TextIOWrapper(self.ctxt.enter_context(self.dlc.stream()), encoding='utf-8')
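One possible explanation for PYTHONUTF8 having no effect, which the thread does not confirm: the variable belongs to Python's UTF-8 mode, which was added in Python 3.7 (PEP 540), while the environment reported above is Python 3.6.13. A minimal sketch for checking this from inside the interpreter follows; the `utf8_mode_status` helper is a hypothetical name.

```python
import sys


def utf8_mode_status() -> str:
    """Report whether this interpreter supports and has enabled UTF-8 mode."""
    # UTF-8 mode (PYTHONUTF8=1 or -X utf8) only exists on Python 3.7+.
    if sys.version_info < (3, 7):
        return "unsupported"
    # sys.flags.utf8_mode is 1 when the mode is active, 0 otherwise.
    return "enabled" if sys.flags.utf8_mode else "disabled"


print(sys.version_info[:3], utf8_mode_status())
```

On an interpreter older than 3.7 this reports "unsupported", in which case passing `encoding='utf-8'` explicitly, or fixing the locale, is the remaining option.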

seanmacavaney commented 2 years ago

Fascinating that PYTHONUTF8 doesn't work. Thanks for testing the encoding fix.

Out of curiosity, can you test:

import locale
print(locale.getpreferredencoding())
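A slightly broader version of this check can help when diagnosing similar reports. The sketch below gathers the encodings Python consults in different places; the `encoding_report` helper is a hypothetical name, not part of ir_datasets.

```python
import locale
import sys


def encoding_report() -> dict:
    """Collect the encodings Python uses for text I/O in this environment."""
    return {
        # Default for open() and io.TextIOWrapper when encoding is omitted.
        "preferred": locale.getpreferredencoding(False),
        # Used to encode and decode file names.
        "filesystem": sys.getfilesystemencoding(),
        # Used by print(); may be None if stdout has been replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        # Default for str.encode() and bytes.decode(); 'utf-8' on Python 3.
        "default": sys.getdefaultencoding(),
    }


for name, value in encoding_report().items():
    print(f"{name}: {value}")
```

The "preferred" entry is the one relevant to this issue, since it is what a `TextIOWrapper` without an explicit `encoding` argument falls back to.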

catqaq commented 2 years ago

Oh, I got ANSI.

So I tried to fix this by following https://stackoverflow.com/questions/44344458/why-does-locale-getpreferredencoding-return-ansi-x3-4-1968-instead-of-utf-8.

apt install locales-all
export LANG="en_US.UTF-8"

Then I got the utf8 encoding.
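The exact value printed is not preserved in this thread. Assuming it was ANSI_X3.4-1968, the name in the linked Stack Overflow question, the sketch below shows why that setting produces an 'ascii' codec error: Python resolves that locale name to its ASCII codec. The `canonical_codec` helper is a hypothetical name.

```python
import codecs


def canonical_codec(name: str) -> str:
    """Return the canonical Python codec name for an encoding label."""
    return codecs.lookup(name).name


# ANSI_X3.4-1968 is the formal name of ASCII, and Python treats it as an alias.
print(canonical_codec("ANSI_X3.4-1968"))
print(canonical_codec("UTF8"))
```

Running the locale check again after exporting a UTF-8 `LANG` (with that locale installed) should report a name that resolves to `utf-8` instead.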

seanmacavaney commented 2 years ago

Gotcha, thanks! I've opened a PR (#152) that should properly set the encoding everywhere. I'd like to test it a bit more before merging though, because it touches a lot of files. (It looks like in most situations it probably doesn't matter though, given the type of data stored in the files.)
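The parenthetical can be illustrated with a small sketch: files that contain only ASCII bytes decode identically under an ASCII locale and under UTF-8, so only files with non-ASCII text are affected. The sample lines below are made up for illustration and are not taken from any dataset; `decodes_same_as_utf8` is a hypothetical helper.

```python
def decodes_same_as_utf8(data: bytes, encoding: str = "ascii") -> bool:
    """Check whether data decodes to the same text as it does under UTF-8."""
    try:
        return data.decode(encoding) == data.decode("utf-8")
    except UnicodeDecodeError:
        # The alternative encoding cannot represent these bytes at all.
        return False


# Hypothetical numeric, qrels-style line: pure ASCII, so the locale is irrelevant.
qrels_line = b"1185869 0 0 1\n"
# Hypothetical passage text with a non-ASCII character.
passage = "caf\u00e9 menu".encode("utf-8")

print(decodes_same_as_utf8(qrels_line))
print(decodes_same_as_utf8(passage))
```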

catqaq commented 2 years ago

Okay. We've been suffering without an easy-to-use IR dataset interface for a long time, so thanks for your excellent work!

seanmacavaney commented 2 years ago

Thanks!