aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa0 in position 6503: invalid start byte #237

Open youradds opened 3 years ago

youradds commented 3 years ago

Hi,

I'm trying to work out why this doesn't want to play ball. I'm running with:

polyglot detect --input test.txt

This is the contents of test.txt:

test.txt

Looking at their site, I can see:

https://www.pd-architecture.co.uk/

• 

• doesn't seem to be a valid iso-8859-1 char, which I guess is why it's complaining. How do I get around it without it giving up? Currently it just barfs and doesn't give me any output, even though there is a ton of valid English in there.

Thanks

Andy
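For context, the error in the title can be reproduced in isolation. This is only a minimal sketch: the sample bytes are hypothetical (the attached test.txt is not shown here), and it assumes the file was saved in a single-byte encoding such as ISO-8859-1 rather than UTF-8.

```python
# Hypothetical sample bytes containing 0xa0; not the real contents of test.txt.
raw = b"caf\xa0bar"

try:
    raw.decode("utf-8")
    utf8_error = None
except UnicodeDecodeError as exc:
    utf8_error = exc

# In UTF-8, 0xa0 is a continuation byte, so it cannot begin a character.
print(utf8_error.reason)

# In ISO-8859-1 the same byte decodes to U+00A0 (no-break space).
latin1_text = raw.decode("iso-8859-1")
print(repr(latin1_text))
```

If that assumption holds, the byte is legal in ISO-8859-1 and the failure comes from the file being read as UTF-8.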

zufj commented 3 years ago

Hi there,

I had the same issue, but as a direct user of pycld2, on which polyglot is built.

My temporary workaround suggestion would be to filter out problematic characters. This range was quite successful for me:

lambda s: "".join(i for i in s if 31 < ord(i) < 1114112)

Hope this helps! And try to report it on the other project (same author luckily :))
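One way to apply the filter above to a file is to decode the bytes tolerantly first and then write a clean UTF-8 copy. The following is a rough sketch, not anything provided by polyglot or pycld2: `clean_text` and `write_clean_copy` are hypothetical helper names, and the `cp1252` fallback is an assumption about how the file was saved. Unlike the lambda above, this version keeps newlines and tabs. The upper bound is dropped because `ord()` never exceeds 1114111.

```python
from pathlib import Path


def clean_text(raw: bytes, fallback_encoding: str = "cp1252") -> str:
    """Decode bytes tolerantly, then drop control characters."""
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        # Assumption: input that is not UTF-8 is a Western single-byte encoding.
        # Bytes undefined in that encoding become U+FFFD instead of raising.
        text = raw.decode(fallback_encoding, errors="replace")
    # Drop code points 0-31, except newline and tab.
    return "".join(ch for ch in text if ord(ch) > 31 or ch in "\n\t")


def write_clean_copy(src: str, dst: str) -> None:
    """Write a UTF-8 copy of src with problematic characters removed."""
    cleaned = clean_text(Path(src).read_bytes())
    Path(dst).write_bytes(cleaned.encode("utf-8"))
```

For example, after `write_clean_copy("test.txt", "test_clean.txt")` the original command could be pointed at the copy: polyglot detect --input test_clean.txt. Whether this is enough for every input is not confirmed here; it only removes the decode failure and the control characters.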