aboSamoor / polyglot

Multilingual text (NLP) processing toolkit
http://polyglot-nlp.com

Error on language detection for some unicode characters (control characters) #71

Open alexgarel opened 8 years ago

alexgarel commented 8 years ago
>>> from polyglot.text import Sentence
>>> Sentence("try \x96 it").words
...
/usr/local/lib/python3.5/dist-packages/polyglot/detect/base.py in detect(self, text)
---> 84     reliable, index, top_3_choices = cld2.detect(t, bestEffort=False)

error: input contains invalid UTF-8 around byte 4 (of 9)

Running polyglot 16.07.04 on Ubuntu 16.04.

alexgarel commented 8 years ago

To find them all:

>>> bads = set()
>>> for i in range(10000):
...     try:
...         Sentence("try %s it" % chr(i)).words
...     except:
...         bads.add(i)
>>> ", ".join(chr(i) for i in sorted(list(bads)))
'\x00, \x01, \x02, \x03, \x04, \x05, \x06, \x07, \x08, \x0b, \x0e, \x0f, \x10, \x11, \x12, \x13, \x14, \x15, \x16, \x17, \x18, \x19, \x1a, \x1b, \x1c, \x1d, \x1e, \x1f, \x7f, \x80, \x81, \x82, \x83, \x84, \x85, \x86, \x87, \x88, \x89, \x8a, \x8b, \x8c, \x8d, \x8e, \x8f, \x90, \x91, \x92, \x93, \x94, \x95, \x96, \x97, \x98, \x99, \x9a, \x9b, \x9c, \x9d, \x9e, \x9f'

My suggestion would be either to fix cld2 (if that is possible) or simply to strip those characters from the sentence before submitting it for language detection.

tindzk commented 7 years ago

To bypass cld2, you can also instantiate Text with the hint_language_code parameter.
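
A minimal sketch of that suggestion (the text and the "en" hint are just examples; note that a later comment in this thread reports the error can still occur even with a hint):

from polyglot.text import Text

# Supply a language hint so polyglot does not need to run cld2-based detection.
# "en" is only an example hint; use the language your text is actually in.
text = Text("try \x96 it", hint_language_code="en")
print(text.words)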

motazsaad commented 7 years ago

Hello,

How can I catch this error (error: input contains invalid UTF-8 around byte ...)?

Which exception should I catch?

Thanks
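
For reference, a hedged sketch of one way to catch it: assuming polyglot is backed by pycld2 (as discussed further down in this thread), the exception that propagates up is pycld2.error, so something like the following should work (a broad except also works, as in the snippets above):

import pycld2
from polyglot.text import Sentence

try:
    words = Sentence("try \x96 it").words
except pycld2.error as exc:  # raised on invalid UTF-8 input
    print("detection failed:", exc)
    words = []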

mamoit commented 7 years ago

Correct me if I'm wrong, but this still seems to be a problem.

alexgarel commented 7 years ago

For the moment, on my side, I simply filter out the bad characters before submission…
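
A minimal sketch of that kind of pre-filtering, based on the character list found above (the exact filter used isn't shown in the comment; this one just strips the listed C0/C1 control characters):

import re

from polyglot.text import Sentence

# C0 controls (minus tab, newline, form feed, carriage return), DEL, and the C1 range,
# i.e. exactly the characters listed in the earlier comment.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0e-\x1f\x7f-\x9f]")

def strip_control_chars(text):
    return CONTROL_CHARS.sub("", text)

Sentence(strip_control_chars("try \x96 it")).words  # no longer raises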

jamesdbaker commented 6 years ago

I believe this is the underlying issue in cld2: https://github.com/mikemccand/chromium-compact-language-detector/issues/22

jamesdbaker commented 6 years ago

I've used the following command to remove control characters from my dataset, using the list of characters provided by @alexgarel above.

sed 's/[\00\01\02\03\04\05\06\07\08\0b\0e\0f\10\11\12\13\14\15\16\17\18\19\1a\1b\1c\1d\1e\1f\7f\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f]//' input.txt > output.txt

Posting it here in case it's useful for anyone else hitting this problem, but I'm not convinced that the list of characters above is complete as I still have issues on some files.

vldbnc commented 6 years ago

@jamesdbaker you might want to add 'g' switch for multiple substitution.

sed 's/[\00\01\02\03\04\05\06\07\08\0b\0e\0f\10\11\12\13\14\15\16\17\18\19\1a\1b\1c\1d\1e\1f\7f\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f]//g' input.txt > output.txt

sjlongland commented 5 years ago

@andreoua provided a nice, succinct workaround to this pycld2 issue (see @jamesdbaker's link), which works in Python 3.6…

printable_str = ''.join(x for x in html_str if x.isprintable())

This won't work for Python 2.7 users, but for those of us who have moved forward, there's an easy workaround.

zafercavdar commented 5 years ago

To bypass cld2, you can also instantiate Text with the hint_language_code parameter.

I actually did give hint_language_code, but I'm still receiving the same error.

ddelange commented 4 years ago

It's actually only the Cc, Cs and Cn unicode categories that throw this error as far as I can tell. Using regex to remove them as suggested here should do the trick.

import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'

I brute-forced each unicode character through polyglot on py38:

Brute-force script:

```py
import sys
import unicodedata
from collections import defaultdict

unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)

all_categories = [
    "Cc",  # Control 65
    "Cf",  # Format 161
    "Co",  # Private Use 0
    "Cs",  # Surrogate 0
    "Ll",  # Lowercase Letter 2,151
    "Lm",  # Modifier Letter 259
    "Lo",  # Other Letter 121,414
    "Lt",  # Titlecase Letter 31
    "Lu",  # Uppercase Letter 1,788
    "Mc",  # Spacing Mark 429
    "Me",  # Enclosing Mark 13
    "Mn",  # Nonspacing Mark 1,826
    "Nd",  # Decimal Number 630
    "Nl",  # Letter Number 236
    "No",  # Other Number 888
    "Pc",  # Connector Punctuation 10
    "Pd",  # Dash Punctuation 24
    "Pe",  # Close Punctuation 73
    "Pf",  # Final Punctuation 10
    "Pi",  # Initial Punctuation 12
    "Po",  # Other Punctuation 588
    "Ps",  # Open Punctuation 75
    "Sc",  # Currency Symbol 62
    "Sk",  # Modifier Symbol 121
    "Sm",  # Math Symbol 948
    "So",  # Other Symbol 6,160
    "Zl",  # Line Separator 1
    "Zp",  # Paragraph Separator 1
    "Zs",  # Space Separator 17
]

from polyglot.text import Text

error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except:
            error_cats.add(cat)

# all categories that errored
print(error_cats)
```

ayush-8 commented 3 years ago

To find them all:

>>> bads = set()
>>> for i in range(10000):
...     try:
...         Sentence("try %s it" % chr(i)).words
...     except:
...         bads.add(i)
>>> ", ".join(chr(i) for i in sorted(list(bads)))
'\x00, \x01, \x02, \x03, \x04, \x05, \x06, \x07, \x08, \x0b, \x0e, \x0f, \x10, \x11, \x12, \x13, \x14, \x15, \x16, \x17, \x18, \x19, \x1a, \x1b, \x1c, \x1d, \x1e, \x1f, \x7f, \x80, \x81, \x82, \x83, \x84, \x85, \x86, \x87, \x88, \x89, \x8a, \x8b, \x8c, \x8d, \x8e, \x8f, \x90, \x91, \x92, \x93, \x94, \x95, \x96, \x97, \x98, \x99, \x9a, \x9b, \x9c, \x9d, \x9e, \x9f'

My proposition would be either to fix cld2 (is it possible) or to just remove those characters from the sentence before submission for detection.

Can you let me know how to remove all these characters in a single pass? (I have a large text file of 20 GB.)

ddelange commented 3 years ago

@lucifer-it

pip install regex

and then use the remove_bad_chars function from the snippet above (it does a one-pass replacement).

If the file is too large for your RAM, you can write out a new file in chunks, removing bad characters one chunk at a time in a while loop, e.g. like https://stackoverflow.com/a/61394102/5511061.
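
A minimal sketch of that chunked approach, assuming the RE_BAD_CHARS pattern from the earlier snippet and hypothetical input.txt / output.txt file names:

import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")
CHUNK_SIZE = 1 << 20  # read roughly a million characters at a time

with open("input.txt", encoding="utf-8") as src, open("output.txt", "w", encoding="utf-8") as dst:
    while True:
        chunk = src.read(CHUNK_SIZE)  # text mode, so chunks never split a character
        if not chunk:
            break
        dst.write(RE_BAD_CHARS.sub("", chunk))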

ned2 commented 2 years ago

I think the cld2 links in this thread may be pointing to the wrong project. Polyglot depends on pycld2, rather than this older port of the cld2 library. Since the former is a fork of the latter, I'll hazard a guess that the bug exists in both projects.

This is the relevant issue in the pycld2 project.

btw @ddelange, your brute-force identification of the specific characters that are the root cause is just beautiful. Chef's kiss.

christopher-siewert commented 3 weeks ago

I found that I also had to regex-sub the "Non-character" (Cn) unicode category to avoid pycld2 errors.

import sys
import unicodedata

import pycld2

error_categories = set()
for c in map(chr, range(sys.maxunicode + 1)):
    try:
        pycld2.detect(c, returnVectors=True)
    except:
        error_categories.add(unicodedata.category(c))
print(error_categories)
# {'Cs', 'Cn', 'Cc'}

I adjusted the solution by @ddelange to use:

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

ddelange commented 3 weeks ago

Thanks @christopher-siewert, I've updated my solution accordingly.