alexgarel opened this issue 8 years ago
To find them all:
>>> from polyglot.text import Sentence
>>> bads = set()
>>> for i in range(10000):
...     try:
...         Sentence("try %s it" % chr(i)).words
...     except:
...         bads.add(i)
>>> ", ".join(chr(i) for i in sorted(list(bads)))
'\x00, \x01, \x02, \x03, \x04, \x05, \x06, \x07, \x08, \x0b, \x0e, \x0f, \x10, \x11, \x12, \x13, \x14, \x15, \x16, \x17, \x18, \x19, \x1a, \x1b, \x1c, \x1d, \x1e, \x1f, \x7f, \x80, \x81, \x82, \x83, \x84, \x85, \x86, \x87, \x88, \x89, \x8a, \x8b, \x8c, \x8d, \x8e, \x8f, \x90, \x91, \x92, \x93, \x94, \x95, \x96, \x97, \x98, \x99, \x9a, \x9b, \x9c, \x9d, \x9e, \x9f'
My suggestion would be either to fix cld2 (if that is possible) or simply to remove those characters from the sentence before submitting it for detection.
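For the second option, a minimal sketch (my addition, not part of the original report) that deletes exactly the codepoints listed above with str.translate before the text is handed to polyglot:

BAD_CODEPOINTS = (
    list(range(0x00, 0x09)) + [0x0B] + list(range(0x0E, 0x20)) + list(range(0x7F, 0xA0))
)
# Mapping a codepoint to None makes str.translate() delete it
STRIP_BAD = dict.fromkeys(BAD_CODEPOINTS)

def strip_bad_chars(text):
    return text.translate(STRIP_BAD)

strip_bad_chars("try \x96 it")  # 'try  it'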
To bypass cld2, you can also instantiate Text with the hint_language_code parameter.
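For example (a sketch, assuming a polyglot version whose Text constructor accepts this keyword, as the comment above indicates):

from polyglot.text import Text

# Passing a hint is meant to skip cld2-based language detection;
# whether it avoids the error in every code path is discussed further down.
text = Text("some input text", hint_language_code="en")
print(text.words)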
Hello,
How do I catch this error: error: input contains invalid UTF-8 around byte ...?
Which exception should I catch?
Thanks
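One way to guard against it (a sketch on my part, assuming the exception originates from pycld2, which polyglot calls under the hood):

import pycld2
from polyglot.text import Sentence

def words_or_none(raw):
    # pycld2 raises pycld2.error for "input contains invalid UTF-8 around byte ..."
    try:
        return Sentence(raw).words
    except pycld2.error:
        return None

If you are not sure which library raises it, catching the broader Exception also works, at the cost of masking unrelated failures.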
Correct me if I'm wrong, but this still seems to be a problem.
For the moment on my side, I simply filter out bad characters before submission…
I believe this is the underlying issue in cld2: https://github.com/mikemccand/chromium-compact-language-detector/issues/22
I've used the following command to remove control characters from my dataset, using the list of characters provided by @alexgarel above.
sed 's/[\00\01\02\03\04\05\06\07\08\0b\0e\0f\10\11\12\13\14\15\16\17\18\19\1a\1b\1c\1d\1e\1f\7f\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f]//' input.txt > output.txt
Posting it here in case it's useful for anyone else hitting this problem, but I'm not convinced that the list of characters above is complete as I still have issues on some files.
@jamesdbaker you might want to add the 'g' flag for multiple substitutions:
's/[\00\01\02\03\04\05\06\07\08\0b\0e\0f\10\11\12\13\14\15\16\17\18\19\1a\1b\1c\1d\1e\1f\7f\80\81\82\83\84\85\86\87\88\89\8a\8b\8c\8d\8e\8f\90\91\92\93\94\95\96\97\98\99\9a\9b\9c\9d\9e\9f]//g' input.txt > output.txt
@andreoua provided a nice, succinct workaround to this pycld2 issue (see @jamesdbaker's link) which works in Python 3.6…
printable_str = ''.join(x for x in html_str if x.isprintable())
This won't work for Python 2.7 users, but for those of us who have moved forward, there's an easy workaround.
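One caveat worth adding (my note, not from the original comment): str.isprintable() is also False for '\n' and '\t', so that one-liner strips line breaks and tabs too. If you need to keep them, whitelist them explicitly:

printable_str = ''.join(x for x in html_str if x.isprintable() or x in '\n\t')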
To bypass cld2, you can also instantiate Text with the hint_language_code parameter.
I actually did pass hint_language_code but I'm still receiving the same error.
It's actually only the Cc, Cs and Cn Unicode categories that throw this error, as far as I can tell. Using regex to remove them as suggested here should do the trick.
import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'
I brute-forced each unicode character through polyglot on py38:
To find them all:
>>> from polyglot.text import Sentence
>>> bads = set()
>>> for i in range(10000):
...     try:
...         Sentence("try %s it" % chr(i)).words
...     except:
...         bads.add(i)
>>> ", ".join(chr(i) for i in sorted(list(bads)))
'\x00, \x01, \x02, \x03, \x04, \x05, \x06, \x07, \x08, \x0b, \x0e, \x0f, \x10, \x11, \x12, \x13, \x14, \x15, \x16, \x17, \x18, \x19, \x1a, \x1b, \x1c, \x1d, \x1e, \x1f, \x7f, \x80, \x81, \x82, \x83, \x84, \x85, \x86, \x87, \x88, \x89, \x8a, \x8b, \x8c, \x8d, \x8e, \x8f, \x90, \x91, \x92, \x93, \x94, \x95, \x96, \x97, \x98, \x99, \x9a, \x9b, \x9c, \x9d, \x9e, \x9f'
My suggestion would be either to fix cld2 (if that is possible) or simply to remove those characters from the sentence before submitting it for detection.
Can you let me know how to remove all these chars in a single go? (I have a large text file of 20 GB.)
@lucifer-it pip install regex and then use remove_bad_chars from the snippet above (it does a one-pass replacement). If the file is too large for your RAM, you can write out a new file in chunks, removing bad characters one chunk at a time in a while loop, e.g. like https://stackoverflow.com/a/61394102/5511061
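A minimal sketch of that chunked approach (my own, assuming plain UTF-8 text files and reusing the RE_BAD_CHARS pattern from the snippet above; note that \p{Cc} also matches '\n' and '\t', so line breaks get stripped as well; the file paths are placeholders):

import regex

RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")

def clean_large_file(src_path, dst_path, chunk_size=1 << 20):
    # Stream the file so the whole 20 GB never sits in memory at once;
    # reading in text mode yields whole characters, so a chunk boundary
    # cannot split a multi-byte UTF-8 sequence.
    with open(src_path, encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        while True:
            chunk = src.read(chunk_size)  # chunk_size characters, not bytes
            if not chunk:
                break
            dst.write(RE_BAD_CHARS.sub("", chunk))

# clean_large_file("input.txt", "output.txt")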
I think the cld2 links in this thread may be pointing to the wrong project. Polyglot depends on pycld2, rather than this older port of the cld2 library. Since the former is a fork of the latter, I'm gonna hazard a guess that the bug exists in both projects.
This is the relevant issue in the pycld2 project.
btw @ddelange your brute-force identification of the specific characters that are the root cause is just beautiful :chefs kiss:
I found that I had to regex-sub "Non-character" Cn Unicode characters to avoid pycld2 errors as well.
import sys
import unicodedata

import pycld2

error_categories = set()
for c in map(chr, range(sys.maxunicode + 1)):
    try:
        pycld2.detect(c, returnVectors=True)
    except Exception:
        error_categories.add(unicodedata.category(c))

print(error_categories)
# {'Cs', 'Cn', 'Cc'}
I adjusted the solution by @ddelange to use:
RE_BAD_CHARS = regex.compile(r"[\p{Cc}\p{Cs}\p{Cn}]+")
Running polyglot 16.07.04 on Ubuntu 16.04.