GregBowyer / cld2-cffi

Python bindings to the Compact Language Detector
Apache License 2.0

Fix bytes count on UTF-8 error #19

Closed pquentin closed 6 years ago

pquentin commented 6 years ago

cld2 accepts only a subset of UTF-8, called "Interchange valid UTF-8" (https://tools.ietf.org/html/rfc5198), which excludes control characters. (We hit this error a lot in production.)

When such a character is found, cld2 does not set bytes_found, so this PR falls back to the number of bytes instead, as is done in DetectLanguageCheckUTF8.
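
To illustrate the fallback, here is a minimal Python sketch. It is not the code from this PR. The helper names `has_control_characters` and `effective_bytes_found` are hypothetical, the control-character check is only a rough approximation of what cld2 rejects, and it assumes "the number of bytes" means the length of the encoded input, which the thread does not spell out.

```python
def has_control_characters(text):
    """Rough approximation (assumption): flag C0 control characters
    other than tab, LF and CR, plus DEL. cld2's real rule may differ."""
    allowed = {"\t", "\n", "\r"}
    return any(
        (ord(ch) < 0x20 and ch not in allowed) or ord(ch) == 0x7F
        for ch in text
    )


def effective_bytes_found(utf8_bytes, is_valid, bytes_found):
    """Return a usable byte count for the detection result.

    Hypothetical helper: when the input was rejected as invalid,
    bytes_found cannot be trusted, so fall back to the input length.
    """
    if not is_valid:
        return len(utf8_bytes)
    return bytes_found


# Example: a NUL byte makes the input fail the approximate check.
text = "h\u00e9llo\x00"
data = text.encode("utf-8")
count = effective_bytes_found(data, not has_control_characters(text), 0)
```

With this input, the fallback reports the full encoded length rather than an unset value.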

pquentin commented 6 years ago

This PR depends on #20

GregBowyer commented 6 years ago

Seems reasonable, let's pull this one in.