Open rspeer opened 7 years ago
Volapük (a constructed language that predates Esperanto) also seems to be over-detected.
>>> pycld2.detect("I'm at Fayetteville Free Library (Fayetteville, NY)")
(True,
50,
(('VOLAPUK', 'vo', 98, 710.0),
('Unknown', 'un', 0, 0.0),
('Unknown', 'un', 0, 0.0)))
>>> pycld2.detect("rt mi tl se basa en el evento mayusculas el evento")
(True,
52,
(('VOLAPUK', 'vo', 98, 401.0),
('Unknown', 'un', 0, 0.0),
('Unknown', 'un', 0, 0.0)))
@rspeer I would love to be able to help out with this, but since this package is really just Python bindings, this issue may be better filed against the upstream cld2 C++ library repository.
Any changes made there can be pulled in here.
This isn't something upstream can address. Upstream already ships both the data tables for the 80+ languages battle-tested in Chrome and the "full version" tables that include silly languages.
These tables are in different files (see https://github.com/CLD2Owners/cld2/wiki/CLD2-Full-Version for a list) and it is the choice made here in the Python bindings which files to compile in.
I think @rspeer has a point, and the default, more conservative files should be used instead; it is quite literally a case of less is more.
It looks like the stock cld2 library has heuristics for detecting rare, silly languages such as Klingon and Pig Latin that are disabled by default. pycld2 enables all of them, and I have reason to believe that the heuristics for the silly ones are not very good, presumably due to lack of training data.
Here are some texts (found on Twitter) that should have been detected as a different language or 'unknown':
These two languages seem to be the worst offenders. Fortunately, 'Bork Bork Bork', 'Elmer Fudd', and 'Hacker' seem not to be detected in the wild.
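Until the bindings change which tables they compile in, a caller could work around the over-detection by discarding results for languages it does not trust. The following is only a sketch, not part of pycld2: `filter_detection` and `DISTRUSTED_CODES` are hypothetical names, the function operates on a result tuple in the shape that `pycld2.detect` prints in this thread, and the only code it distrusts by default is `'vo'` (Volapük), taken from that output. Codes for the other languages mentioned here would have to be looked up and added by the caller.

```python
# Hypothetical helper, not part of pycld2.
# 'vo' (Volapük) is taken from the detect() output quoted in this thread;
# add further codes yourself after checking what your pycld2 build reports.
DISTRUSTED_CODES = frozenset({"vo"})

UNKNOWN = ("Unknown", "un", 0, 0.0)


def filter_detection(result, distrusted=DISTRUSTED_CODES):
    """Drop distrusted languages from a (is_reliable, bytes_found, details) tuple.

    `details` is a tuple of (name, code, percent, score) entries, as printed
    by pycld2.detect() in the examples above. If nothing trustworthy is left,
    the result is reported as unreliable and unknown.
    """
    is_reliable, bytes_found, details = result
    kept = tuple(
        entry for entry in details
        if entry[1] != "un" and entry[1] not in distrusted
    )
    if not kept:
        return (False, bytes_found, (UNKNOWN,))
    return (is_reliable, bytes_found, kept)


if __name__ == "__main__":
    # Result copied from the first Volapük example in this thread.
    raw = (
        True,
        50,
        (
            ("VOLAPUK", "vo", 98, 710.0),
            ("Unknown", "un", 0, 0.0),
            ("Unknown", "un", 0, 0.0),
        ),
    )
    print(filter_detection(raw))
```

In real use the input would come from `pycld2.detect(text)`. Filtering after the fact does not recover the correct language; it only turns a confident wrong answer into "unknown", which is what the original report asks for.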