aboSamoor / pycld2

Apache License 2.0

Silly languages are detected too often. (It's never Klingon.) #6

Open rspeer opened 7 years ago

rspeer commented 7 years ago

It looks like the stock cld2 library has heuristics for detecting rare, silly languages such as Klingon and Pig Latin that are disabled by default. pycld2 enables all of them, and I have reason to believe that the heuristics for the silly ones are not very good, presumably due to lack of training data.

Here are some texts (found on Twitter) that should have been detected as a different language or 'unknown':

>>> pycld2.detect("I reallyreallyreallyreally want to see les mis ob broadway. reallyreally. My dad's on board. My mom is vehemently against it.")

(True,
 123,
 (('X_PIG_LATIN', 'zzp', 50, 561.0),
  ('ENGLISH', 'en', 49, 1314.0),
  ('Unknown', 'un', 0, 0.0)))

>>> pycld2.detect("시카야 생일축하해 ❤ #alwayswithjessicajung #happyjessicaday #iceprincessday")

(True,
 81,
 (('X_PIG_LATIN', 'zzp', 65, 656.0),
  ('Korean', 'ko', 32, 3780.0),
  ('Unknown', 'un', 0, 0.0)))

>>> pycld2.detect('"urghhoh hoh hhhughrgh argh argh argh, arrrrugh haugh hargh? urrugh arrrrugh" - tim allen')

(True,
 85,
 (('X_KLINGON', 'tlh', 98, 524.0),
  ('Unknown', 'un', 0, 0.0),
  ('Unknown', 'un', 0, 0.0)))

>>> pycld2.detect("dicionário de símbolos é o meu segundo dicionário favorito")

(True,
 64,
 (('X_KLINGON', 'tlh', 98, 520.0),
  ('Unknown', 'un', 0, 0.0),
  ('Unknown', 'un', 0, 0.0)))

These two languages seem to be the worst offenders. Fortunately, 'Bork Bork Bork', 'Elmer Fudd', and 'Hacker' seem not to be detected in the wild.
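
As a stopgap on the caller's side, the result can be post-filtered. The following is only a sketch: `best_real_language` and `SILLY_CODES` are hypothetical names, not part of pycld2, and the code assumes the result has the `(is_reliable, bytes_found, details)` shape shown in the outputs above. The two codes in the set are the ones that appear in those outputs; the set would need extending for any other language that proves unreliable.

```python
# Hypothetical blocklist: codes copied from the outputs above.
SILLY_CODES = frozenset({"zzp", "tlh"})


def best_real_language(result, silly_codes=SILLY_CODES):
    """Return the first detected language code that is not blocklisted.

    `result` is assumed to look like the tuples shown above:
    (is_reliable, bytes_found, details), where each entry of `details`
    is (name, code, percent, score). Returns 'un' if nothing else is left.
    """
    _is_reliable, _bytes_found, details = result
    for _name, code, _percent, _score in details:
        # Skip the placeholder entries and anything on the blocklist.
        if code != "un" and code not in silly_codes:
            return code
    return "un"


# Usage with the first result above (hard-coded so pycld2 is not required):
example = (
    True,
    123,
    (
        ("X_PIG_LATIN", "zzp", 50, 561.0),
        ("ENGLISH", "en", 49, 1314.0),
        ("Unknown", "un", 0, 0.0),
    ),
)
print(best_real_language(example))
```

This ignores the reliability flag and the percentages entirely, and it cannot recover a language that was never reported (the Klingon examples would simply come back as 'un'), so it only papers over the problem.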

rspeer commented 7 years ago

Volapük (a constructed language that predates Esperanto) also seems to be over-detected.

>>> pycld2.detect("I'm at Fayetteville Free Library (Fayetteville, NY)")

(True,
 50,
 (('VOLAPUK', 'vo', 98, 710.0),
  ('Unknown', 'un', 0, 0.0),
  ('Unknown', 'un', 0, 0.0)))

>>> pycld2.detect("rt mi tl se basa en el evento mayusculas el evento")

(True,
 52,
 (('VOLAPUK', 'vo', 98, 401.0),
  ('Unknown', 'un', 0, 0.0),
  ('Unknown', 'un', 0, 0.0)))

bsolomon1124 commented 5 years ago

@rspeer I would love to be able to help out with this, but since this package is really just Python bindings, this may be better placed at the actual cld2 C++ lib repository.

Any changes made there can be pulled in here.

vslavik commented 4 years ago

This isn't something upstream can address. Upstream already includes both data tables for the 80+ languages battle-tested in Chrome, and "full version" tables that include silly languages.

These tables are in different files (see https://github.com/CLD2Owners/cld2/wiki/CLD2-Full-Version for a list) and it is the choice made here in the Python bindings which files to compile in.

I think @rspeer has a point and the default, more conservative files should be used instead; it is quite literally a case of less is more.