fntlnz / cld2-php-ext

:uk: :it: :de: This extension wraps CLD2 (Compact Language Detector 2) that detects over 80 languages in Unicode UTF-8 text.
Apache License 2.0

Cld2 consistency #2

Open mfpierre opened 8 years ago

mfpierre commented 8 years ago

Hi,

I've been using the library and noticed some odd behavior compared with other language bindings, for example https://github.com/dachev/node-cld. I'm trying to understand why it behaves differently, since both are supposed to wrap the same library.

Here are a few examples, with no hints used in either case:

Bjr on et 2 personne vou pouver nou recupere a bezon ou pa meci

node-cld:

{ reliable: true,
  textBytes: 63,
  languages: [ { name: 'FRENCH', code: 'fr', percent: 98, score: 379 } ],
  chunks: [] }

cld2-php-ext:

array(5) {
  'language_id' =>
  int(139)
  'language_code' =>
  string(2) "ht"
  'language_name' =>
  string(14) "HAITIAN_CREOLE"
  'language_probability' =>
  int(98)
  'is_reliable' =>
  bool(true)
}

Sin ningun problema pablo

node-cld:

{ reliable: true,
  textBytes: 27,
  languages: [ { name: 'SPANISH', code: 'es', percent: 96, score: 512 } ],
  chunks: [ { name: 'SPANISH', code: 'es', offset: 0, bytes: 25 } ] }

cld2-php-ext:

array(5) {
  'language_id' =>
  int(71)
  'language_code' =>
  string(2) "su"
  'language_name' =>
  string(9) "SUNDANESE"
  'language_probability' =>
  int(96)
  'is_reliable' =>
  bool(true)
}

In both cases the result from the node-cld binding is much more plausible. Any clues?

Thanks

fntlnz commented 8 years ago

Hi @mfpierre, many thanks for reporting. I'm trying to figure out what is causing this weird behaviour.

I noticed that compiling the extension against different library versions (actually different commits, since CLD2 doesn't follow anything like semantic versioning) gives different results. In particular, building against this commit (https://github.com/CLD2Owners/cld2/commit/7f791121dc058422ac6bb945b7e655e4ce24f473#diff-ddc9a7436cecaa2e0994d8551ab0a9e5) gives me the right result for Spanish.

After a few attempts I decided to use the exact same cld2 source files that the Node library uses, and I get the same results you reported.

I think this could be an encoding-related problem. The Node binding creates a UTF-8 string before doing anything else. The libcld2 version it uses also has a new detection function, ExtDetectLanguageSummaryCheckUTF8, that skips non-UTF-8 input and returns UNKNOWN_LANGUAGE in that case.
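To rule encoding in or out on the calling side, the input can be checked for UTF-8 validity before it reaches the detector. The following is a minimal Python sketch of that idea, not CLD2's implementation; the helper name `valid_utf8_prefix_len` is hypothetical and simply reports how many leading bytes decode cleanly.

```python
def valid_utf8_prefix_len(data: bytes) -> int:
    """Return the number of leading bytes of `data` that are valid UTF-8.

    Hypothetical helper for illustration only; it relies on Python's strict
    UTF-8 decoder rather than on anything from CLD2.
    """
    try:
        data.decode("utf-8")
        return len(data)  # the whole buffer is valid
    except UnicodeDecodeError as exc:
        return exc.start  # index of the first byte that failed to decode


# One of the sample sentences from this issue, encoded as UTF-8.
sample = "Sin ningun problema pablo".encode("utf-8")
is_ok = valid_utf8_prefix_len(sample) == len(sample)
print("Is UTF-8 ok?", is_ok)
```

If the returned length is shorter than the buffer, the input contains bytes a strict UTF-8 decoder rejects, which would be worth fixing before comparing detector output.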

fntlnz commented 8 years ago

It seems that my hypothesis about UTF-8 is not correct.

I used the ExtDetectLanguageSummaryCheckUTF8 function and the input is valid UTF-8:

Is UTF-8 ok? true 
array(5) {
  ["language_id"]=>
  int(139)
  ["language_code"]=>
  string(2) "ht"
  ["language_name"]=>
  string(14) "HAITIAN_CREOLE"
  ["language_probability"]=>
  int(98)
  ["is_reliable"]=>
  bool(true)
}

mfpierre commented 8 years ago

Thanks for the feedback @fntlnz. Do you have any clues about where it could come from, then?

mentalstring commented 3 years ago

I'm experiencing the same problem. Using that particular cld2 commit only improved it partially for me.

After running it against a large set of texts, I'm seeing too many wrong detections for this to be reliable enough for production. Running the same texts through node-cld doesn't show the problems reported here.

Not sure if this is of interest: some of my input texts use different languages in different parts and, oddly, the language reported by cld2-php-ext often seems to be what could be the second-best guess (e.g., the second most used language), not the main one. In other words, if a text is 90% language A and 10% language B, cld2-php-ext often reports it as B. Heck, I've seen large English texts containing a single foreign word reported as the foreign language. I know it's odd, but I've bumped into several of these today. Having the chunks information that node-cld exposes would be handy for understanding this better.

Happy to provide some samples if useful, but what is reported in this issue pretty much sums it up.
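As a rough illustration of how chunk information could help with mixed-language texts, the Python sketch below sums chunk sizes per language to show which language dominates. It assumes chunks shaped like the node-cld output quoted earlier in this issue (`name`, `code`, `offset`, `bytes`); the function name `language_shares` and the chunk values are made up for the example and are not real detector output.

```python
from collections import defaultdict


def language_shares(chunks):
    """Sum chunk byte counts per language code, largest share first.

    Hypothetical helper: `chunks` is a list of dicts with at least
    the keys "code" and "bytes".
    """
    totals = defaultdict(int)
    for chunk in chunks:
        totals[chunk["code"]] += chunk["bytes"]
    grand_total = sum(totals.values())
    if grand_total == 0:
        return []
    shares = [(code, 100.0 * n / grand_total) for code, n in totals.items()]
    return sorted(shares, key=lambda item: item[1], reverse=True)


# Made-up chunk list for illustration: mostly English with a short French part.
chunks = [
    {"name": "ENGLISH", "code": "en", "offset": 0, "bytes": 900},
    {"name": "FRENCH", "code": "fr", "offset": 900, "bytes": 100},
]
shares = language_shares(chunks)
print(shares)
```

Comparing the first entry of such a list with the single language a binding reports would make the "second-best guess" cases easy to spot.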

fntlnz commented 3 years ago

Would you be able to do any further debugging to tackle this issue, @mentalstring?

mentalstring commented 3 years ago

@fntlnz Happy to help, but it depends on what you mean by debugging. I'm afraid I can't help much with the code itself, but I'm happy to test it against our large set of texts and provide some oddballs if that helps. Or do you have any pointers on the best way to help out?

Regardless, I went ahead and tried the latest code, but nothing has changed from what is already reported here. Out of the box I got the same results as mentioned in https://github.com/fntlnz/cld2-php-ext/issues/2#issue-143550643. Using https://github.com/CLD2Owners/cld2/commit/7f791121dc058422ac6bb945b7e655e4ce24f473 fixed the Spanish sentence, as you found (but I got Unknown for the other). I also tried the cld2 version used by node-cld and got the same results as in https://github.com/fntlnz/cld2-php-ext/issues/2#issue-143550643.

Let me know if I can be of help somehow.
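The corpus comparison offered above could be organised along these lines. This is only a sketch under stated assumptions: `find_disagreements` is a hypothetical helper, and the two detectors are stand-ins that replay the language codes reported at the top of this issue instead of calling node-cld or cld2-php-ext.

```python
def find_disagreements(texts, detect_a, detect_b):
    """Return (text, code_a, code_b) for every text where the detectors differ."""
    mismatches = []
    for text in texts:
        code_a, code_b = detect_a(text), detect_b(text)
        if code_a != code_b:
            mismatches.append((text, code_a, code_b))
    return mismatches


# Stand-in lookup tables built from the results quoted in this issue.
french_text = "Bjr on et 2 personne vou pouver nou recupere a bezon ou pa meci"
spanish_text = "Sin ningun problema pablo"
node_cld_reported = {french_text: "fr", spanish_text: "es"}
php_ext_reported = {french_text: "ht", spanish_text: "su"}

# In a real run these would be replaced by calls into each binding.
oddballs = find_disagreements(
    [french_text, spanish_text],
    node_cld_reported.get,
    php_ext_reported.get,
)
for text, code_a, code_b in oddballs:
    print(f"{code_a} vs {code_b}: {text}")
```

Collecting only the disagreements keeps the list of oddballs short enough to attach to the issue.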