Closed: kklepper closed this issue 7 years ago
Hi @kklepper, this issue is not directly related to the detectlanguage-php client, but rather to the detection engine. I have checked the detection of "señor", and it is currently detected as Spanish (es).
As for "Fjord", you are right: it is currently detected as Norwegian, but with very low confidence, and the result is marked as unreliable. When only one or a few short words are passed, detection can be unreliable. To get more reliable results, I suggest passing more text, such as several sentences.
Detection of short words will be improved in upcoming versions of the detection engine.
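
A minimal sketch of how a caller might act on the confidence value and the reliability flag mentioned above, written in Python for illustration. The field names `language`, `isReliable` and `confidence`, the helper `pick_reliable_language`, the threshold and the sample values are all assumptions made for this example, not output taken from the engine.

```python
from typing import Optional


def pick_reliable_language(
    detections: list[dict], min_confidence: float = 5.0
) -> Optional[str]:
    """Return the best language code, or None if the result should not be trusted.

    Assumes each detection is a dict with the hypothetical keys
    'language', 'isReliable' and 'confidence'.
    """
    if not detections:
        return None
    # Take the candidate with the highest confidence score.
    best = max(detections, key=lambda d: d.get("confidence", 0.0))
    # Reject results the engine itself flags as unreliable.
    if not best.get("isReliable", False):
        return None
    # Reject results below the caller's own (arbitrary) threshold.
    if best.get("confidence", 0.0) < min_confidence:
        return None
    return best["language"]


# Hypothetical results; the numbers are invented for illustration.
short_input = [{"language": "no", "isReliable": False, "confidence": 0.4}]
longer_input = [{"language": "de", "isReliable": True, "confidence": 12.3}]

print(pick_reliable_language(short_input))   # None
print(pick_reliable_language(longer_input))  # de
```

Returning `None` rather than a low-confidence guess lets the caller fall back to a default language or ask for more text.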
Thank you! Great!
Your engine works very well in general. But after testing it a little, I found some annoying cases of misinterpretation, one of which appears directly in your instructions.
How can it be that a word like "señor" is interpreted as Portuguese at all? In Portuguese, your example sentence would read "Bom dia senhor".
Likewise, a construction like "Fjord- und Familienpferde", which is clearly German, is interpreted as Norwegian, probably because the word "Fjord" takes precedence here.
The same is true for the sequence "Fjordpferd, Fjord, Pferde, Pferd, Fjordpferde, Fjordgestüt, Gestüt" which has lots of words nonexistent in Norwegian. Google detects it as German and translates into Norwegian with "Fjord hest, fjord, hester, hest, fjordhest, Fjord Stud, Stud".
I think the second test is proof that "Fjord" is the trigger for the classification as Norwegian. Given this, how likely is it that other words trigger wrong language classifications as well?