Mimino666 / langdetect

Port of Google's language-detection library to Python.

detect() not picking top language #16

Closed by haydenth 8 years ago

haydenth commented 8 years ago

Having an issue here. Should detect() always return the highest-probability language?

>>> x = '''Imagenes accesibles a todos en Twitter'''
>>> import langdetect
>>> langdetect.detect_langs(x)
[es:0.857137818744, en:0.142858598247]
>>> langdetect.detect(x)
'es'
>>> x
'Imagenes accesibles a todos en Twitter'
>>> langdetect.detect(x)
'en'
>>> langdetect.detect_langs(x)
[es:0.714283198255, en:0.285714266874]
>>> langdetect.detect(x)
'en'
>>> langdetect.detect(x)
'es'
>>> langdetect.detect(x)
'en'
>>> langdetect.detect(x)
'es'
>>> langdetect.detect(x)
'en'

Mimino666 commented 8 years ago

Yes, it should. But the language detection is non-deterministic, which is part of the design of the original Google project (see #3).

In your case, the message is too short and ambiguous, so sometimes "es" wins and sometimes "en" wins.
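
One possible way to cope with such inputs is to look at the probabilities from detect_langs() and treat a narrow gap between the top two candidates as "undecided". The sketch below is only an illustration: `confident_language` and its `min_margin` threshold are hypothetical names, not part of langdetect, and the `Candidate` tuple is a stand-in for the library's result objects (which expose `.lang` and `.prob`) so the example runs on its own.

```python
from collections import namedtuple


def confident_language(candidates, min_margin=0.5):
    """Return the top language code, or None if the result looks ambiguous.

    `candidates` is any iterable of objects with `.lang` and `.prob`
    attributes, such as the list returned by langdetect.detect_langs().
    `min_margin` is a hypothetical threshold, not a langdetect setting.
    """
    ranked = sorted(candidates, key=lambda c: c.prob, reverse=True)
    if not ranked:
        return None
    runner_up = ranked[1].prob if len(ranked) > 1 else 0.0
    if ranked[0].prob - runner_up < min_margin:
        return None
    return ranked[0].lang


# Stand-in for langdetect's result objects so the sketch is self-contained.
Candidate = namedtuple("Candidate", ["lang", "prob"])

# Probabilities copied from the two detect_langs() outputs in the report.
run_a = [Candidate("es", 0.857137818744), Candidate("en", 0.142858598247)]
run_b = [Candidate("es", 0.714283198255), Candidate("en", 0.285714266874)]

print(confident_language(run_a))  # gap of about 0.71: prints es
print(confident_language(run_b))  # gap of about 0.43: prints None
```

With the real library, the call would look like `confident_language(langdetect.detect_langs(x))`. The right threshold depends on the data and would need tuning.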