Mimino666 / langdetect

Port of Google's language-detection library to Python.

Looks like langdetect is getting fooled by bytes #94

Open Fratso opened 2 years ago

Fratso commented 2 years ago

Hi, I tried to use langdetect as a plaintext detector, to check whether it could tell an English sentence apart from a random deciphered string.

Here's an example:

>>> from langdetect import detect
>>> from langdetect import detect_langs

>>> deciphered_string = b'Q\x04RWUV\x04YTXS\x05RTTPU\x00QYPSURTYSTRW\x04\x05R\x05\x04WVRUQTXQQP\x04R\x07TRT\x02\x04WSVPQRS'
>>> deciphered_string.decode("utf-8")
'Q\x04RWUV\x04YTXS\x05RTTPU\x00QYPSURTYSTRW\x04\x05R\x05\x04WVRUQTXQQP\x04R\x07TRT\x02\x04WSVPQRS'

>>> detect_langs(deciphered_string.decode("utf-8"))
[en:0.999994546875217]
>>> detect(deciphered_string.decode("utf-8"))
'en'

I expected the function to raise an error, not to return a wrong result.
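
One possible workaround is to pre-filter candidate strings before handing them to langdetect, so that obvious non-text never reaches the detector. The sketch below is only a heuristic built on the standard library. `looks_like_plaintext` and its threshold are hypothetical names and values chosen for illustration; they are not part of langdetect.

```python
import unicodedata


def looks_like_plaintext(text, min_readable_ratio=0.8):
    """Heuristic pre-filter (hypothetical helper, not part of langdetect).

    Rejects strings that contain control characters other than common
    whitespace, or that consist mostly of non-letter, non-space characters.
    """
    if not text:
        return False

    allowed_controls = {"\n", "\r", "\t"}
    for ch in text:
        # Unicode category "Cc" covers control characters such as \x00-\x1f.
        if unicodedata.category(ch) == "Cc" and ch not in allowed_controls:
            return False

    # Require that most characters are letters or whitespace.
    readable = sum(1 for ch in text if ch.isalpha() or ch.isspace())
    return readable / len(text) >= min_readable_ratio


# The string from the report above, decoded the same way.
deciphered_string = b'Q\x04RWUV\x04YTXS\x05RTTPU\x00QYPSURTYSTRW\x04\x05R\x05\x04WVRUQTXQQP\x04R\x07TRT\x02\x04WSVPQRS'
candidate = deciphered_string.decode("utf-8")

print(looks_like_plaintext(candidate))                       # False
print(looks_like_plaintext("This is an English sentence."))  # True
```

Only strings that pass this filter would then be passed on to `detect` or `detect_langs`. The check is deliberately strict about control characters, so it would need loosening for input that legitimately contains them.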