BYVoid / uchardet

An encoding detector library ported from Mozilla

uchardet wrongly determines the text as WINDOWS-1252 #42

Closed lemonl2 closed 5 years ago

lemonl2 commented 5 years ago

The file is named 123.txt and its content is "hour时.txt" or "hour间.txt". uchardet reports the file's charset as "WINDOWS-1252", but it is actually "UTF-8". Could you help with this?

Jehan commented 5 years ago

This has not been the main repository for uchardet for years now, as stated in the README. Please do not report bugs here but on the freedesktop bug tracker: https://gitlab.freedesktop.org/uchardet/uchardet

In any case, there won't be a solution for your specific issue. No system in the world can determine with certainty the encoding of a 9-character text that mixes 2 languages and has no actual meaning! It's as if this text gathered every reason that makes detection impossible.

uchardet works statistically, looking at character usage but also at character sequence usage. This way, it detects a (charset, language) pair. For this to work well, it needs a long enough text (a full sentence at least) in a known language. It is not meant to recognize single words, let alone 2 words in different languages/character sets.
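
To illustrate why the sample is ambiguous, here is a minimal Python sketch. It is not how uchardet works internally, and the helper `decodes_as` is a hypothetical name used only for this example. It shows that the bytes of "hour时.txt" are a valid byte sequence in UTF-8, Windows-1252 and Latin-1 alike, so validity alone cannot pick a winner; only statistics on a longer text could.

```python
# Illustration only: this does not reproduce uchardet's algorithm.
# "\u65f6" is the character 时 from the reported file content.
sample = "hour\u65f6.txt".encode("utf-8")  # the bytes as stored on disk


def decodes_as(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


candidates = ["utf-8", "cp1252", "latin-1"]
valid = [enc for enc in candidates if decodes_as(sample, enc)]

print(len(sample), "bytes, valid in:", valid)
# Show what a Windows-1252 reader would see, with non-ASCII escaped.
print(ascii(sample.decode("cp1252")))
```

Every candidate accepts the input, and only 3 of the 11 bytes are non-ASCII, which leaves a statistical detector almost nothing to work with.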

lemonl2 commented 5 years ago

got it! thank you!