aadsm / jschardet

Character encoding auto-detection in JavaScript (port of python's chardet)
GNU Lesser General Public License v2.1

unreliable detection - windows1250 #70

Open peminator opened 3 years ago

peminator commented 3 years ago

windows-1250 is labeled as Hungarian, but it is really Central European, so the text may also be Slovak or Czech, or maybe even other languages. The proper name is "Central European". The accented characters to recognize are, for example, čČšŠťŤžŽéÉľĽ.

Found in VS Code, which uses this library: text saved as 1250 gets detected on reopen as 1252, or another 125*, or even as ISO-8859-2, etc. It depends on which subset of these non-basic characters is in the content.

peminator commented 5 months ago

Hey bro, still bad ??? Just tested the new VSCode insiders using jschardet and it DOES NOT DETECT windows1250 AT ALL

check here: https://github.com/microsoft/vscode/pull/208550#issuecomment-2151601674

aadsm commented 4 months ago

Hey! I just saw your comment on https://github.com/microsoft/vscode/pull/208550. Let me take a look into this. Also, don't-bro-me ;P.

aadsm commented 4 months ago

Yeah, that code page is in the group of encodings I wasn't able to come up with a test string for. I was only able to test windows-1251.

Thanks for providing one (windows1250.zip). I'm going to use it to create the test and figure out what's going on. Fwiw Sublime Text is also not able to correctly detect it.

I'm less familiar with the Eastern European languages. Out of curiosity, can you tell me the difference between the two? And when is one used more than the other?

peminator commented 4 months ago

Afaik Microsoft used this sentence to test fonts (a pangram that showcases the accented characters): **Příliš žluťoučký kůň úpěl ďábelské ódy.** That's a sentence in Czech. Another country that certainly used this encoding is Slovakia, so I would add **Päť tôní, ľahký skok**. Just copy the sentences and save them as Windows-1250 in VS Code or another editor that can do it.
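For anyone who wants to build the test bytes without an editor, here is a minimal sketch. It assumes a hand-written, partial Windows-1250 table that covers only the accented letters in the two sentences above. The names `WIN1250_SUBSET` and `encodeWindows1250Subset` are hypothetical and not part of jschardet; a real fixture would more likely use a full codec library.

```javascript
// Partial Windows-1250 table (illustrative subset, NOT a full codec):
// only the non-ASCII letters that occur in the two sample sentences.
const WIN1250_SUBSET = new Map([
  ["á", 0xe1], ["ä", 0xe4], ["č", 0xe8], ["ď", 0xef], ["é", 0xe9],
  ["ě", 0xec], ["í", 0xed], ["ľ", 0xbe], ["ň", 0xf2], ["ó", 0xf3],
  ["ô", 0xf4], ["ř", 0xf8], ["š", 0x9a], ["ť", 0x9d], ["ú", 0xfa],
  ["ů", 0xf9], ["ý", 0xfd], ["ž", 0x9e],
]);

// Hypothetical helper: encodes text to Windows-1250 bytes using the subset
// table above and throws on any character the table does not cover.
function encodeWindows1250Subset(text) {
  const bytes = [];
  // Normalize so that letters typed as base + combining mark still match.
  for (const ch of text.normalize("NFC")) {
    const code = ch.codePointAt(0);
    if (code < 0x80) {
      bytes.push(code); // ASCII is identical in Windows-1250
    } else if (WIN1250_SUBSET.has(ch)) {
      bytes.push(WIN1250_SUBSET.get(ch));
    } else {
      throw new Error(`Character not in subset table: ${ch}`);
    }
  }
  return Uint8Array.from(bytes);
}

const czechBytes = encodeWindows1250Subset("Příliš žluťoučký kůň úpěl ďábelské ódy.");
const slovakBytes = encodeWindows1250Subset("Päť tôní, ľahký skok");
```

The resulting arrays could be written to disk (for example with Node's `fs.writeFileSync`) and fed to the detector. Note that š (0x9A) and ž (0x9E) sit at the same byte values in Windows-1252, which is part of why short samples are ambiguous between the two code pages.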

I'd suggest that, if unsure, whenever Windows-1252 or ISO-8859-2 is detected with any confidence, you also add 1250 to the possible results with a slightly lower confidence level. The similarity and the affected characters are also described in the Wikipedia article on Windows-1250. So if there is a chance it's 1252, there's also a chance it's 1250.

That way, current users would still get what they expected before, and in VS Code, where I can provide multiple guess candidates in the recent Insiders build, I would get what I need if I configure it to look only for 1250.

That would be an immediate quick fix, to be perfected later.
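The suggestion can be sketched roughly as a post-processing step over a list of candidates. This is only an illustration, assuming candidates shaped like the `{ encoding, confidence }` object that jschardet's `detect` returns for a single guess. The helper name `addWindows1250Fallback` and the `penalty` value are made up for the example and say nothing about how the library actually implements it.

```javascript
// Encodings that are easily confused with windows-1250 (per the suggestion above).
const LOOKALIKES = new Set(["windows-1252", "iso-8859-2"]);

// Hypothetical helper, not part of jschardet: if a lookalike encoding was
// detected and windows-1250 was not, append windows-1250 with a slightly
// lower confidence than the best lookalike. `penalty` is an assumed value.
function addWindows1250Fallback(candidates, penalty = 0.05) {
  const name = (c) => c.encoding.toLowerCase();

  // Already a candidate: leave the results untouched.
  if (candidates.some((c) => name(c) === "windows-1250")) {
    return candidates.slice();
  }

  const lookalikes = candidates.filter((c) => LOOKALIKES.has(name(c)));
  if (lookalikes.length === 0) {
    return candidates.slice();
  }

  const best = Math.max(...lookalikes.map((c) => c.confidence));
  const fallback = {
    encoding: "windows-1250",
    confidence: Math.max(0, best - penalty),
  };

  // Keep the list ordered from most to least confident.
  return [...candidates, fallback].sort((a, b) => b.confidence - a.confidence);
}

const widened = addWindows1250Fallback([
  { encoding: "windows-1252", confidence: 0.9 },
]);
```

Existing callers that only read the first entry still see windows-1252, while a caller that filters for windows-1250 now finds it further down the list.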

Also, I suggest you rename it. It was never named Hungarian afaik; its official name is "Central European" because it's used in many countries there. Naming it after one country may trigger others (exactly as it triggered me as a Slovak, because Hungary tried to absorb Slovakia in the past as part of its Hungarian empire; they used to behave with great arrogance back then and Slovak people were oppressed, and many Slovak people still feel great hate toward them). It was never an open war, but it was often not far from it, so naming it Hungarian feels to me like "the empire strikes back", even long after it was dissolved ;D

aadsm commented 4 months ago

I was busy on the weekend. I think your suggestion makes sense so I went ahead and implemented just that.

The reason it's named "windows-1250 (Hungarian)" is that it uses a Hungarian language model to predict whether the text is in Hungarian. Like you mentioned, other countries used the same encoding, so I imagine that's the reason we're getting no match at all. But it could also be that windows-1252 is just being detected with fewer characters than windows-1250 needs to come up with any confidence. This is something that I might look into in the future, as it could also affect other encodings.

Ah yeah, the Slavic countries have been a source of invasion and dispute for centuries 😬. I didn't follow the breakup of those countries after the fall of the USSR, but I still remember Yugoslavia and Czechoslovakia being on the news. The word "slave" actually comes from "Slav" due to the slavery of Slavs that happened during the caliphate :(.