Wrong name for "Wrong ISO-8851-1 Mojibake"

mauntrelio commented 10 years ago

It seems to me it should be "Wrong ISO-8859-1 Mojibake". ISO-8851-1 is an international standard for butter and not for character encoding... (http://www.iso.org/iso/home/store/catalogue_tc/catalogue_detail.htm?csnumber=35218)

mathiasbynens commented 10 years ago

Ah, good catch!

@Boldewyn probably meant ISO-8859-1, for which the canonical name is actually windows-1252 as per the Encoding Standard.

mauntrelio commented 10 years ago

On 09/05/2014 09:22, Mathias Bynens wrote:

Ah, good catch!

I find the website codepoints.net awesome!

Sorry for my German.

I find the website codepoints.net wonderful, and the best online resource about Unicode on the Internet. I'm currently building something similar, in Python and WSGI, with additional information about "confusables", Unicode code chart cross references, and IDNA properties (whatever is in the other Unicode files).

Boldewyn commented 10 years ago

Oh, but I love butter! For real, thanks for the catch!

@mauntrelio if you use Python, take a look at unicodeinfo. That's my approach to fetch the Unicode data automatically, parse it and put it in a SQLite db. Written mostly in Python, and I use it as bootstrap tool for codepoints.net. By the way, I think your German is very good. I'm really looking forward to your project.

@mathiasbynens I'm not quite sure the encoding standard is 100% correct here, naming them as synonyms. ISO-8859-1 (not the butter!) is not identical. It's more like ISO-8859-15 sans Euro. The Windows code page then took the unassigned places and just filled them up with what seemed useful (like the double-dagger). The German Wikipedia has it listed quite nicely. However, I'm quite fine with using the label "windows-1252" for it.

mathiasbynens commented 10 years ago

I'm not quite sure the encoding standard is 100% correct here, naming them as synonyms. ISO-8859-1 (not the butter!) is not identical. It's more like ISO-8859-15 sans Euro.

@annevk might be able to explain – he lists ISO-8859-1 as a label for windows-1252, and I’m sure there’s a good reason.

mauntrelio commented 10 years ago

IMHO "Wrong windows-1252 Mojibake" is wrong and should be ISO-8859-1. E.g., the page http://codepoints.net/U+4E88 reports the Wrong windows-1252 Mojibake as the string "äº", but if the 3-byte UTF-8 sequence were interpreted as Windows-1252, it should be "äºˆ", because the third byte (hex 88) in Windows-1252 is ˆ: MODIFIER LETTER CIRCUMFLEX ACCENT (while in ISO-8859-1 it is a C1 control character, HORIZONTAL TABULATION SET, which is what actually gets echoed on the page). Since you're using the PHP function utf8_encode to produce the Mojibake, the produced character is in the Latin-1 set (ISO-8859-1) and not in Windows Latin-1. Windows-1252 and ISO-8859-1 are "quasi" the same, but not exactly the same in the hex range 80-9F. Another issue... (should I post it as a new one? Maybe better)...
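
To make the difference concrete, here is a minimal sketch in Python (not the PHP that codepoints.net uses; the variable names are only illustrative). It takes the UTF-8 bytes of U+4E88 and decodes them once as ISO-8859-1 and once as Windows-1252, using Python's own codec names for the two:

```python
# UTF-8 bytes of U+4E88, the code point discussed in this comment.
utf8_bytes = "\u4e88".encode("utf-8")

# Misinterpret the same three bytes under the two single-byte encodings.
as_latin1 = utf8_bytes.decode("iso-8859-1")
as_cp1252 = utf8_bytes.decode("cp1252")

def code_points(text):
    """Return the code points of a string in U+XXXX notation."""
    return [f"U+{ord(ch):04X}" for ch in text]

print(utf8_bytes.hex(" "))     # e4 ba 88
print(code_points(as_latin1))  # ['U+00E4', 'U+00BA', 'U+0088']
print(code_points(as_cp1252))  # ['U+00E4', 'U+00BA', 'U+02C6']
```

The first two characters agree; only the third byte (hex 88) differs, giving a C1 control under ISO-8859-1 and the circumflex modifier under Windows-1252.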

mauntrelio commented 10 years ago

@mathiasbynens: on the page http://encoding.spec.whatwg.org/, windows-1252 is also labelled as ascii... I don't think the labels are used as synonyms. For a comprehensive and authoritative source for character encoding names and aliases we should refer to the IANA: http://www.iana.org/assignments/character-sets/character-sets.xhtml

mauntrelio commented 10 years ago

@Boldewyn Nice suggestion, unicodeinfo, but I don't like the fact that much of the information (block names and ranges, and scripts) is buried in the code... I'm trying to write a tool that downloads the latest release of Unicode from the consortium's web page, parses the data and puts it in a MongoDB database, upserting if necessary, so that new versions of the standard can be easily updated.

Boldewyn commented 10 years ago

I agree with the wrong mojibake. Let's keep it here.

annevk commented 10 years ago

@mauntrelio the IANA registry is broken if you care about the web, as stated at the top of the Encoding Standard. Yes, ISO-8859-1 is an ISO standard; however, on the web it is implemented as windows-1252 as per the Encoding Standard.

It does not matter much, as long as everyone does the same, and new content solely uses utf-8.

mauntrelio commented 10 years ago

@annevk oh! I see.... so if a web page declares itself as having a us-ascii or an iso-8859-1 encoding, the browser will interpret, e.g., the hex byte 80 as the euro sign (windows-1252)? Sounds a little weird but ok.
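
A rough Python sketch of that behaviour, for illustration only: the label set below is a small hand-copied subset of the labels the Encoding Standard maps to windows-1252, and `decode_like_browser` is a hypothetical helper, not a real browser or library API.

```python
# Partial subset of the labels that the Encoding Standard maps to
# windows-1252. The real table is longer; this is only an illustration.
WINDOWS_1252_LABELS = {
    "ascii", "us-ascii", "iso-8859-1", "latin1", "cp1252", "windows-1252",
}

def decode_like_browser(data: bytes, label: str) -> str:
    """Hypothetical helper: decode `data` the way the declared label is
    treated on the web, i.e. as windows-1252 for the labels listed above."""
    normalized = label.strip().lower()
    if normalized in WINDOWS_1252_LABELS:
        # Caveat: Python's cp1252 codec rejects the five bytes that
        # windows-1252 leaves unassigned (0x81, 0x8D, 0x8F, 0x90, 0x9D),
        # whereas the Encoding Standard maps them to C1 controls.
        return data.decode("cp1252")
    return data.decode(normalized)

# A page declaring iso-8859-1 or us-ascii gets the euro sign for byte 80.
print(decode_like_browser(b"\x80", "ISO-8859-1") == "\u20ac")  # True
print(decode_like_browser(b"\x80", "us-ascii") == "\u20ac")    # True

# Strict ISO-8859-1 decoding yields the C1 control U+0080 instead.
print(b"\x80".decode("iso-8859-1") == "\u0080")                # True
```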

annevk commented 10 years ago

The web is weird.

Codepoints / Codepoints.net

Wrong name for "Wrong ISO-8851-1 Mojibake" #28