expath / expath-cg

Repository for the W3C EXPath Community Group.

bom, encoding and test EXPath-file-writeText3-002 #70

Open benibela opened 7 years ago

benibela commented 7 years ago

The test EXPath-file-writeText3-002 assumes encoding utf-16 is written as big-endian with BOM.

It could just as well mean little-endian, and either byte order could be written with or without a BOM.
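To make the ambiguity concrete, here is a minimal Python sketch (an illustration only, not part of the test suite or any EXPath implementation) that lists the four byte sequences an encoding name of "utf-16" could plausibly produce. The helper name `utf16_serializations` is hypothetical.

```python
import codecs


def utf16_serializations(text: str) -> dict:
    """Return the four byte sequences that "utf-16" could plausibly mean."""
    # The explicit-endianness codecs never add a BOM themselves.
    big = text.encode("utf-16-be")
    little = text.encode("utf-16-le")
    return {
        "big-endian, BOM": codecs.BOM_UTF16_BE + big,
        "big-endian, no BOM": big,
        "little-endian, BOM": codecs.BOM_UTF16_LE + little,
        "little-endian, no BOM": little,
    }


# Print each variant as hex so the differences are visible byte by byte.
for label, data in utf16_serializations("hi").items():
    print(f"{label:22} {data.hex(' ')}")
```

A test that compares output byte for byte can accept only one of these four, unless it lists alternatives.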

benibela commented 7 years ago

The same applies to EXPath-file-appendText3-002.

michaelhkay commented 7 years ago

There was email correspondence on this subject at the time, see for example

https://lists.w3.org/Archives/Public/public-expath/2012Jul/0005.html

This seemed to reach a level of consensus though I don't think this was well captured in the final spec.

You're free of course to interpret the spec any way you like but we will achieve better interoperability between implementations if implementors respect the test suite as defining a consensus interpretation.

While the relevant RFCs certainly make UTF-16 without a BOM legal, I think there is a strong presumption that the default serialization for UTF-16 should (a) be big-endian, and (b) have a BOM, and I would encourage you to follow these conventions.

Michael Kay Saxonica

On 25 Oct 2016, at 21:45, Benito van der Zander notifications@github.com wrote:

The test EXPath-file-writeText3-002 assumes encoding utf-16 is written as big-endian with BOM.

It could just as well mean little-endian, and either byte order could be written with or without a BOM.


benibela commented 7 years ago

There you wrote

  • file:append-text#3 does not write a BOM

contrary to EXPath-file-appendText3-002

Despite a remark on it: <modified by="Christian Grün" on="2013-11-20" change="Alternative without BOM added"/>

While the relevant RFCs certainly make UTF-16 without a BOM legal, I think there is a strong presumption that the default serialization for UTF-16 should (a) be big-endian, and (b) have a BOM, and I would encourage you to follow these conventions.

But I only wanted to deal with HTML. There the WHATWG gave a clear definition, "UTF-16" means always little-endian: https://www.w3.org/TR/encoding/#utf-16le
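On the append-text point in this comment: a BOM written in the middle of an existing file is decoded as the character U+FEFF rather than as a byte-order signal. The Python sketch below shows one possible policy, which is an assumption made for illustration and not what the spec or the test suite mandates: write the BOM only when the target file is new or empty. The function name `append_text_utf16be` is hypothetical.

```python
import codecs
import os


def append_text_utf16be(path: str, text: str) -> None:
    """Append text as UTF-16BE, writing a BOM only if the file is new or empty.

    Assumed policy for illustration; not taken from the EXPath File Module.
    """
    is_empty = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "ab") as f:
        if is_empty:
            # First write: mark the byte order once, at the very start.
            f.write(codecs.BOM_UTF16_BE)
        # Later writes: no BOM, so no stray U+FEFF ends up mid-file.
        f.write(text.encode("utf-16-be"))
```

Under this policy, two successive appends to a fresh file yield a single BOM followed by the concatenated text.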

michaelhkay commented 7 years ago

Wikipedia article on UTF-16 says

If the BOM is missing, RFC 2781 https://tools.ietf.org/html/rfc2781 says that big-endian encoding should be assumed. (In practice, due to Windows using little-endian order by default, many applications similarly assume little-endian encoding by default.)

So it looks as if WhatWG are playing their usual game - ignore standards, just endorse the bugs in existing products.

But we're concerned here with writing of text, not reading. All the specs seem to agree that if you're writing, the most important thing is to include a BOM so that the reader knows what the endianness actually is.

Michael Kay Saxonica
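The reader's side of this argument can be sketched in a few lines of Python. This is a minimal illustration, assuming the RFC 2781 fallback quoted above (big-endian when no BOM is present); the function name `decode_utf16` is hypothetical.

```python
import codecs


def decode_utf16(data: bytes) -> str:
    """Decode UTF-16 bytes, using the BOM if present, else assuming big-endian."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be")
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le")
    # No BOM: follow the RFC 2781 recommendation and assume big-endian.
    # A reader that assumes little-endian instead would decode the same
    # bytes to different characters.
    return data.decode("utf-16-be")
```

With a BOM the reader never has to guess. Without one, little-endian data passed to this function is silently misread, which is the interoperability risk a writer avoids by including the BOM.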

On 26 Oct 2016, at 10:47, Benito van der Zander notifications@github.com wrote:

There you wrote

file:append-text#3 does not write a BOM contrary to EXPath-file-appendText3-002

Despite a remark on it

While the relevant RFCs certainly make UTF-16 without a BOM legal, I think there is a strong presumption that the default serialization for UTF-16 should (a) be big-endian, and (b) have a BOM, and I would encourage you to follow these conventions.

But I only wanted to deal with HTML. There the WHATWG gave a clear definition, "UTF-16" means always little-endian: https://www.w3.org/TR/encoding/#utf-16le

michaelhkay commented 7 years ago

On 26 Oct 2016, at 10:47, Benito van der Zander notifications@github.com wrote:

There you wrote

file:append-text#3 does not write a BOM contrary to EXPath-file-appendText3-002

Well I'm sure my message wasn't the last word on the subject but it's hard to reconstruct the decisions at this distance.

Despite a remark on it

While the relevant RFCs certainly make UTF-16 without a BOM legal, I think there is a strong presumption that the default serialization for UTF-16 should (a) be big-endian, and (b) have a BOM, and I would encourage you to follow these conventions.

But I only wanted to deal with HTML. There the WHATWG gave a clear definition, "UTF-16" means always little-endian: https://www.w3.org/TR/encoding/#utf-16le

Glory be, everything WhatWG does is weird.

Michael Kay Saxonica

benibela commented 7 years ago

All the specs seem to agree that if you're writing, the most important thing is to include a BOM so that the reader knows what the endianness actually is.

There is also JSON. There the BOM is forbidden: https://tools.ietf.org/html/rfc7159#section-8.1

Well I'm sure my message wasn't the last word on the subject but it's hard to reconstruct the decisions at this distance.

It has been a while.

It seems times are changing, and newer standards have a different opinion.

michaelhkay commented 7 years ago

There is also JSON. There the BOM is forbidden: https://tools.ietf.org/html/rfc7159#section-8.1

Actually, not quite. It says that "implementations" shall not add a BOM. It doesn't say what an "implementation" is. With normal separation of concerns the JSON output will be written as characters, and the encoding to UTF-16 will be done by a library that has no idea that the text it is encoding is JSON, and therefore is under no obligation to conform to the JSON specification.

Michael Kay Saxonica
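The separation of concerns described in the last comment can be illustrated with a small Python sketch. It is only an illustration under stated assumptions: `serialize_json` and `encode_utf16_with_bom` are hypothetical names, and big-endian with a BOM is chosen because that is the convention recommended earlier in the thread.

```python
import codecs
import json


def serialize_json(value) -> str:
    """JSON layer: produces characters only and knows nothing about bytes."""
    return json.dumps(value)


def encode_utf16_with_bom(text: str) -> bytes:
    """Encoding layer: generic text encoder with no idea the text is JSON."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


# The two layers are composed by the caller; neither layer knows about the other.
payload = encode_utf16_with_bom(serialize_json({"a": 1}))
print(payload[:2].hex(" "))
```

Whether the combined pipeline counts as an "implementation" in the sense of RFC 7159 is exactly the question the comment raises; the sketch only shows where the BOM gets added.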