JATS4R / JATS4R-Participant-Hub

The hub for all JATS4R meeting notes, examples, draft recommendations, documents, and issues.
http://jats4r.org

Character encoding of XML documents #106

Closed Klortho closed 9 years ago

Klortho commented 9 years ago

As described here, the validator only accepts utf-8 (with or without a BOM) and utf-16 (with a BOM) encodings. I would like to make a JATS4R recommendation saying that, for compliance, instance documents have to use one of those encodings.

The rationale is, first of all, that the validator is a good exemplar of an app trying to do something with JATS documents -- if it can't handle something, that's a good indication that it will cause trouble for others. The other rationale is that utf-8 is almost ubiquitous now, in Western documents at least, but, as far as I have read, CJK applications still use utf-16 a lot -- with Chinese text, for example, utf-8 files are much larger.

All non-unicode encodings should be disallowed, as they're not universally supported.

Melissa37 commented 9 years ago

I agree. Will add this to the call agenda tomorrow. If we could knock this off, we'd have another recommendation finalised :-)

hubgit commented 9 years ago

Sounds good - the recommendation could be to use UTF-8, but that UTF-16 is also acceptable.

hubgit commented 9 years ago

For reference:

Each format has its own set of advantages and disadvantages with respect to storage efficiency (and thus also of transmission time), and processing efficiency. […] Since Unicode code space blocks are organized by character set (i.e. alphabet/script), storage efficiency of any given text effectively depends on the alphabet/script used for that text. So, for example, UTF-8 needs one less byte per character (8 versus 16 bits) than UTF-16 for the 128 code points between U+0000 and U+007F, but needs one more byte per character (24 versus 16 bits) for the 63,488 code points between U+0800 and U+FFFF. Therefore if there are more characters in the range U+0000 to U+007F than there are in the range U+0800 to U+FFFF then UTF-8 is more efficient, while if there are fewer then UTF-16 is more efficient.

https://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings#Efficiency

vincentml commented 9 years ago

It might be useful to include "US-ASCII" (or more precisely "ISO646-US") in the list of recommended encodings. Setting the output encoding to "US-ASCII" or "ISO646-US" in an XSLT stylesheet causes any extended characters to be encoded as numerical character references (NCRs) in the output XML (at least with Saxon). The advantage of that is that NCRs are easier to identify, and they tend to work at times when bare Unicode characters do not.
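
For example, a minimal identity transform like this (just a sketch, not an actual JATS4R stylesheet) will do it -- with the output encoding set to US-ASCII, the serializer has to write every non-ASCII character as an NCR:

    <xsl:stylesheet version="2.0"
                    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
      <!-- US-ASCII output forces characters outside the ASCII range
           to be serialized as NCRs, e.g. é comes out as &#233; -->
      <xsl:output method="xml" encoding="US-ASCII"/>
      <xsl:template match="@*|node()">
        <xsl:copy>
          <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
      </xsl:template>
    </xsl:stylesheet>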

hubgit commented 9 years ago

It might be useful to include "US-ASCII" (or more precisely "ISO646-US") in the list of recommended encodings.

Converting to US-ASCII might be something that an application does internally, if someone needs to examine the numerical entities (for example), but there's no reason why anyone should be publishing the XML files with US-ASCII encoding.

On the other hand, it doesn't really matter if the XML file is ASCII-encoded (or uses any other encoding) - the XML parser will still interpret the contents in exactly the same way, as long as the encoding is correctly specified in the XML declaration.
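
For example (a made-up snippet, just to illustrate), these two files parse to exactly the same content once the declared encodings are honoured:

    <?xml version="1.0" encoding="US-ASCII"?>
    <p>caf&#233;</p>

    <?xml version="1.0" encoding="UTF-8"?>
    <p>café</p>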

they tend to work at times when bare Unicode characters do not.

Is there an example of this that doesn't involve a broken XML parser?

jeffbeckncbi commented 9 years ago

I feel uneasy making a Recommendation that JATS XML files must be encoded in UTF-8 or UTF-16.

But I don't have any problem saying that our online validator will only work on files encoded this way. That is the price you pay for using our tool.

Anyone who downloads and runs the Schematrons will not be testing for the "proper" character encoding, and why should we care?

hubgit commented 9 years ago

I feel uneasy making a Recommendation that JATS XML files must be encoded in UTF-8 or UTF-16.

I think of the recommendations as SHOULD, rather than MUST. If someone has to make a choice, UTF-8 is usually the best for them.

Klortho commented 9 years ago

@vincentml , I agree with Alf on this. US-ASCII is a subset of UTF-8, so even if a document uses NCRs all over the place and no code points higher than 127, there's still no reason it can't declare the encoding as utf-8 in the publicly-available document.

@jeffbeckncbi , my opinion is that this is a really important issue: XML itself allows a wide variety of character encodings, and that's a big source of trouble for users -- especially non-technical users.

And note that the validator can be made to "work" with the other encodings, insofar as it can complain that the encoding is wrong, and then give up (same as if the document is not well-formed).

Folks running just the schematrons on their own servers won't get that validation check, though.

Klortho commented 9 years ago

I think of the recommendations as SHOULD, rather than MUST.

We have both "shoulds" (warnings) and "musts" (errors).

How about if we make UTF-16 an "info" (I'm thinking of the Japanese here, who might not like to see warnings for every document that is encoded correctly, from their perspective) and any encoding other than UTF-8 or UTF-16 an error?

jeffbeckncbi commented 9 years ago

I just ran a test to see if I could reference the XML declaration with XPath (which we will need to do to make a test in the Schematron).

I was not able to do it with

Klortho commented 9 years ago

Yeah, I don't think there would be any way to check the xml declaration in Schematron, or to check the actual encoding itself, of course.

This, and other checks, such as DTD validation, have to be done at a layer before the Schematron. The validator already does a lot of these, now -- have you tried it lately?

ppKrauss commented 9 years ago

... the recommendation is "to use UTF-8", I agree (!)... But what about a language-based condition? Example: when article/@xml:lang='zh', UTF-16 is a valid alternative. When the language doesn't need an alternative (e.g. en, es, pt, etc.), UTF-8 must be the only option.

PS: I don't see why anyone would use UTF-16 for JATS today, because all the content I see, e.g. zh-Wikipedia and cn-Alibaba, uses UTF-8 to encode markup+content (much more compact than the same XML in UTF-16).


About ASCII: it is a subset of UTF-8, so "ASCII is UTF-8"... Even when the option is to represent all symbols by numerical character references (e.g. &#1234;), it is important to declare UTF-8 (!), because the translation of char(1234) will happen in a UTF-8 context, not in another charset context.

Klortho commented 9 years ago

because the translation of char(1234) will happen in a UTF-8 context, not in another charset context.

No, &#1234; is unambiguous, no matter what the encoding of the document. It specifies Unicode code point (decimal) 1234 = character U+04D2.
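
To make that concrete (a throwaway example, not from any real document):

    <?xml version="1.0" encoding="ISO-8859-1"?>
    <!-- Both references below expand to the same Unicode character,
         U+04D2 (CYRILLIC CAPITAL LETTER A WITH DIAERESIS), even though
         the file itself is declared as Latin-1. -->
    <p>&#1234; is the same character as &#x4D2;</p>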

ppKrauss commented 9 years ago

Oops, sorry about that last comment. @Klortho, you are correct (!), thanks.


I did my homework to understand why ;-) I checked the spec, which says "All XML processors must accept the UTF-8 and UTF-16 encodings of Unicode"... The "context" of any XML character reference is a parsed string, inside the XML processor (the processor does the expansion of &#1234; into Ӓ). It is wrong to imagine the expansion happening earlier, in the unparsed XML string with its specific encoding.

So, the encoding declaration is only about the transport medium, not about the final interpretation... XML must always (!) be interpreted as UCS -- no matter what the encoding, as Klortho says.

Klortho commented 9 years ago

So I just wrote in the draft recs that we recommend that documents be utf-8 (with or without a BOM) or utf-16 (with a BOM).

I think the validator should (nominally) produce an error if the encoding is anything else. I say "nominally" because, in fact, the validator will probably just get confused and produce unpredictable errors if the encoding is anything else. And that's why I think it should be an error: our validator is a litmus test for re-usability. If our validator has problems, then it's a good sign that other tools will have problems, too.

Klortho commented 9 years ago

Re-open this, if you disagree.