JATS4R / JATS4R-Participant-Hub

The hub for all JATS4R meeting notes, examples, draft recommendations, documents, and issues.
http://jats4r.org

Suggested recommendation: discourage use of character entity references #1

Closed Klortho closed 9 years ago

Klortho commented 10 years ago

The Publishing Tag Library has this example:

<copyright-statement>Copyright: &copy; 2004 Eichenberger et al.</copyright-statement>

I would like to suggest that this group recommend that character entity references (CERs) such as &copy; be disallowed, or at least discouraged. Instead, the character itself should be used (with proper encoding, of course): ©, or a numeric character reference: &#xa9;.

In my experience, these CERs are a source of endless trouble. The root of the problem is that they require the parser to have access to the DTD. This instantly makes the instance document much harder to use. We've been doing this stuff for years, but at PMC we still struggle at times to make sure that new applications have ready access to the correct DTDs. For those with less XML experience, it can be a real pain point.
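A minimal sketch of that failure mode, assuming Python's standard xml.etree.ElementTree parser and no DTD available to it. The helper name parses_without_dtd is made up for illustration; the copyright statement is the Tag Library example quoted above.

```python
import xml.etree.ElementTree as ET

# The Tag Library example, with a slot for the copyright sign.
TEMPLATE = (
    "<copyright-statement>Copyright: {} 2004 Eichenberger et al."
    "</copyright-statement>"
)


def parses_without_dtd(xml_text: str) -> bool:
    """Return True if xml_text parses when no DTD is available (hypothetical helper)."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        # An undeclared named entity such as &copy; ends up here.
        return False


results = {
    "named": parses_without_dtd(TEMPLATE.format("&copy;")),
    "numeric": parses_without_dtd(TEMPLATE.format("&#xa9;")),
    "literal": parses_without_dtd(TEMPLATE.format("\u00a9")),
}
for form, ok in results.items():
    print(f"{form}: {'parses' if ok else 'fails'}")
```

Only the named form fails here; the numeric reference and the literal character parse with nothing but the document itself.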

So, more generally, I would like to see all JATS instance documents be in pure "standalone" form, and not using CERs is a small part of that. I noticed that Alf's examples don't use them, and contrasting them with the Tag Library example was what made me think of it.

The group might consider this to be out of scope, since it's at the lexical, rather than the syntactic, level. But even if we don't decide to disallow CERs, perhaps we could make sure not to use them in any "best practices" examples.

hubgit commented 10 years ago

Yes, definitely agree - I was going to suggest this as well.

Klortho commented 10 years ago

Jeff points out that this will be difficult to test for in, for example, Schematron.

Daniel-Mietchen commented 10 years ago

So we can try to do such tests in other languages, say Java.
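A rough sketch of such a test, written in Python rather than Java purely for brevity. Schematron rules run against the parsed document, where entity references have already been expanded, so this check scans the raw text instead. The function name and the simplified name pattern are assumptions for illustration: real XML names allow more characters than the ASCII subset matched here, and the DOCTYPE internal subset is not treated specially.

```python
import re

# The five entities predefined by XML itself; these never need a DTD.
PREDEFINED = {"amp", "lt", "gt", "apos", "quot"}

# Comments and CDATA sections may contain "&name;" as literal text,
# so they are removed before scanning.
IGNORED = re.compile(r"<!--.*?-->|<!\[CDATA\[.*?\]\]>", re.DOTALL)

# Simplified entity-reference pattern (ASCII names only). Numeric
# references such as &#xE9; do not match because of the leading '#'.
ENTITY_REF = re.compile(r"&([A-Za-z_][A-Za-z0-9._-]*);")


def find_named_entity_refs(raw_xml: str) -> list:
    """Return names of entity references that would need a DTD to resolve."""
    stripped = IGNORED.sub("", raw_xml)
    return [n for n in ENTITY_REF.findall(stripped) if n not in PREDEFINED]


sample = "<p>Caf&eacute; &amp; &#xE9; <!-- &nbsp; --> &copy; 2004</p>"
print(find_named_entity_refs(sample))
```

A check like this could be reported as a warning or an error, depending on how strongly the recommendation ends up being worded.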

Nikos-Markantonatos commented 9 years ago

I am a little skeptical about whether discouraging the use of CERs is indeed a best practice. I do understand the need for a DTD to resolve such references, but for most people CERs are mnemonic ASCII names for otherwise incomprehensible Unicode code points. For example, compare &eacute; with &#x000E9; in terms of readability. Besides, the NLM/JATS DTD Suites only use the ISO CERs and avoid defining their own proprietary CERs, to ease interchange.

Embedding the actual non-ASCII character directly in the XML is a solution, but then again, XML files with embedded non-ASCII characters are prone to unnoticed corruption when imported, exported, or filtered through non-Unicode applications.

I was once confronted with a huge legacy back-file where all non-ASCII Unicode character references had been inadvertently turned into &#xFFFD; without anyone noticing for years, just because some database used to store these XML files at some point in their long history did not support Unicode or had not been configured for the proper Unicode encoding. This is a risk that you can avoid by using CERs. Combined with readability, this makes CERs my favorite format for representing special characters.
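Numeric character references give the same ASCII-only protection without needing a DTD. A small sketch in Python, assuming text content that is already correctly decoded; both function names are hypothetical. The "xmlcharrefreplace" error handler is part of Python's standard codecs and emits decimal references.

```python
def to_ascii_with_ncrs(text: str) -> str:
    """Replace every non-ASCII character with a decimal numeric character reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


def has_replacement_char(text: str) -> bool:
    """Detect U+FFFD, the usual trace of an earlier lossy encoding step."""
    return "\ufffd" in text


print(to_ascii_with_ncrs("Caf\u00e9 \u00a9 2004"))
print(has_replacement_char("Caf\ufffd 2004"))
```

This is meant for element text and attribute values only. Applied blindly to a whole serialized file, it would also rewrite non-ASCII characters in element names, comments, and CDATA sections, where references are not recognized.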

JGilbert-eLife commented 9 years ago

eLife uses numeric character references (NCRs) rather than CERs, and we would agree with discouraging CERs.

vincentml commented 9 years ago

Nikos made some good points, which I agree with. Here we insist on using NCRs or CERs, and validate character encoding using Java, to reduce the chances of corruption and to eliminate ambiguities between similar-looking characters.

If the goal of discouraging the use of CERs is to reduce the reliance on DTDs, there are other areas to consider that are more challenging. One would also need to eliminate attribute default and fixed values and address issues of how whitespace is handled, which can differ if a DTD is not used to inform the infoset.

It seems better to educate users on character encoding in XML and how to set up their software environment to work with JATS. It might also help to press software vendors to support XML catalogs if they don't already.

Klortho commented 9 years ago

It sounds like we have a consensus to at least discourage the use of CERs. Is it okay if we make it a warning?

Klortho commented 9 years ago

Maybe my last comment was wrong. On rereading @Nikos-Markantonatos' and @vincentml's comments above, I see that they have valid points. Yes, my main complaint with CERs is that they require access to the DTDs.

I can see the need for named entity references, but I really think it's important to ease the pain of requiring a local copy of the DTD to use these documents. So I'd like to suggest:

Klortho commented 9 years ago

We talked about this a little more at the telecon last time, and I did an informal survey of a few XML experts at Balisage, and there seems to be a strong consensus that it's best to move away from dependence on DTDs. In other words, CERs should not be used in the final documents.

Klortho commented 9 years ago

I am trying to write this up now, and I would really like to say that JATS4R recommends not using CERs, period, since they require processors to read the DTDs.

Remember that the focus of JATS4R is to define what is the best way of producing reusable content -- in other words, when there's a tension between what makes things easier for producers and what makes things easier for consumers, our preference should be towards the consumers.

Ideally, XML should become as easy to use as JSON. In fact, I'd venture that DTDs are one of the reasons that XML is dying. That's not to say that we should do this to try to save XML, but just that it is one bit of evidence that XML is too painful to use for developers.

So, can we agree to recommend that CERs not be used in the final, delivered versions of the documents?

Nikos-Markantonatos commented 9 years ago

> ... there seems to be a strong consensus that it's best to move away from dependence on DTDs. In other words, CERs should not be used in the final documents.

This is fine by me.