commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

Entities #442

Open jgm opened 7 years ago

jgm commented 7 years ago

As noted in this thread, it might be desirable to change what the spec says about entities.

Arguably the spec should not require that entities be replaced (in the parsing phase) by Unicode characters. Replacement will be necessary for some output formats, but there is no reason why an implementation that only targets HTML should do the replacement at all, and even an implementation that targets multiple formats might choose to handle entities in the renderer or in an intermediate AST filter. Some implementations might also want to preserve entities in the output.

Currently the spec requires replacement for entities in a certain list. It would also simplify things not to have such a list.

jgm commented 7 years ago

Some experiments along these lines are in the entities branch of jgm/cmark. The branch creates a CMARK_NODE_ENTITY node type and does conversions in the man and latex renderers only.

Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).
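The renderer-side conversion described above could look roughly like the following. This is only a sketch in Python, not the code from the cmark branch: the tuple-based node representation and the names `render_html` and `render_plain` are assumptions made for illustration. The entity table is the `html5` dictionary from Python's standard `html.entities` module.

```python
from html import escape
from html.entities import html5

# Hypothetical node representation: ("text", str) or ("entity", name).


def render_html(nodes):
    """An HTML-only renderer can pass entity references through untouched."""
    out = []
    for kind, value in nodes:
        if kind == "entity":
            out.append(f"&{value};")
        else:
            # Ordinary text still needs &, < and > escaped for HTML.
            out.append(escape(value, quote=False))
    return "".join(out)


def render_plain(nodes):
    """A non-HTML renderer resolves entities itself, at output time."""
    out = []
    for kind, value in nodes:
        if kind == "entity":
            # html5 is keyed by names such as "copy;"; unknown names are
            # written back out literally.
            out.append(html5.get(value + ";", f"&{value};"))
        else:
            out.append(value)
    return "".join(out)
```

With this split the parser never needs an entity list; only renderers that cannot emit the reference verbatim have to consult one.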

tin-pot commented 7 years ago

Looks good so far!

Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).

Hmm. Isn't a "link title" in CommonMark just a fancy way to write the attribute value literal that ends up in the, well, title attribute?

And given that CommonMark does not even look at, let alone do any conversion or replacement for, attribute value literals in "HTML tags" anyway (and neither does Markdown in general, IIRC), leaving the "link title" string alone (maybe apart from checking for literal < and & to guard against XML's touchiness) would simply be consistent and justified behaviour in my view.

Following are some thoughts of mine on this matter.


Lexis

Regarding syntactically recognizing entity and character references, the spec should spell out that references of the "usual" form are recognized.

The following is basically copied from the XML syntax, except that a hex character reference may be introduced with either a lower-case or an upper-case "x".

Note that XML requires the terminating ";" character—and omits the (actual, not what HTML5 "terminology" says it is!) named character reference like &#SPACE; of SGML, which never took off outside SGML.

"See production 64 in annex K.4.1 of the 'Web SGML Adaptations' annex"

"Production 67 'Reference' in W3C REC XML 1.0"

reference = entity reference
          | character reference ;

entity reference = "&" , name , ";" ;

character reference = numeric character reference
                    | hex character reference ;

numeric character reference = "&#" , number , ";" ;

hex character reference = ( "&#x" | "&#X" ) , hex number , ";" ;

hex number = hex digit , { hex digit } ;

hex digit = Digit | "A".."F" | "a".."f" ;

number = Digit , { Digit } ;

name = name start character , { name character } ;

The character class Digit simply comprises the ten decimal digits, while the name start character and name character classes differ among versions of HTML, XML, etc.

Using the XML definition restricted to ISO 646 (which is what CommonMark currently does, implicitly but incompletely: e.g., it disallows "." in tag names) is probably good enough:

"Production 4 'NameStartChar' in W3C REC XML 1.0"

name start character = Letter | ":" | "_" ;

name character = name start character 
               | "-" | "." | Digit ;

Here Letter would just be the basic 52 upper and lower case letters of the ISO 646 repertoire.
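The grammar above translates fairly directly into a recognizer. The following is a minimal sketch in Python, assuming the ISO 646 restriction and the usual "&#x" / "&#X" prefix for hex character references. The names `REFERENCE` and `classify` are made up for illustration and are not taken from any implementation.

```python
import re

# Character classes restricted to ISO 646, as proposed in the text.
NAME_START = r"[A-Za-z:_]"
NAME_CHAR = r"[A-Za-z0-9:_.\-]"

# One alternative per kind of reference; the terminating ";" is required.
REFERENCE = re.compile(
    r"&(?:"
    r"#[xX](?P<hex>[0-9A-Fa-f]+)"
    r"|#(?P<num>[0-9]+)"
    rf"|(?P<name>{NAME_START}{NAME_CHAR}*)"
    r");"
)


def classify(text):
    """Return (kind, value) if `text` is exactly one reference, else None."""
    m = REFERENCE.fullmatch(text)
    if m is None:
        return None
    if m.group("hex") is not None:
        return ("hex character reference", int(m.group("hex"), 16))
    if m.group("num") is not None:
        return ("numeric character reference", int(m.group("num")))
    return ("entity reference", m.group("name"))
```

Note that the recognizer only checks the lexical form; whether a name such as `foo` denotes a known entity is a separate question left to later processing.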


In my opinion, a good argument could be made for allowing the terminating ";" to be omitted in certain cases, either because it is convenient, for example

you could&nbsp&ndash always&nbsp&ndash write like this!

is equivalent to

you could&nbsp;&ndash; always&nbsp;&ndash; write like this!

Or because it allows "joining lines" (exploiting the "lazy continuation line" rule, of course):

you could write about hyphen&shy
ation like this.

is equivalent to

you could write about hyphen&shy;ation like this.

If one defines an entity null with an empty replacement text, this provides an actual "line joining" feature:

you could just join two lines to&null
gether like this

is equivalent (after replacement, using <!ENTITY null "">) to

you could just join two lines together like this

This is what ISO 8879 SGML has always supported (even in "Minimal SGML Documents"), and I tend to find it useful. But it might be too much for authors accustomed to HTML/XML rules …
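The lenient termination rule illustrated above could be sketched as follows, in Python. This is an approximation for illustration, not an SGML parser: a reference ends at ";" or at a line end (both consumed), or at any other character that cannot be part of a name (kept). The entity table `ENTITIES` and the function `expand` are hypothetical.

```python
import re

# Hypothetical entity table; "null" mirrors <!ENTITY null ""> from the text.
ENTITIES = {"nbsp": "\u00a0", "ndash": "\u2013", "shy": "\u00ad", "null": ""}

# A name, then an optional terminator: ";" or a line end, which is consumed as
# part of the reference. Any other character simply ends the name and is kept.
SHORT_REF = re.compile(r"&([A-Za-z:_][A-Za-z0-9:_.\-]*)(?:;|\r?\n)?")


def expand(text):
    """Replace known entity references, with or without the closing ";"."""
    def repl(m):
        name = m.group(1)
        # Unknown names are left untouched, terminator included.
        return ENTITIES.get(name, m.group(0))
    return SHORT_REF.sub(repl, text)
```

Because the name is matched greedily, `&shyation` would be read as a reference to an entity called `shyation`, which is exactly why the line break (or an explicit ";") is needed in the hyphenation example.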

The insane decision in the HTML5 "syntax" to allow omitting ";" after some random set of entity names (presumably for compatibility reasons with some existing browsers?) is of course not something one should adopt. I wasn't even aware of that until now! But don't get me started about the HTML5 "syntax" anyway! ;-)


Processing

I agree that the spec should not require (but indeed allow) replacing entity references with (which? whatever?) replacement texts.

And, as I have argued, it seems wise to also forbid replacing numeric character references (at least for the ISO 646 repertoire), to preserve the distinction between, e.g., | (a literal U+007C VERTICAL LINE) and &#124;. This might be essential for further processing in a tool pipeline.

As far as the spec talks about the parsing result in terms of an AST (or—equivalently?—its representation as a CommonMark-DTD-valid XML document instance), some "entity reference" node type would suffice for unreplaced entity references, similar to your CMARK_NODE_ENTITY node type.

However, it might be useful to include an optional character number in case "resolution" of character entity references (in the parser) is desired. The pre-defined XML entities lt, gt, amp, quot, and apos would be obvious candidates for this. In DTD parlance, this node could look like this:

<!ELEMENT EntityRef EMPTY>
<!-- `name` (a NAME) is the entity name,
     `charnum` (a NUMBER) is the optional UCS code point if this 
     was recognized as a character entity reference -->
<!ATTLIST EntityRef
          name      NMTOKEN  #REQUIRED
          charnum   NMTOKEN  #IMPLIED>

I find it more natural to place the entity name in a NAME-typed attribute, alongside the optional code point in a NUMBER-typed one; XML, however, only knows NMTOKEN. But of course this doesn't constrain the structure of the CMARK_NODE_ENTITY node.

If the parser were allowed to replace entity references with something other than a Unicode character (that is, to really handle general entities, not just character entities), then the replacement text would be inserted directly (without delimiters or its own node) into the regular character data content, that is, into the CMARK_NODE_TEXT content or, respectively, the content of the <text> element in the XML representation. (This is consistent with ESIS and XML Infoset rules for "replaced" entities.)

And similarly for character references (lumping numeric and hex together, for this distinction is IMO negligible):

<!ELEMENT CharRef EMPTY>
<!-- `charnum` (a NUMBER) is the decimal UCS code point, whether given in the
     source document as a decimal or hexadecimal numeral -->
<!ATTLIST CharRef
          charnum   NMTOKEN  #REQUIRED>

One could possibly unite the CharRef and EntityRef element/node types into one type, but I'm not sure if I'd like that better. (That's basically what I do in an experimental and hacked-up clone of libsoldout where, in the commonmark branch, I live out my obsession with SGML shorthand syntax …)
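To make the two node types concrete, here is a sketch in Python of an inline pass that keeps references as separate nodes instead of replacing them. The class names `EntityRef` and `CharRef` and the `charnum` field mirror the DTD fragments above; everything else (`Text`, `PREDEFINED`, `parse_inline`) is an assumption for illustration and not cmark's actual API.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Union


@dataclass
class Text:
    content: str


@dataclass
class EntityRef:
    name: str
    charnum: Optional[int] = None  # set only for recognized character entities


@dataclass
class CharRef:
    charnum: int  # code point, whether written in decimal or hex


Node = Union[Text, EntityRef, CharRef]

# The five predefined XML entities, the suggested candidates for resolution.
PREDEFINED = {"lt": 0x3C, "gt": 0x3E, "amp": 0x26, "quot": 0x22, "apos": 0x27}

REF = re.compile(
    r"&(?:#[xX]([0-9A-Fa-f]+)|#([0-9]+)|([A-Za-z:_][A-Za-z0-9:_.\-]*));"
)


def parse_inline(source: str) -> List[Node]:
    """Split `source` into text and reference nodes without replacing anything."""
    nodes: List[Node] = []
    pos = 0
    for m in REF.finditer(source):
        if m.start() > pos:
            nodes.append(Text(source[pos:m.start()]))
        hex_digits, dec_digits, name = m.groups()
        if hex_digits is not None:
            nodes.append(CharRef(int(hex_digits, 16)))
        elif dec_digits is not None:
            nodes.append(CharRef(int(dec_digits)))
        else:
            nodes.append(EntityRef(name, PREDEFINED.get(name)))
        pos = m.end()
    if pos < len(source):
        nodes.append(Text(source[pos:]))
    return nodes
```

A literal "|" stays inside a `Text` node while "&#124;" becomes `CharRef(124)`, so later stages of a pipeline can still tell the two apart.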

"My clone of 'libsoldout' on GitHub"

mity commented 7 years ago

Here's one tricky issue that came up. Ideally, one would leave entities alone in link titles, rather than converting them to characters, at least if that's what one is doing generally. But we really can't do that, since link titles are represented as plain strings (not sequences of inline nodes).

Let me remind you that there are more such contexts:

jgm commented 7 years ago

+++ Martin Mitáš [Dec 04 16 13:52 ]:

Let me remind you that there are more such contexts:

  • Link title (included for the sake of completeness here)
  • Link destination (see Example 308)
  • Image ALT string (usually rendered differently from links; also note the difference in handling of nested versus non-nested images)
  • Info string in code fence line (see Example 309)

Actually, not the image ALT string (or, as we call it, the link description), since this is represented in cmark as a list of inlines, and we can just use ENTITY nodes there.

The problem really only arises for the other three contexts, where we just have a raw string.