Open robinst opened 8 months ago
The spec says this:

> Invalid Unicode code points will be replaced by the REPLACEMENT CHARACTER (U+FFFD). For security reasons, the code point U+0000 will also be replaced by U+FFFD.

But note that commonmark.js doesn't actually follow that. Take `&#150;` for example. In hex, 150 is 96, which would map to U+0096, the START OF GUARDED AREA (SPA) control character. However, it is converted to an en-dash `–`, see dingus.

The reason for this is that commonmark.js uses the `entities` package, which implements the HTML spec. The HTML spec has a replacement table for certain characters, see the parsing section.

(cmark behaves differently yet again: for certain code points, it uses a replacement character instead.)

Question: Should the spec prescribe the same mapping as the HTML spec? Or should it mention it and make it optional whether an implementation uses the replacement table or not?

(Originally raised here: https://github.com/commonmark/commonmark-java/issues/307#issuecomment-1974796674)
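For reference, a minimal sketch of how that mapping surfaces through the `entities` package (assuming its `decodeHTML` export; the exact entry point may vary between versions):

```js
// Sketch, assuming the `entities` package's decodeHTML export.
const { decodeHTML } = require("entities");

// Per HTML's numeric-character-reference replacement table,
// 0x96 is remapped to U+2013 (EN DASH) rather than U+0096.
console.log(decodeHTML("&#150;")); // "–"
console.log(decodeHTML("&#150;").codePointAt(0).toString(16)); // "2013"
```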
Well, first of all, I don’t think the user should do that. No need to encode an en-dash in markdown. And if you do, use `&ndash;` or `&#8211;` or `&#x2013;`. It’s been a unicode world for a while now!
CM doesn’t follow all of HTML’s character reference stuff. It doesn’t allow `&amp` without the `;`, for example.
As I understand it, HTML’s behavior of mapping those C1 controls to an en-dash and the like comes from its first rule being to not support such controls; then, instead of using a replacement character, they chose to do something perhaps more useful: use the corresponding Windows-1252 character. See also: https://github.com/wooorm/character-reference-invalid#what-is-this.
I’d personally be open to doing the same as HTML here. It would mean that every markdown parser has to embed a list, though: https://github.com/wooorm/character-reference-invalid/blob/main/index.js. Is it really an improvement to the world when markdown adds workarounds for Windows 1252? I’d rather use a replacement character instead. I understand the current spec as saying that. Along these lines: https://github.com/micromark/micromark/blob/8b08d16891f4d6801eb0c0e990bd7b87836c0826/packages/micromark-util-decode-numeric-character-reference/dev/index.js#L20-L39.
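For concreteness, a sketch of what embedding that list means for a decoder. The map entries below are a few rows of the real HTML replacement table; the decode function itself is hypothetical, loosely modelled on the micromark helper linked above:

```js
// Sketch: a few entries of HTML's numeric-character-reference
// replacement table (the full table is in the HTML parsing spec).
const windows1252Remap = new Map([
  [0x80, 0x20ac], // EURO SIGN
  [0x91, 0x2018], // LEFT SINGLE QUOTATION MARK
  [0x92, 0x2019], // RIGHT SINGLE QUOTATION MARK
  [0x96, 0x2013], // EN DASH
  [0x97, 0x2014], // EM DASH
]);

// Hypothetical decoder: either follow HTML's remapping, or do what
// this comment argues for and fall back to U+FFFD (roughly what the
// micromark decode helper linked above does).
function decodeNumericReference(code, followHtml) {
  if (followHtml && windows1252Remap.has(code)) {
    return String.fromCodePoint(windows1252Remap.get(code));
  }
  const isSurrogate = code >= 0xd800 && code <= 0xdfff;
  const isOutOfRange = code > 0x10ffff;
  const isC0OrC1Control =
    (code < 0x20 && ![0x09, 0x0a, 0x0d].includes(code)) ||
    (code >= 0x7f && code <= 0x9f);
  if (code === 0 || isSurrogate || isOutOfRange || isC0OrC1Control) {
    return "\uFFFD";
  }
  return String.fromCodePoint(code);
}

console.log(decodeNumericReference(0x96, true));  // "–"
console.log(decodeNumericReference(0x96, false)); // "�"
```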
> I’d rather use a replacement character instead. I understand the current spec as saying that.
Well, the spec says to do this for invalid code points. But 0096 isn't an invalid code point, is it?
> (cmark behaves differently yet again: for certain code points, it uses a replacement character instead.)
These code points are the surrogates; they can't be represented on their own in Unicode text or in any of the Unicode transformation formats (e.g. you can't serialize or deserialize a lone surrogate to/from a valid UTF-8 byte stream). So specifying one as a character reference makes no sense, and given that CommonMark never errors, the only sensible thing to do here is to replace them by the Unicode replacement character. HTML does the same.
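A quick way to see this from JavaScript, where the WHATWG encoder substitutes U+FFFD for a lone surrogate when producing UTF-8:

```js
// A lone surrogate cannot be serialized to UTF-8; the WHATWG
// encoder substitutes U+FFFD (bytes EF BF BD) instead.
const bytes = new TextEncoder().encode("\uD800");
console.log([...bytes].map((b) => b.toString(16))); // ["ef", "bf", "bd"]
```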
When I opened https://github.com/commonmark/commonmark-spec/issues/369 I already noted that this bit of the spec might benefit from some clarification as to exactly which code points you are allowed to specify via numeric character references.
Regarding the issue at hand I don't have a strong opinion, but I suspect the HTML definition is driven by some kind of "no existing document should break" constraint, and it would feel slightly odd to replicate that behaviour in CommonMark.
Also, I would find it strange to do it for numeric character references but not for literal (raw) occurrences of such characters. But then, if we do this for the literals as well, we might lose the ability to directly embed some Unicode text-based formats/sources in code fences. I tend to find it a good property that CommonMark lets any of the oddest Unicode scalar values (except U+0000) creep in, at least in those blocks.
> I’d rather use a replacement character instead. I understand the current spec as saying that.

> Well, the spec says to do this for invalid code points. But 0096 isn't an invalid code point, is it?
I'm not too well-versed on this (and I wouldn't use the term "invalid" myself), but my feeling/understanding is that Unicode keeps this space "empty" for compatibility reasons with other encodings, and that HTML only assigns things to it because they'd be "invalid" otherwise. My understanding may be flawed, though!
Basically these "characters" are there because Unicode, to fulfill its duty, decided to project every other existing character encoding into itself. In particular, the code points U+0080 to U+00FF represent the corresponding ISO-8859-1 (latin1) codes. These include the U+0080–U+009F range of so-called C1 control codes. So the space is not really empty; its semantics is well defined (see the chart here).
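Incidentally, this conflation is baked into the web platform: the WHATWG Encoding standard treats the latin1/iso-8859-1 labels as aliases for windows-1252, so decoding byte 0x96 in a browser already yields an en-dash rather than U+0096:

```js
// WHATWG Encoding maps the "latin1" label to windows-1252,
// so byte 0x96 decodes to U+2013 (EN DASH), not U+0096.
const ch = new TextDecoder("latin1").decode(Uint8Array.of(0x96));
console.log(ch.codePointAt(0).toString(16)); // "2013"
```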
I'm not sure any format actually uses these codes with their purported semantics, though. So one thing that CommonMark could do is to replace all the C0 and C1 control characters (the Unicode general category Cc) except TAB, CR and LF by the Unicode replacement character (see the sketch below), because Cc characters tend to be forbidden in text-based formats.
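A sketch of that hypothetical transform (the regex and the TAB/LF/CR carve-out are my reading of the proposal, not spec text):

```js
// Hypothetical: strip C0/C1 controls (Unicode category Cc),
// keeping TAB, LF and CR, replacing the rest with U+FFFD.
function replaceControls(text) {
  return text.replace(/(?![\t\n\r])\p{Cc}/gu, "\uFFFD");
}

console.log(replaceControls("a\u0096b")); // "a�b"
console.log(replaceControls("a\tb"));     // "a\tb" (kept)
```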
But then sometimes they are not forbidden. For example, XML is rather liberal, and JSON, while it mandates escaping of the C0 control characters, does not care about having raw C1 characters in its strings.

So if most Cc characters were to be replaced by the Unicode replacement character in CommonMark, code blocks could then no longer represent every JSON document out there (though they could still represent the majority of useful JSON documents :-). And if literal (raw) C1 characters were to be replaced according to the HTML definitions mentioned above, it would get extremely confusing for those JSON documents in code blocks that have them unescaped in their strings.
Code blocks won’t be affected, as this issue is about numeric character references.
> Code blocks won’t be affected, as this issue is about numeric character references.
That depends on whether you also apply the transform for these code points (whether the replacement character or the surprising HTML behaviour) to the literal (raw) characters that bear the same code point. In my opinion this should be done, otherwise the character escape system becomes surprising and idiosyncratic for no good reason.
(That being said, personally I'm rather convinced that the right answer to the title of this issue is: no.)
Hmm, I guess that could be done, but to me it seems like it would be a different issue? There’s no such text for that yet in the markdown spec or in HTML:
```js
// HTML keeps a literal U+0096 in text content as-is (no remapping):
document.body.textContent = "a\u0096b";
document.body.textContent.codePointAt(1).toString(16); // "96", not "2013"
```
> Hmm, I guess that could be done, but to me it seems like it would be a different issue?
I don't see a different issue. If you are going to solve this one by deciding to follow the HTML character mapping only for numeric character references, it's good to also reflect on what your character escape system has effectively become for end users (an inconsistent mess, if you ask me).
> (cmark behaves differently yet again: for certain code points, it uses a replacement character instead.)

> These code points are the surrogates
Ah yes, scratch that part (I wasn't reading the code properly; I do know what surrogates are). GitHub does replace `&#150;` with a replacement character, see here (though I'm not sure where that happens, it might be after cmark):
I agree that adding the table adds some complexity, and I'm not sure it's worth it just to support a legacy way of specifying some characters. (I think I'll close the commonmark-java issue as "won't fix". It can always be reopened in case the spec changes.)
Oh interesting, there was just a ping on my 5-year-old #614, which is very much related.