commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

Characters should be Unicode scalar values not Unicode code points. #369

Open dbuenzli opened 9 years ago

dbuenzli commented 9 years ago

It seems that the standard is defined independently of the actual Unicode encoding scheme, which is a good idea.

However, commit 333c7713fda24fe3444a59b664e3ca55fd45b32c defines a character to be any Unicode code point (http://unicode.org/glossary/#code_point). To be precise and correct from a Unicode point of view, characters should be defined as Unicode scalar values (http://unicode.org/glossary/#unicode_scalar_value): the scalar values are all the code points except the surrogate code points (http://unicode.org/glossary/#surrogate_code_point), a hack to be able to encode all scalar values in the UTF-16 encoding forms.

After decoding a valid Unicode encoding scheme (i.e. a UTF-X variant), what you get is a sequence of scalar values, not a sequence of code points. It is illegal to encode a surrogate code point in the UTF-X encoding schemes; only scalar values can be encoded (see definitions D79, D90, D91, D92 in http://www.unicode.org/versions/Unicode7.0.0/ch03.pdf). Surrogate code points can be present in the UTF-16 encoding of arbitrary Unicode text, but that's precisely an encoding detail, something the document wishes not to talk about; once you have decoded that UTF-16 encoding, no surrogates are left in the result.
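
In code, the whole distinction reduces to a range check. A minimal JavaScript sketch (the function name is illustrative, not from any implementation):

    // A Unicode scalar value is any code point in 0x0..0x10FFFF that is
    // outside the surrogate range U+D800..U+DFFF.
    function isScalarValue(cp) {
      return Number.isInteger(cp) &&
             cp >= 0x0000 && cp <= 0x10FFFF &&
             !(cp >= 0xD800 && cp <= 0xDFFF);
    }

    console.log(isScalarValue(0x0041));   // true  (LATIN CAPITAL LETTER A)
    console.log(isScalarValue(0xD800));   // false (a high surrogate)
    console.log(isScalarValue(0x10FFFF)); // true  (the last scalar value)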

jgm commented 9 years ago

Thanks. This seems sensible.

tin-pot commented 8 years ago

Is "Unicode scalar value" the same as "code point of a Unicode character" then? It is my understanding that hi and lo surrogate code points in the BMP are not characters by definition, hence the question.

So the easiest-to-understand wording could be to just talk about Unicode characters, while staying silent about

  1. that the "more correct" term would probably be "Unicode abstract character";
  2. any form of coded character set and encoding in which the input "text" is presented.

Or just require the input to be in UTF-8?

dbuenzli commented 8 years ago

Is "Unicode scalar value" the same as "code point of a Unicode character" then?

No, Unicode scalar values are exactly the Unicode code points minus the surrogate characters, the latter being a hack to be able to encode all scalar values in the UTF-{16,16BE,16LE} encoding forms.

More precisely, Unicode scalar values are the integers that you are allowed to encode to and decode from the Unicode transformation formats UTF-{8,16,16BE,16LE,32,32BE,32LE}.
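
For instance, JavaScript strings are sequences of UTF-16 code units, so a string value can hold a lone surrogate; but the standard WHATWG encoder will not pass it through to UTF-8 (a small illustration of the point above):

    const lone = "\uD800";                        // a lone high surrogate
    const bytes = new TextEncoder().encode(lone); // lone surrogates become U+FFFD
    console.log([...bytes].map(b => b.toString(16))); // [ 'ef', 'bf', 'bd' ] = U+FFFD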

dbuenzli commented 8 years ago

@jgm A few comments on decimal and hexadecimal entities:

Decimal entities consist of &# + a string of 1–8 arabic digits + ;. Again, these entities need to be recognised and transformed into their corresponding unicode codepoints. Invalid unicode codepoints will be replaced by the “unknown codepoint” character (U+FFFD). For security reasons, the codepoint U+0000 will also be replaced by U+FFFD.

It would also be good to clarify in this paragraph that the integer decoded by a decimal or hexadecimal entity must be a Unicode scalar value minus U+0000. The notion of "invalid unicode code point" is not clear at all in this context.

Side note: XML decimal and hexadecimal entities have further constraints, since those have to resolve to a character of the underlying character stream as defined in the XML specification; it also seems to be the same in HTML, which also removes e.g. control characters. But of course it's good that you can express any scalar value of your underlying scalar value stream, which is apparently not constrained in CommonMark the way it is in these other standards.

Also, rather than "unknown codepoint character" you should say "Unicode replacement character": that's its name and what you want to search for online to understand what it is.
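
Put together, the constraint proposed here could look like the following sketch (resolveNumericEntity is a hypothetical helper, not cmark's API): the decoded integer must be a scalar value other than U+0000, and everything else maps to U+FFFD.

    // Resolve a decimal or hexadecimal entity per the rule sketched above.
    function resolveNumericEntity(entity) {
      const m = /^&#(?:[Xx]([0-9A-Fa-f]{1,8})|([0-9]{1,8}));$/.exec(entity);
      if (m === null) return null; // not a numeric entity at all
      const cp = m[1] !== undefined ? parseInt(m[1], 16) : parseInt(m[2], 10);
      const ok = cp > 0 && cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
      return String.fromCodePoint(ok ? cp : 0xFFFD); // U+FFFD on any invalid value
    }

    console.log(resolveNumericEntity("&#65;"));    // "A"
    console.log(resolveNumericEntity("&#x0;"));    // "\uFFFD" (U+0000 disallowed)
    console.log(resolveNumericEntity("&#xD800;")); // "\uFFFD" (lone surrogate)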

tin-pot commented 8 years ago

I don't understand this sentence:

No, Unicode scalar values are exactly the Unicode code points minus the surrogate characters, the latter being a hack to be able to encode all scalar values in the UTF-{16,16BE,16LE} encoding forms.

How can "Unicode code points" and "[surrogate] characters" have a common superset, so that you can "subtract" one from the other? Shouldn't that be:

No, Unicode scalar values are exactly the Unicode code points except high-surrogate and low-surrogate code points, [...]

which is literally what the Unicode Glossary (that you referred to) says.

So it still seems, at least to me, that "surrogate code points" neither denote nor are Unicode (abstract) characters: they are, as you said, an "encoding hack".

Which means that the terms "Unicode scalar value" and "Unicode code point associated with a character" are synonymous after all?

dbuenzli commented 8 years ago

Which means that the terms "Unicode scalar value" and "Unicode code point associated with a character" are synonymous after all?

No, because, for example, the so-called reserved code points and noncharacters have no abstract characters associated with them, but they are scalar values. See the "Types of Code Points" chart on p. 30 of this document.

Both reserved code points (which may be assigned an abstract character in the future, like a new emoji) and noncharacters are allowed to be interchanged in the Unicode transformation formats.

In general, the term character is so overloaded in Unicode that you'd better avoid it. Concretely, what most programmers are interested in when dealing with text interchange is scalar values: that's what they have to process, and decode from and encode to the UTF-X formats.

tin-pot commented 8 years ago

Concretely, what most programmers are interested in when dealing with text interchange is scalar values: that's what they have to process, and decode from and encode to the UTF-X formats.

Well, except when these programmers have to deal with UTF-16 surrogate pairs (for decoding and encoding), of course, in which case not only "scalar values" are interesting.

But I bet you're right, and most programmers would still get the surrogate-pair thing right and would still not know or use the term "scalar value". My point is simply that "scalar value" is a highly special Unicode term, while "Unicode character"---fuzzy or not---conveys the intended meaning just fine: the input text is a sequence of Unicode characters. (You'll see below that the XML 1.0 spec uses exactly this cop-out definition.) Whether some of these are "reserved characters", or "unassigned characters", or characters in the "private use area": why would a CommonMark processor give the slightest f**k about it? It is not a "Unicode conformity checker" but a CommonMark processor, after all! It can easily get by with the simple rule: every non-markup character (ie all non-ASCII characters in the moment) is simply passed through into the output, whatever Unicode character status it has.


In general, the term character is so overloaded in Unicode that you'd better avoid it.

And I'm absolutely sure that neither I nor anyone else in their right state of mind would "avoid the term character" in a specification of a "plain text" syntax, FCOL---and the CommonMark spec is the topic here, after all. Do you honestly suggest eliminating the word "character" from the CommonMark spec? Really?


It would also be good to clarify in this paragraph that the integer decoded by a decimal or hexadecimal entity must be a Unicode scalar value minus U+0000. The notion of "invalid unicode code point" is not clear at all in this context.

I'm sorry, this makes no sense to me again:

  • there are no "XML decimal and hexadecimal entities": there are only (1) [various kinds of] entities, (2) entity references like &foobar;, which do refer to these entities (no s**t!), and (3) numeric character references like &#42;;
  • a Unicode code point is just an integer (in the range 0 .. 0x10FFFF); why would one have to subtract 0 (zero) from it, and what would it mean to "subtract U+0000" by all means?

Why don't you actually read the definitions and specifications you link to?


Side note: XML decimal and hexadecimal entities have further constraints, since those have to resolve to a character of the underlying character stream as defined in the XML specification; it also seems to be the same in HTML, which also removes e.g. control characters. But of course it's good that you can express any scalar value of your underlying scalar value stream, which is apparently not constrained in CommonMark the way it is in these other standards.

Yes, of course not all numbers are valid in a numeric character reference; there are (by any reasonable definition) only finitely many characters, you know---if that is what your "side note" means by "further constraints". [And only an entity reference can get resolved; a numeric character reference is simply a way to "enter the character as data". The term named character reference is sometimes used for things like &auml;: this is in fact an entity reference, where the entity referred to is defined to be a literal containing just a (numeric) character reference: the only difference from a regular (numeric) character reference here is that a name is used, hence the slightly muddled term named character reference.]

In fact: which character numbers (not: code points, which is a Unicode term) such numeric character references can "legally" specify depends---as you vaguely said---on the character set available per definition in the particular (HTML/XML/SGML/XHTML) document.


Which is, in the example case of HTML 4.01, defined in the associated SGML declaration like this, to be (a derivation from) the ISO/IEC 10646-1:1993 character set (in the sense of ISO 8879:1986, ie "a mapping of a character repertoire onto a code set such that each character is associated with its coded representation."):

    CHARSET
      BASESET  "ISO Registration Number 177//CHARSET
                ISO/IEC 10646-1:1993 UCS-4 with
                implementation level 3//ESC 2/5 2/15 4/6"
      DESCSET  0       9        UNUSED
               9       2        9
               11      2        UNUSED
               13      1        13
               14      18       UNUSED
               32      95       32
               127     1        UNUSED
               128     32       UNUSED
               160     55136    160
               55296   2048     UNUSED  -- SURROGATES --
               57344   1056768  57344

So not all control characters are forbidden; one can explicitly use &#9;, &#10;, &#13;. And note that the numbers corresponding to Unicode surrogate code points are explicitly exempted! However, you can't use the C1 set of control characters, or DELETE (U+007F); but from U+00A0 (decimal 160) on you can use any UCS character (but again: not the 2048 numbers for surrogate code points).
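
For the record, those DESCSET rows translate mechanically into a predicate over character numbers (a sketch; the function name is made up):

    // The HTML 4.01 document character set, read off the DESCSET rows above.
    function isHtml401CharNumber(n) {
      return n === 9 || n === 10 || n === 13 ||  // HT, LF, CR
             (n >= 32 && n <= 126) ||            // printable ASCII
             (n >= 160 && n <= 55295) ||         // up to just before the surrogates
             (n >= 57344 && n <= 1114111);       // past the surrogates, up to U+10FFFF
    }

    console.log(isHtml401CharNumber(9));      // true  (&#9; is allowed)
    console.log(isHtml401CharNumber(127));    // false (DELETE is UNUSED)
    console.log(isHtml401CharNumber(155));    // false (C1 controls are UNUSED)
    console.log(isHtml401CharNumber(55296));  // false (SURROGATES are UNUSED)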


For XML 1.0, which is not an SGML application, and precisely because it is not an SGML application, the W3C spec simply and "globally" relies on Unicode (or rather ISO 10646:2000 "UCS"):

Definition: A character is an atomic unit of text as specified by ISO/IEC 10646:2000 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal characters of Unicode and ISO/IEC 10646.

Note that the "legal characters" HT, CR, LF are precisely the three control characters allowed by the HTML 4.01 SGML declaration, as we have seen.

And why would the wording "and the legal characters of Unicode and ISO/IEC 10646" be good enough for the official W3C XML specification, but not good enough for the humble CommonMark spec again?


I have no idea how HTML5 treats character sets and encodings (I would guess similarly to or the same as XML), but you can look that one up for yourself.


So, to summarize the breakthrough results of our little discussion so far:

  1. Not every natural number is associated with a character (not in Unicode, and not in any other character set);
  2. You should not specify an input text character by such a number (ie a number not associated with a character).

Duh!


[And with this I'm out of here, as this turns more and more into a nit-picking match, not a discussion about flaws and improvements regarding the CommonMark spec.]

dbuenzli commented 8 years ago

Well, except when these programmers have to deal with UTF-16 surrogate pairs, of course, in which case not only "scalar values" are interesting.

This doesn't happen if you have a proper UTF-X decoder API. Again, you will never get surrogate code points out of a UTF-X decoding process.

every non-markup character (ie all non-ASCII characters in the moment) is simply passed through into the output, whatever Unicode character status it has.

That's precisely what I'd like to avoid. We want proper layering of decoders: a UTF-X decoder that handles encoding-level issues and their errors and outputs a stream of Unicode scalar values, to be processed by a CommonMark parser that handles CommonMark issues over a well-defined set of Unicode scalar values (the Unicode scalar values, or a restriction thereof).
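
As a quick illustration of that layering with the standard WHATWG encoding API (a sketch of the idea, not a prescription for cmark):

    // "hi" followed by 0xED 0xA0 0x80, the (illegal) UTF-8 encoding of the
    // surrogate U+D800. The decoder layer resolves the error to U+FFFDs.
    const bytes = new Uint8Array([0x68, 0x69, 0xED, 0xA0, 0x80]);
    const text = new TextDecoder("utf-8").decode(bytes);

    // Iterating a string with for...of yields one code point per step; after
    // decoding, none of them can be a lone surrogate.
    for (const ch of text) {
      console.log(ch.codePointAt(0).toString(16)); // 68, 69, fffd, fffd, fffd
    }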

My point is simply that "scalar value" is a highly special Unicode term, while "Unicode character"---fuzzy or not---conveys the intended meaning just fine: the input text is a sequence of Unicode characters. (You'll see below that the XML 1.0 spec uses exactly this cop-out definition.)

As a potential implementer I want a specification that is as unambiguous as possible and that doesn't lead to problems like the ones code point escapes cause in the JSON spec, where lone surrogates can be found and have undefined behaviour because they don't make any sense.

The right term for this has a formal definition in the Unicode standard, which is Unicode scalar value. If the CommonMark specification wants to say "what we mean by a Unicode character is a Unicode scalar value", or "a Unicode code point in the ranges 0x0000 ... 0xD7FF and 0xE000 ... 0x10FFFF", I absolutely don't care, as long as it chooses the right numbers and makes them precise.

there are no "XML decimal and hexadecimal entities": there are only (1) [various kinds of] entities, (2) entity references like &foobar;, which do refer to these entities (no s**t!), and (3) numeric character references like &#42;,

Yes, these are XML character references. I was confused because that's the vocabulary CommonMark uses, i.e. decimal entities and hexadecimal entities. Sorry not to be a definitional machine; I hope other potential readers will have understood.

a Unicode code point is just an integer (in the range 0 .. 0x10FFFF); why would one have to subtract 0 (zero) from it, and what would it mean to "subtract U+0000" by all means?

A range of integers defines a set of integers. In set theory, subtraction means removing elements from a set (a.k.a. difference or complement).

CommonMark already removes U+0000 from the set of allowed integers in decimal and hexadecimal entities. My comment was that the set of disallowed integers has to be clearly specified, and that it should at least remove the surrogates: if CommonMark defines its "characters" as scalar values, which @jgm seemed to agree with, then you wouldn't know how to proceed if a lone surrogate were acceptable as a decimal or hexadecimal entity.

tin-pot commented 8 years ago

This doesn't happen if you have a proper UTF-X decoder API. Again, you will never get surrogate code points out of a UTF-X decoding process.

I never doubted that.

every non-markup character (ie all non-ASCII characters in the moment) is simply passed through into the output, whatever Unicode character status it has.

That's precisely what I'd like to avoid. We want proper layering of decoders: a UTF-X decoder that handles encoding-level issues and their errors and outputs a stream of Unicode scalar values, to be processed by a CommonMark parser that handles CommonMark issues over a well-defined set of Unicode scalar values (the Unicode scalar values, or a restriction thereof).

Yes, decoding (eg from UTF-8, or UTF-16, or ISO 8859-1, or EBCDIC ;-) is done by a "decoder layer", if you want to call it that. And the result is conceptually a stream of characters, no matter the representation (as UTF-32 scalar values, or still UTF-8, or whatever). The important thing is that the differences between encodings are done away with after this step (or "above this layer"), and we can (conceptually) deal with "characters" instead of "codes".

What kind—if any—of diagnostic messages a CommonMark processor (or the "decoder layer" inside it) would be required to produce for UTF-X encoding errors: do you really think this belongs in the CommonMark spec? What does the XML or HTML spec say about this? Right: nothing (except giving a non-validating XML processor a free pass to undefined behavior, if I understand sub-section 5.1 correctly).

As a potential implementer I want a specification that is as unambiguous as possible and that doesn't lead to problems like the ones code point escapes cause in the JSON spec, where lone surrogates can be found and have undefined behaviour because they don't make any sense.

As a potential implementer of a CommonMark-consuming program, you wouldn't have to deal with encoding errors, I hope. And the CommonMark spec shouldn't have to either. And the set of valid numbers in a numeric character reference is implicitly fixed by the following definition too (although an extra hint to the spec reader would be helpful).

I have yet to see and understand the problem you have with a definition like this (paraphrased from the XML 1.0 spec):

A CommonMark document is text, ie a sequence of characters. A character is an atomic unit of text as specified by ISO/IEC 10646:2000. Legal characters are

  • tab (HT), carriage return (CR), line feed (LF), and
  • the legal characters of Unicode and ISO/IEC 10646.

The versions of these standards cited here were current at the time this document was prepared. New characters may be added to these standards by amendments or new editions. Consequently, CommonMark processors MUST accept any character in the range specified for Char.

Note that there is no mention of any specific character set or encoding (that's why there is an encoding attribute in the XML declaration), or of platform-specific end-of-line or end-of-file marking.

Note also that "the range specified for Char" is:

    Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
           | [#x10000-#x10FFFF]
           /* any Unicode character, excluding the
              surrogate blocks, FFFE, and FFFF. */

So no mention of "unassigned characters" or "scalar values"; U+0000 is excluded without much ado, as are most other C0 characters (but not C1 and G1!).
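
Indeed, the production is unambiguous enough to transcribe mechanically into a range check (an illustrative sketch, with an invented name):

    // XML 1.0's Char production as a predicate over code points.
    function isXmlChar(cp) {
      return cp === 0x9 || cp === 0xA || cp === 0xD ||
             (cp >= 0x20 && cp <= 0xD7FF) ||
             (cp >= 0xE000 && cp <= 0xFFFD) ||
             (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    console.log(isXmlChar(0x0000)); // false (NUL, excluded without much ado)
    console.log(isXmlChar(0x0085)); // true  (a C1 control, allowed by Char)
    console.log(isXmlChar(0xD800)); // false (surrogate)
    console.log(isXmlChar(0xFFFE)); // false (excluded along with FFFF)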

Where do you see ambiguity here, or the pressing need to talk about "scalar values", or layered decoders, or encoding errors?

Doesn't the specification fragment satisfy your concern:

If the CommonMark specification wants to say "what we mean by a Unicode character is a Unicode scalar value", or "a Unicode code point in the ranges 0x0000 ... 0xD7FF and 0xE000 ... 0x10FFFF", I absolutely don't care, as long as it chooses the right numbers and makes them precise.

The above Char definition does exclude even more code points!

Would a hint that an implementation may (or even MUST) refuse an input byte stream which does not decode to a sequence of such Char members make you happy as an implementer?


[NOTE: I would rather say at the "start" of the CommonMark spec that the input is a "document" or "text", which is a sequence of characters, and leave questions about encoding, end-of-line representation, and so on "at the door", and in the hands of an implementation. (And the repertoire can simply be defined to be that of ISO/IEC 10646:2000, of course.)]

A single requirement that each CommonMark processor must at least support input in UTF-8, with CR-LF denoting the end of a line (say), could set a reasonable "base level" of encoding support, however. Support for any other encoding would be an implementation choice.
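
Both behaviours (refuse vs. replace) come essentially for free with the standard WHATWG decoder, for what it's worth (illustration only, not a requirement):

    const bad = new Uint8Array([0x41, 0xC0, 0xAF]); // "A" + an overlong, ill-formed sequence

    // Replacing decoder: ill-formed input becomes U+FFFD and processing goes on.
    console.log(new TextDecoder("utf-8").decode(bad)); // "A\uFFFD\uFFFD"

    // Fatal decoder: a processor that MUST refuse bad input can simply let this throw.
    try {
      new TextDecoder("utf-8", { fatal: true }).decode(bad);
    } catch (e) {
      console.log("rejected:", e.name); // rejected: TypeError
    }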


A range of integers defines a set of integers. In set theory, subtraction means removing elements from a set (a.k.a. difference or complement).

[A personal remark from someone with kind-of a clue about set theory: It is a looooong jump from writing the words

the integer decoded by a decimal or hexadecimal entity must be a Unicode scalar value minus U+0000.

to then go on lecturing about set theory as if this "minus" had actually occurred between phrases denoting sets and would "obviously" have meant "set minus": but there is "the integer", which is "a Unicode scalar value", on the left, and "U+0000" on the right, so I can see no set anywhere (although I now understand what you meant).

So just don't do that, okay?]


PS: I think much of the confusion and debate stems from the fact that while "natural number" and "character" are both abstractions, it feels "natural" to say "there is a 65 in the next byte", but "too simple" to say "there is a LATIN CAPITAL LETTER A in the next byte". But in both cases we simply map a bit pattern to an abstract "value" of a certain "type"---the difference is IMO really just in the much bigger set of (conventional) maps to choose from in the second case.

dbuenzli commented 8 years ago

You are obviously making this harder than it should be, and I won't pursue this discussion further --- just note that "the legal characters of Unicode" is not a concept you can find formally defined in the Unicode Standard, and it would thus make the CommonMark spec very ambiguous.

The formal set of allowed Unicode code points, or a well-defined named subset of them, should be enough, along with the same constraints applied to hex and decimal entities. This is what is missing in the current incarnation of the CommonMark spec.

I hope my points have been understood by the editors of the spec.

tin-pot commented 8 years ago

just note that "the legal characters of Unicode" is not a concept you can find formally defined in the Unicode Standard, and it would thus make the CommonMark spec very ambiguous.

Since this wording would make the CommonMark spec (regarding the use of Unicode) precisely as ambiguous as the W3C Recommendation XML 1.0 (Fifth Edition): I think we could all live with this kind of "very ambiguous specifications" …

You seem to ignore the above definition for the subset Char of code points, which is copied verbatim from the XML spec too—and I honestly can't see any ambiguity there.

That definition would fit CommonMark quite well if more precision in that regard is desired. Alas, the only difference in substance it would make is that it would explicitly exclude

  1. most C0 control characters,
  2. the Unicode surrogate code points,
  3. the noncharacters U+FFFE and U+FFFF,

which is kind-of understood anyway if one talks about "Unicode characters", IMO.

dbuenzli commented 8 years ago

precisely as ambiguous as the W3C Recommendation XML 1.0 (Fifth Edition): I think we could all live with this kind of "very ambiguous specifications" …

Yeah you are right. Let's copy broken specifications, perpetuate undefined vocabulary and avoid precise meaning at all costs.

tin-pot commented 8 years ago

I'm still waiting for your explanation of what exactly is "broken", "undefined", "ambiguous", or "without precise meaning" in the given definition for Char, which obviously is what "legal Unicode characters" is supposed to mean.

As we have seen, this definition of Char coincides (with the minor exceptions of DELETE (U+007F) and the C1 controls) with the document character set of the HTML 4.01 SGML declaration quoted above.

With that many broken, undefined, imprecise specifications and standards around: isn't it time for you to offer your good advice to the W3C, to ISO/IEC JTC1/SC34, and to the ITU-T Study Group 17, who all seem to be incapable of drafting a proper specification that satisfies your aspirations?

Or are you kidding? You are, right?

[Or is the absence of the term "Unicode scalar value" in all of the referenced standards and specifications what makes you feel that these are somehow wrong?]

dbuenzli commented 8 years ago

I don't think you are reading what I'm writing.

Please point me to a definition of "the legal characters of Unicode" in the Unicode standard.

[Or is the absence of the term "Unicode scalar value" in all of the referenced standards and specifications what makes you feel that these are somehow wrong?]

They are not wrong; they all list precisely the code points they allow. Again, that's exactly and only what I want --- and the inane terminology you suggest, "the legal characters of Unicode", should not be used, as it's nowhere formally defined, and especially not in the Unicode Standard.