commonmark / cmark

CommonMark parsing and rendering library and program in C
Other
1.6k stars 527 forks source link

commonmark renderer hangs on certain unicode inputs #27

Closed jgm closed 9 years ago

jgm commented 9 years ago

Found with some help from afl. This is a renderer bug, not a parser bug, as it only affects commonmark (and sometimes man) output.

python3 -c 'print("\uffff")' | cmark -t commonmark
nwellnhof commented 9 years ago

The problem is that utf8proc_iterate is stricter than utf8proc_valid. The former also checks for non-characters (U+FDD0..U+FDEF and the last two code points of every plane). After working with Unicode extensively over the last years, my opinion is that libraries like libcmark shouldn't treat non-characters in a special way. Also see the following questions from the Unicode FAQ:

Q: Are noncharacters prohibited in interchange?

A: This question has led to some controversy, because the Unicode Standard has been somewhat ambiguous about the status of noncharacters. The formal wording of the definition of "noncharacter" in the standard has always indicated that noncharacters "should never be interchanged." That led some people to assume that the definition actually meant "shall not be interchanged" and that therefore the presence of a noncharacter in any Unicode string immediately rendered that string malformed according to the standard. But the intended use of noncharacters requires the ability to exchange them in a limited context, at least across APIs and even through data files and other means of "interchange", so that they can be processed as intended. The choice of the word "should" in the original definition was deliberate, and indicated that one should not try to interchange noncharacters precisely because their interpretation is strictly internal to whatever implementation uses them, so they have no publicly interchangeable semantics. But other informative wording in the text of the core specification and in the character names list was differently and more strongly worded, leading to contradictory interpretations.

Given this ambiguity of intent, in 2013 the UTC issued Corrigendum #9, which deleted the phrase "and that should never be interchanged" from the definition of noncharacters, to make it clear that prohibition from interchange is not part of the formal definition of noncharacters. Corrigendum #9 has been incorporated into the core specification for Unicode 7.0.

Q: Are noncharacters invalid in Unicode strings and UTFs?

A: Absolutely not. Noncharacters do not cause a Unicode string to be ill-formed in any UTF. This can be seen explicitly in the table above, where every noncharacter code point has a well-formed representation in UTF-32, in UTF-16, and in UTF-8. An implementation which converts noncharacter code points between one UTF representation and another must preserve these values correctly. The fact that they are called "noncharacters" and are not intended for open interchange does not mean that they are somehow illegal or invalid code points which make strings containing them invalid.

Q: So how should libraries and tools handle noncharacters?

A: Library APIs, components, and tool applications (such as low-level text editors) which handle all Unicode strings should also handle noncharacters. Often this means simple pass-through, the same way such an API or tool would handle a reserved unassigned code point. Such APIs and tools would not normally be expected to interpret the semantics of noncharacters, precisely because the intended use of a noncharacter is internal. But an API or tool should also not arbitrarily filter out, convert, or otherwise discard the value of noncharacters, any more than they would do for private-use characters or reserved unassigned code points.

(Emphasis mine.)

If it's OK, I'll submit a pull request that removes special handling of non-characters in utf8_iterate which will result in simply passing them through. This should make utf8proc_valid and utf8proc_iterate agree on the validity of code points, solving the root cause of this issue. Checking the return value of utf8proc_iterate is a good idea nevertheless.

jgm commented 9 years ago

+++ Nick Wellnhofer [Apr 16 15 05:13 ]:

If it's OK, I'll submit a pull request that removes special handling of non-characters in utf8_iterate which will result in simply passing them through. This should make utf8proc_valid and utf8proc_iterate agree on the validity of code points, solving the root cause of this issue. Checking the return value of utf8proc_iterate is a good idea nevertheless.

Yes, I like this idea! Please go ahead.