commonmark / cmark

CommonMark parsing and rendering library and program in C

U+FFFE and U+FFFF encoded wrongly #548

Closed nwellnhof closed 5 months ago

nwellnhof commented 5 months ago

cmark_utf8proc_encode_char was copied from an old version of the utf8proc project and, for whatever reason, special-cases U+FFFE and U+FFFF, so these code points are serialized as invalid UTF-8. This can be triggered when parsing numeric character references and with some renderers:

% python3 -c 'print(chr(0xFFFF))' |build/src/cmark -t commonmark |hexdump -C
00000000  ff 0a                                             |..|
00000002
% echo '&#xFFFF;' |build/src/cmark |hexdump -C
00000000  3c 70 3e ff 3c 2f 70 3e  0a                       |<p>.</p>.|
00000009

The expected UTF-8 sequence for U+FFFF is EF BF BF.
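
For reference, a minimal sketch of a plain UTF-8 encoder with no special case for U+FFFE or U+FFFF. The name `encode_utf8` and its signature are hypothetical, not the actual cmark or utf8proc API, and returning 0 for surrogates and for values above U+10FFFF is an assumption of this sketch rather than what cmark does.

```c
#include <stdint.h>

/* Hypothetical helper (not cmark's cmark_utf8proc_encode_char):
 * encode one code point as UTF-8 into dst, which must hold 4 bytes.
 * Returns the number of bytes written, or 0 for a surrogate or a
 * value outside U+0000..U+10FFFF (an assumption of this sketch). */
static int encode_utf8(int32_t cp, uint8_t *dst) {
  if (cp < 0) {
    return 0;
  }
  if (cp < 0x80) {
    dst[0] = (uint8_t)cp;
    return 1;
  }
  if (cp < 0x800) {
    dst[0] = (uint8_t)(0xC0 | (cp >> 6));
    dst[1] = (uint8_t)(0x80 | (cp & 0x3F));
    return 2;
  }
  if (cp < 0x10000) {
    if (cp >= 0xD800 && cp <= 0xDFFF) {
      return 0; /* surrogates are not encodable */
    }
    /* U+FFFE and U+FFFF take this ordinary three-byte path */
    dst[0] = (uint8_t)(0xE0 | (cp >> 12));
    dst[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    dst[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
  }
  if (cp < 0x110000) {
    dst[0] = (uint8_t)(0xF0 | (cp >> 18));
    dst[1] = (uint8_t)(0x80 | ((cp >> 12) & 0x3F));
    dst[2] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    dst[3] = (uint8_t)(0x80 | (cp & 0x3F));
    return 4;
  }
  return 0;
}
```

With this sketch, `encode_utf8(0xFFFF, buf)` returns 3 and fills `buf` with EF BF BF, and U+FFFE yields EF BF BE.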