commonmark / commonmark-spec

CommonMark spec, with reference implementations in C and JavaScript
http://commonmark.org

libcmark doesn't validate UTF-8 (security) #213

Closed: nwellnhof closed this issue 9 years ago

nwellnhof commented 9 years ago

libcmark doesn't validate its UTF-8 input and passes invalid UTF-8 byte sequences, such as overlong encodings, through to the output:

$ echo -e '\xc0\x80' |build/src/cmark |hexdump -C
00000000  3c 70 3e c0 80 3c 2f 70  3e 0a                    |<p>..</p>.|
0000000a

Or other invalid byte sequences:

$ echo -e 'a\x80b' |build/src/cmark |hexdump -C
00000000  3c 70 3e 61 80 62 3c 2f  70 3e 0a                 |<p>a.b</p>.|
0000000b

Invalid UTF-8 might be used to bypass security validations, so this is a security risk.
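
To make the risk concrete, here is a minimal sketch that is not taken from cmark. It assumes a hypothetical byte-level filter (`contains_slash`) placed in front of a hypothetical lenient decoder (`lenient_decode2`) that accepts overlong two-byte forms. The overlong sequence `\xc0\xaf` contains no `/` byte, yet the lenient decoder turns it into U+002F; a strict decoder (`strict_decode2`, also hypothetical) rejects it.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical byte-level filter: reports whether the raw input
 * contains the ASCII byte '/' (0x2F). */
static int contains_slash(const uint8_t *s, size_t len) {
    for (size_t i = 0; i < len; i++) {
        if (s[i] == '/') return 1;
    }
    return 0;
}

/* Hypothetical lenient decoder for a two-byte sequence. It checks the
 * lead and continuation bit patterns but does NOT reject overlong
 * forms. Returns the code point, or -1 on a malformed sequence. */
static int32_t lenient_decode2(const uint8_t *s, size_t len) {
    if (len < 2 || (s[0] & 0xE0) != 0xC0 || (s[1] & 0xC0) != 0x80) {
        return -1;
    }
    return ((int32_t)(s[0] & 0x1F) << 6) | (int32_t)(s[1] & 0x3F);
}

/* Strict variant: a two-byte sequence must encode at least U+0080,
 * so anything smaller is an overlong encoding and is rejected. */
static int32_t strict_decode2(const uint8_t *s, size_t len) {
    int32_t cp = lenient_decode2(s, len);
    return (cp >= 0 && cp < 0x80) ? -1 : cp;
}
```

With the input bytes `C0 AF`, `contains_slash` finds nothing to reject, while `lenient_decode2` yields `0x2F`. Any check that runs on raw bytes before a lenient decoder can be sidestepped this way.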

nwellnhof commented 9 years ago

Some more notes: There is already some limited validation done implicitly by utf8proc_detab via its use of utf8proc_charlen. Byte sequences starting with \x80 are accepted because of an off-by-one error in utf8proc_detab. Overlong encodings, code points above 0x10FFFF, surrogates, and non-characters are handled in utf8proc_iterate, but this function isn't used by utf8proc_detab. Non-characters are not invalid UTF-8, so I don't see a need to check for them.
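
The checks listed above can be sketched as a standalone validator. This is only an illustration under stated assumptions: `utf8_is_valid` is a hypothetical helper, not a cmark function and not the fix that was applied. It rejects stray continuation bytes, truncated sequences, overlong encodings, surrogates, and code points above 0x10FFFF, and it deliberately accepts non-characters.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper, not part of cmark: returns 1 if s[0..len) is
 * well-formed UTF-8 and 0 otherwise. */
static int utf8_is_valid(const uint8_t *s, size_t len) {
    size_t i = 0;
    while (i < len) {
        uint8_t b = s[i];
        size_t n;      /* number of continuation bytes expected */
        uint32_t cp;   /* code point being assembled */
        uint32_t min;  /* smallest code point allowed for this length */

        if (b < 0x80) { i++; continue; }   /* ASCII */
        else if (b < 0xC0) return 0;       /* stray continuation, e.g. \x80 */
        else if (b < 0xE0) { n = 1; cp = b & 0x1F; min = 0x80; }
        else if (b < 0xF0) { n = 2; cp = b & 0x0F; min = 0x800; }
        else if (b < 0xF8) { n = 3; cp = b & 0x07; min = 0x10000; }
        else return 0;                     /* 0xF8..0xFF never appear */

        if (len - i <= n) return 0;        /* truncated sequence */

        for (size_t k = 1; k <= n; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (uint32_t)(s[i + k] & 0x3F);
        }

        if (cp < min) return 0;                      /* overlong encoding */
        if (cp > 0x10FFFF) return 0;                 /* beyond Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF) return 0;  /* surrogate */
        /* Non-characters such as U+FFFF are valid UTF-8 and pass. */

        i += n + 1;
    }
    return 1;
}
```

Both inputs from the report, `\xc0\x80` and `a\x80b`, fail this check: the first as an overlong encoding, the second because of the stray continuation byte.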