Closed nwellnhof closed 9 years ago
Some more notes: There is already some limited validation done implicitly by utf8proc_detab via its use of utf8proc_charlen. Byte sequences starting with \x80 are accepted because of an off-by-one error in utf8proc_detab. Overlong encodings, code points above 0x10FFFF, surrogates, and non-characters are handled in utf8proc_iterate, but this function isn't used by utf8proc_detab. Non-characters are not invalid UTF-8, so I don't see a need to check for them.
libcmark doesn't validate its UTF-8 input and passes invalid UTF-8 byte sequences, such as overlong encodings and other malformed sequences, through to its output.
Invalid UTF-8 might be used to bypass security validations, so this is a security risk.
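To make the categories discussed above concrete, here is a minimal sketch of a standalone validator in C. The function name is_valid_utf8 is hypothetical and is not part of libcmark or utf8proc; it only illustrates the checks mentioned in this thread (stray continuation bytes such as \x80, truncated sequences, overlong encodings, surrogates, and code points above 0x10FFFF). Non-characters are deliberately accepted, in line with the note above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper for illustration; not a libcmark or utf8proc API.
 * Returns 1 if buf[0..len) is well-formed UTF-8, 0 otherwise. */
static int is_valid_utf8(const uint8_t *buf, size_t len)
{
    size_t i = 0;

    while (i < len) {
        uint8_t b = buf[i];
        size_t n;     /* number of continuation bytes expected */
        uint32_t cp;  /* code point being decoded */
        uint32_t min; /* smallest code point that needs this length */

        if (b < 0x80) { /* ASCII */
            i++;
            continue;
        } else if ((b & 0xE0) == 0xC0) {
            n = 1; cp = b & 0x1F; min = 0x80;
        } else if ((b & 0xF0) == 0xE0) {
            n = 2; cp = b & 0x0F; min = 0x800;
        } else if ((b & 0xF8) == 0xF0) {
            n = 3; cp = b & 0x07; min = 0x10000;
        } else {
            /* Stray continuation byte (0x80..0xBF) or invalid lead byte. */
            return 0;
        }

        if (len - i - 1 < n)
            return 0; /* truncated sequence */

        for (size_t k = 1; k <= n; k++) {
            uint8_t c = buf[i + k];
            if ((c & 0xC0) != 0x80)
                return 0; /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min)
            return 0; /* overlong encoding */
        if (cp > 0x10FFFF)
            return 0; /* beyond the Unicode range */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0; /* UTF-16 surrogate */

        i += n + 1;
    }
    return 1;
}
```

For example, the two-byte sequence \xC0\xAF is an overlong encoding of "/" (U+002F). A filter that only looks for the single byte 0x2F would miss it, which is the kind of bypass the security concern refers to. The sketch above rejects that sequence.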