commonmark / cmark

CommonMark parsing and rendering library and program in C

Fix quadratic behavior with inline HTML #380

Closed nwellnhof closed 3 years ago

nwellnhof commented 3 years ago

Repeated starting sequences like <?, <!DECL or <![CDATA[ could lead to quadratic behavior if no matching ending sequence was found. Separate the inline HTML scanners. Remember if scanning the whole input for a specific ending sequence failed and skip subsequent scans.

The basic idea is to remove suffixes >, ?> and ]]> from the respective regex. Since these regexes are already constructed to match lazily, they will stop before an ending sequence. To check whether an ending sequence was found, we can simply test whether the input buffer is large enough to hold the match plus a potential suffix. If the regex doesn't find the ending sequence, it will match so many characters that this test is guaranteed to fail. In this case, we set a flag to avoid further attempts to execute the regex.
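The idea in the paragraph above can be sketched for the `<?` ... `?>` case as follows. This is a minimal illustration, not the code from this PR: `scan_flags`, `scan_pi_body` and `match_pi` are hypothetical names, and a hand-written loop stands in for the generated lazy regex with its `?>` suffix removed.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-parse state (names are assumptions, not cmark's). */
typedef struct {
  bool no_pi_end; /* a scan for "?>" already ran to the end of input */
  int scans;      /* number of body scans performed, for illustration */
} scan_flags;

/* Stand-in for the lazy regex without its "?>" suffix: returns how many
 * bytes starting at `pos` come before the next "?>", or the number of
 * bytes left in the buffer if there is none. */
static size_t scan_pi_body(const char *buf, size_t len, size_t pos) {
  size_t i = pos;
  while (i < len && !(buf[i] == '?' && i + 1 < len && buf[i + 1] == '>'))
    i++;
  return i - pos;
}

/* `pos` points just past "<?". Returns the length of the whole
 * "<?...?>" construct, or 0 if there is no match. */
static size_t match_pi(scan_flags *f, const char *buf, size_t len,
                       size_t pos) {
  if (f->no_pi_end)
    return 0; /* an earlier scan failed; nothing later can succeed */
  f->scans++;
  size_t n = scan_pi_body(buf, len, pos);
  if (pos + n + 2 > len) { /* no room left for the "?>" suffix */
    f->no_pi_end = true;
    return 0;
  }
  return n + 4; /* body + "<?" prefix + "?>" suffix in one addition */
}
```

Because a failed scan has looked at everything up to the end of the input, no later starting position can find the ending sequence either, which is why a single flag is enough to skip all subsequent attempts.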

To decide which inline HTML regex to use, we inspect the start of the text buffer. This allows some fixed characters to be removed from the start of some regexes. `matchlen` is then adjusted with a single addition that accounts for both the relevant prefix and suffix.
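A rough sketch of that dispatch step, under stated assumptions: the enum `html_kind` and the helpers `classify`, `affix_len` and `full_match_len` are hypothetical names invented for illustration, and treating `<!` followed by an ASCII letter as a declaration is an assumption about the grammar rather than something taken from this PR.

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical classification of what follows '<'. */
typedef enum { HTML_OTHER, HTML_PI, HTML_CDATA, HTML_DECL } html_kind;

/* `s` points just past '<'; `n` is the number of bytes remaining. */
static html_kind classify(const char *s, size_t n) {
  if (n >= 1 && s[0] == '?')
    return HTML_PI;
  if (n >= 8 && memcmp(s, "![CDATA[", 8) == 0)
    return HTML_CDATA;
  if (n >= 2 && s[0] == '!' && isalpha((unsigned char)s[1]))
    return HTML_DECL;
  return HTML_OTHER; /* tags, comments: not covered by this sketch */
}

/* Fixed prefix (including '<') plus fixed suffix, in bytes. */
static size_t affix_len(html_kind k) {
  switch (k) {
  case HTML_PI:
    return 2 + 2; /* "<?" and "?>" */
  case HTML_CDATA:
    return 9 + 3; /* "<![CDATA[" and "]]>" */
  case HTML_DECL:
    return 2 + 1; /* "<!" and ">" */
  default:
    return 0;
  }
}

/* Total match length from the body length the scanner reported. */
static size_t full_match_len(html_kind k, size_t body_len) {
  return body_len + affix_len(k); /* the single addition */
}
```

Folding the prefix and suffix lengths into one constant per kind keeps the scanner itself responsible only for the variable-length body.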

Fixes #299.