ershov-konst / dom-parser

Fast dom parser based on regexps
ISC License

tagRegExp hangs with certain URLs: Catastrophic backtracking? #11

Open Esowteric opened 5 years ago

Esowteric commented 5 years ago

The source of the following page causes a Node.js app to hang when it is matched using dom-parser.

DOM source from: https://www.ecosia.org/

I wrote a simple vanilla JavaScript script that matches the page source against tagRegExp, and that script hung as well. Could this be catastrophic backtracking?

tagRegExp: /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi

Thanks.

Esowteric commented 5 years ago

This is the script I used:

<script type="text/javascript">
var text = '... html source ...';
// Escape angle brackets so the source can be displayed as text.
var text_esc = text;
text_esc = text_esc.replace(/</g, "&lt;");
text_esc = text_esc.replace(/>/g, "&gt;");
var regex = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi;
// match() returns null when nothing matches, so fall back to an empty array.
var found = text.match(regex) || [];
var found_len = found.length;

document.write("Text: " + text_esc + "<br /><br />" + "Regex pattern: " + regex + "<br /><br />");

document.write("Matches: " + found_len + "<br /><br />");

for (var i = 0; i < found_len; i++)
{
    // Escape each match before writing it into the page.
    found[i] = found[i].replace(/</g, "&lt;");
    found[i] = found[i].replace(/>/g, "&gt;");

    document.write("[" + i + "]: " + found[i] + "<br /><br />");
}
</script>

Esowteric commented 5 years ago

The tagRegExp match is the first stage in the process, to pull out all tags from the DOM into an array, before looking for specific tags using getElementsByTagName, getAttribute, etc.
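
To illustrate that first stage, here is a minimal sketch on a tiny, well-formed input. It uses the pattern from the script above. The helper name tokenize and the filtering of empty matches are my own assumptions for illustration; dom-parser's actual internals may differ.

```javascript
// Pattern copied from the test script in this issue.
const tagRegExp = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi;

// Hypothetical helper: split markup into tag tokens and text tokens.
// The second alternative can also match the empty string, so empty
// matches are filtered out here.
function tokenize(html) {
  return (html.match(tagRegExp) || []).filter((token) => token !== "");
}

const tokens = tokenize('<p class="a">Hi</p>');
console.log(tokens); // opening tag, text, closing tag
```

Later lookups such as getElementsByTagName would then work from an array like this rather than from the raw string.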

Esowteric commented 5 years ago

Many thanks to Wiktor Stribiżew at Stack Overflow for this solution:

tagRegExp in /lib/Dom.js:

/(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:'[^']*'|"[^"]*"))*\s*\/?>)|[^<]*(?:<(?![a-z\/])[^<]*)*/gi

See: https://stackoverflow.com/questions/54543223/node-js-dom-parser-tagregexp-regex-match-hangs-catastrophic-backtracking
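
For anyone who wants to compare the two patterns, here is a rough sketch. My reading, which may not match the linked answer exactly, is that the lazy [\s\S]*? inside the repeated attribute group lets a quoted value run past its own closing quote, so a tag that never closes can be retried in a rapidly growing number of ways, whereas [^']* and [^"]* cannot cross a quote. The helper names tokens, brokenTag and elapsedMs, and the synthetic input, are hypothetical; timings will vary by engine and machine.

```javascript
// Original pattern from this issue.
const originalTagRegExp = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi;
// Replacement pattern quoted above.
const fixedTagRegExp = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:'[^']*'|"[^"]*"))*\s*\/?>)|[^<]*(?:<(?![a-z\/])[^<]*)*/gi;

// Hypothetical helper: non-empty matches only.
function tokens(regex, html) {
  return (html.match(regex) || []).filter((t) => t !== "");
}

// Hypothetical worst case: n quoted attributes followed by a character
// that prevents the tag from ever closing.
function brokenTag(n) {
  let attrs = "";
  for (let i = 0; i < n; i++) attrs += " a" + i + "='v'";
  return "<div" + attrs + " !";
}

// Wall-clock time of one global match, in milliseconds.
function elapsedMs(regex, input) {
  const start = Date.now();
  input.match(regex);
  return Date.now() - start;
}

// Keep n small: the original pattern slows down quickly as n grows.
for (const n of [10, 14, 18]) {
  const input = brokenTag(n);
  console.log(
    n + " attrs:",
    "original " + elapsedMs(originalTagRegExp, input) + " ms,",
    "fixed " + elapsedMs(fixedTagRegExp, input) + " ms"
  );
}
```

On well-formed markup both patterns should yield the same tokens, so the change is meant to affect only how quickly a failed tag match is abandoned.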

ershov-konst commented 5 years ago

@Esowteric thx! I will integrate this solution soon.