Open Esowteric opened 5 years ago
This is the script I used:
<script type="text/javascript">
var text = '... html source ...';
// Escape angle brackets so the source is displayed as text instead of being rendered.
var text_esc = text;
text_esc = text_esc.replace(/\</g, "&lt;");
text_esc = text_esc.replace(/\>/g, "&gt;");
var regex = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi;
var found = text.match(regex);
var found_len = found.length;
document.write("Text: " + text_esc + "<br /><br />" + "Regex pattern: " + regex + "<br /><br />");
document.write("Matches: " + found_len + "<br /><br />");
for (var i=0;i<found_len;i++)
{
found[i] = found[i].replace(/\</g, "&lt;");
found[i] = found[i].replace(/\>/g, "&gt;");
document.write("[" + i + "]: " + found[i] + "<br /><br />");
}
</script>
The tagRegExp match is the first stage of the process: it pulls all tags out of the DOM source into an array, before specific tags are looked up using getElementsByTagName, getAttribute, etc.
Many thanks to Wiktor Stribiżew at Stack Overflow for this solution:
tagRegExp in /lib/Dom.js:
/(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:'[^']*'|"[^"]*"))*\s*\/?>)|[^<]*(?:<(?![a-z\/])[^<]*)*/gi
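As a rough illustration of what this first stage produces, the sketch below runs the revised pattern over a small made-up HTML string. The `tokenize` helper and the sample input are assumptions for the example and are not part of dom-parser; only the regular expression is copied from the line above.

```javascript
// Revised tagRegExp, copied verbatim from the line above.
const tagRegExp = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:'[^']*'|"[^"]*"))*\s*\/?>)|[^<]*(?:<(?![a-z\/])[^<]*)*/gi;

// Hypothetical helper: split an HTML string into tag and text tokens.
function tokenize(html) {
  // With the g flag, match() returns every match. The second alternative
  // can match the empty string (for example at the end of input), so
  // empty tokens are filtered out.
  const found = html.match(tagRegExp) || [];
  return found.filter(function (token) {
    return token.length > 0;
  });
}

// Made-up sample: a tag with an attribute, text containing a bare "<",
// and a closing tag.
const sample = '<p class="x">1 < 2</p>';
const tokens = tokenize(sample);

tokens.forEach(function (token, i) {
  console.log("[" + i + "]: " + token);
});
```

The bare `<` in the text is kept inside the text token because the lookahead `(?![a-z\/])` only treats `<` as the start of a tag when a letter or `/` follows it.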
@Esowteric thx! I will integrate this solution soon.
The page at the following URL causes a Node.js app to hang when it is matched using dom-parser.
DOM source from: https://www.ecosia.org/
I created a simple vanilla JavaScript match script and tested the web page source against tagRegExp, and that script also hung. Could this be catastrophic backtracking?
tagRegExp: /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi
Thanks.
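One way to investigate a suspected backtracking problem is to time the match on growing prefixes of the input and watch how the time scales. The sketch below is only a diagnostic idea under stated assumptions: `timeMatch` and `probe` are hypothetical helpers, the sample input is made up and harmless, and the step and time limit are arbitrary. A single `match()` call cannot be interrupted, so a prefix that itself triggers the hang will still hang; small steps only give an early warning.

```javascript
// Original tagRegExp from this report.
const originalTagRegExp = /(<\/?[a-z][a-z0-9]*(?::[a-z][a-z0-9]*)?\s*(?:\s+[a-z0-9-_]+=(?:(?:'[\s\S]*?')|(?:"[\s\S]*?")))*\s*\/?>)|([^<]|<(?![a-z\/]))*/gi;

// Hypothetical helper: time one global match of `regex` against `text`.
function timeMatch(regex, text) {
  const start = Date.now();
  const matches = text.match(regex) || [];
  return { ms: Date.now() - start, count: matches.length };
}

// Hypothetical helper: match growing prefixes of `source`, `step`
// characters at a time, and stop once a single prefix takes longer
// than `limitMs` milliseconds.
function probe(regex, source, step, limitMs) {
  const results = [];
  for (let len = step; len <= source.length; len += step) {
    const r = timeMatch(regex, source.slice(0, len));
    results.push({ len: len, ms: r.ms, count: r.count });
    if (r.ms > limitMs) {
      break; // do not try the next, larger prefix
    }
  }
  return results;
}

// Made-up, well-formed input (1100 characters); replace it with the
// real page source to look for the point where the time starts to grow.
const sample = '<p class="a">hello</p>'.repeat(50);
const results = probe(originalTagRegExp, sample, 220, 1000);

results.forEach(function (r) {
  console.log(r.len + " chars: " + r.ms + " ms, " + r.count + " matches");
});
```

If the time per prefix grows much faster than the prefix length, that points towards a backtracking problem in the pattern rather than in the surrounding code.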