ageitgey / node-unfluff

Automatically extract body content (and other cool stuff) from an html document
Apache License 2.0
2.15k stars, 221 forks

Bad regex causing very slow execution #112

Open kduffie opened 4 years ago

kduffie commented 4 years ago

If you run unfluff on the body of the following webpage, https://craftsbyamanda.com/vibrant-button-tree-on-canvas-a-giveaway/, it takes more than 10 seconds on a fast Mac.

The problem has been isolated to line 22 of lib/extractor.js which takes around 10s to execute when operating on the contents of that particular webpage:

copyright = text.replace(/.*?©(\s*copyright)?([^,;:.|\r\n]+).*/gi, '$2').trim();
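One plausible explanation, which is an assumption since the root cause is not confirmed here: the pattern starts with a lazy `.*?`, so whenever a match attempt fails, a backtracking engine may retry from the next character and rescan the rest of the text, which can become quadratic on a long body. A minimal sketch of a way around that is to find each `©` with `indexOf` and run only the tail of the pattern, anchored at that position with the sticky flag. `extractCopyright` is a hypothetical helper, not part of unfluff, and unlike the original line it returns `null` when nothing matches instead of returning the trimmed input.

```javascript
// Hypothetical helper (not unfluff's API): pull out the text that follows "©"
// without a leading lazy ".*?" that may be retried from every start position.
function extractCopyright(text) {
  // Same tail as the original pattern. The sticky flag (y) forces the match
  // to start exactly at lastIndex, so nothing before "©" is ever rescanned.
  const tail = /©(\s*copyright)?([^,;:.|\r\n]+)/iy;

  let at = text.indexOf('©');
  while (at !== -1) {
    tail.lastIndex = at;
    const m = tail.exec(text);
    if (m) {
      // Group 2 is the text after "©" and the optional word "copyright".
      return m[2].trim();
    }
    // This "©" had no usable text after it; try the next one.
    at = text.indexOf('©', at + 1);
  }
  return null; // Assumption: callers treat "no copyright found" as null.
}
```

Each `©` is visited once and the character class consumes text only up to the first delimiter, so the work should grow roughly linearly with the input length. Timing both versions on the page above would be needed to confirm this is what causes the slowdown.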