ArchiveTeam / grab-site

The archivist's web crawler: WARC output, dashboard for all crawls, dynamic ignore patterns

grab-site spends a lot of time in dupespotter #134

Open ivan opened 5 years ago

ivan commented 5 years ago

With grab-site 2.x, a crawl of Twitter spends about 25% of its non-idle time in dupespotter, doing various re.subs.

ivan commented 5 years ago

Times to run dupespotter's test suite:

as-is with many re.sub: 0.79 seconds
combined regexps and a few re.subs: 3 seconds
combined regexps and re2 with hand-rolled sub: 6 seconds

Not encouraging as every change makes it slower.

I left the changes in the re2 branch.
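
For readers who want to reproduce this kind of comparison, here is a minimal sketch of the two approaches: a chain of separate re.sub passes versus one combined alternation with a replacement callback. The rules, names, and sample text are hypothetical and are not dupespotter's actual patterns; the rules were chosen so they do not interact, which means both functions return the same result. One plausible factor in the slowdown, not confirmed in this thread, is that the combined version calls a Python function for every match.

```python
import re
import timeit

# Hypothetical normalization rules for illustration only;
# these are NOT dupespotter's real patterns.
RULES = [
    ("digits", r"\d+", "0"),
    ("space", r"\s+", " "),
    ("hexrun", r"[A-F]{4,}", "X"),
]

# Approach 1: one compiled pattern per rule, applied in sequence.
COMPILED = [(re.compile(pattern), repl) for _, pattern, repl in RULES]

# Approach 2: a single alternation of named groups, dispatched by group name.
COMBINED = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern, _ in RULES))
REPLACEMENTS = {name: repl for name, _, repl in RULES}


def normalize_sequential(text):
    """Run each rule as its own re.sub pass over the whole text."""
    for pattern, repl in COMPILED:
        text = pattern.sub(repl, text)
    return text


def normalize_combined(text):
    """Run one pass; the callback picks the replacement for the group that matched."""
    return COMBINED.sub(lambda m: REPLACEMENTS[m.lastgroup], text)


if __name__ == "__main__":
    sample = "id=123  token=ABCDEF\n end 45 " * 2000
    assert normalize_sequential(sample) == normalize_combined(sample)
    for fn in (normalize_sequential, normalize_combined):
        elapsed = timeit.timeit(lambda: fn(sample), number=20)
        print(f"{fn.__name__}: {elapsed:.3f}s")
```

Timings will vary by machine, Python version, and input, so treat any numbers from this sketch as indicative only.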