Optimize regex filters

My url filter implementation is really slowing down warc2text right now: it is the slowest part of the process after decompression (no. 1) and uchardet (no. 2) 😅

I've tried a number of optimisations, and this combination seems to be the fastest:

- Skip storing submatches: saves a lot of tiny allocations
- Combine all regular expressions into a large one for the URL filter
- Switch out std::regex for boost::regex (which seems to be quite a bit better at optimising)
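As a rough illustration of the first point, here is a minimal sketch of a filter compiled without submatch storage. It uses std::regex so it builds with only the standard library; the function names `compile_filter` and `url_is_filtered` are hypothetical and not taken from warc2text. boost::regex accepts the same `nosubs` and `optimize` flags, so swapping the type should be mostly mechanical, but that is an assumption about this code base rather than something shown here.

```cpp
#include <regex>
#include <string>

// Hypothetical helper (not from warc2text): compile a URL filter pattern
// without capture-group bookkeeping.
// - nosubs:   groups are treated as non-capturing, so no submatches are stored
// - optimize: ask the library to favour matching speed over construction time
inline std::regex compile_filter(const std::string& pattern) {
    return std::regex(pattern,
                      std::regex::ECMAScript | std::regex::nosubs | std::regex::optimize);
}

// Only a yes/no answer is needed, so call the regex_search overload that
// takes no match_results argument.
inline bool url_is_filtered(const std::regex& filter, const std::string& url) {
    return std::regex_search(url, filter);
}
```

The regex is meant to be compiled once and reused for every record; constructing it per URL would likely cancel out any gain from the flags.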

Runtimes on a single wide6 warc:

| Setting | Time |
|---|---|
| master with url filter* (35 entries) | 1:15 |
| master with url filter** (1 entry) | 0:52 |
| master without url filter | 0:47 |
| this branch with url filter (35 entries) | 0:50 |
| this branch with url filter** (1 entry) | 0:48 |

*) The url filter list I used matched only 3 of the 29223 documents, so the number of filtered-out documents should have little effect on the numbers above.

**) The single entry url filter is the previous filter, but merged in the form of ^(https?:)?(//)?((pattern1)|(pattern2)|...)
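The merged form described in the second footnote can be produced mechanically. This is a minimal sketch, assuming each entry in the list is a bare pattern with no leading `^` or scheme of its own; `merge_url_patterns` is a hypothetical name, not a function from warc2text.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not from warc2text): join individual filter patterns
// into one expression of the form ^(https?:)?(//)?((p1)|(p2)|...).
// Each pattern is wrapped in its own group so that alternations inside one
// entry cannot leak into its neighbours.
inline std::string merge_url_patterns(const std::vector<std::string>& patterns) {
    std::string merged = "^(https?:)?(//)?(";
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (i > 0) merged += '|';
        merged += '(' + patterns[i] + ')';
    }
    merged += ')';
    return merged;
}
```

Note that an empty list yields an empty alternation, which matches every URL, so a caller would probably want to skip filtering entirely in that case.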

I'm a bit sad that there is still such a difference between 35 entries and 1 entry. I had hoped that with the optimize flag they would collapse into similar state machines, but this does not seem to happen.

bitextor / warc2text

Optimize regex filters #28