My url filter implementation is really slowing down warc2text right now; it's the slowest part of the process after decompression (no. 1) and uchardet (no. 2).
I've tried a number of optimisations, and this combination seems to be the fastest:
- Skip storing submatches: saves a lot of tiny allocations
- Combine all regular expressions into one large one for the URL filter (see the sketch at the end of this post)
- Switch out std::regex for boost::regex (which seems to be quite a bit better at optimising)
Runtimes on a single wide6 warc:
| Setting | Time (m:ss) |
| --- | --- |
| master with url filter* (35 entries) | 1:15 |
| master with url filter** (1 entry) | 0:52 |
| master without url filter | 0:47 |
| this branch with url filter* (35 entries) | 0:50 |
| this branch with url filter** (1 entry) | 0:48 |
\*) The url filter list I used matched only 3 of the 29223 documents, so the number of filtered-out documents should have little effect on the numbers above.

\*\*) The single-entry url filter is the previous filter, but merged into the form `^(https?:)?(//)?((pattern1)|(pattern2)|...)`
I'm a bit sad that 35 entries vs. 1 entry still makes such a difference. I had hoped that, with the `optimize` flag, they would collapse into similar state machines, but that does not seem to happen.
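
For illustration, here is a minimal sketch of the combined-filter idea using boost::regex. The pattern list, function names, and the exact way the alternation is assembled are my own illustration rather than the actual warc2text code; it just merges the patterns into the `^(https?:)?(//)?(...)` form above, compiles with the `optimize` flag, and matches without requesting submatch results.

```cpp
// Illustrative sketch only, not the actual warc2text implementation.
#include <boost/regex.hpp>

#include <iostream>
#include <string>
#include <vector>

// Merge the individual patterns into one alternation of the form
// ^(https?:)?(//)?((pattern1)|(pattern2)|...)
boost::regex build_url_filter(const std::vector<std::string>& patterns) {
    std::string combined = "^(https?:)?(//)?(";
    for (std::size_t i = 0; i < patterns.size(); ++i) {
        if (i != 0) combined += '|';
        combined += '(' + patterns[i] + ')';
    }
    combined += ')';
    // Perl syntax plus the optimize flag mentioned above.
    return boost::regex(combined, boost::regex::perl | boost::regex::optimize);
}

bool url_is_filtered(const std::string& url, const boost::regex& filter) {
    // Use the regex_search overload without a match_results argument, so no
    // submatch results are requested, in line with the "skip storing
    // submatches" point above.
    return boost::regex_search(url, filter);
}

int main() {
    const std::vector<std::string> patterns = {"example\\.com/ads/", "tracker\\.net/"};
    const boost::regex filter = build_url_filter(patterns);

    std::cout << url_is_filtered("https://example.com/ads/banner.gif", filter) << '\n'; // 1
    std::cout << url_is_filtered("https://example.org/index.html", filter) << '\n';     // 0
}
```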