bitextor / warc2text

Extracts plain text, language identification and more metadata from WARC records
MIT License

Optimize regex filters #28

Closed jelmervdl closed 3 years ago

jelmervdl commented 3 years ago

My url filter implementation is really slowing down warc2text right now, being the slowest part of the process after decompression (no. 1) and uchardet (no. 2) 😅

I've tried a number of optimisations, and this combination seems to be the fastest:

Runtimes on a single wide6 warc:

| Setting | Time |
| --- | --- |
| master with url filter* (35 entries) | 1:15 |
| master with url filter** (1 entry) | 0:52 |
| master without url filter | 0:47 |
| this branch with url filter (35 entries) | 0:50 |
| this branch with url filter** (1 entry) | 0:48 |

*) The url filter list I used matched only 3 of the 29223 documents, so the number of filtered out documents should have little effect on the numbers above.

**) The single entry url filter is the previous filter, but merged in the form of ^(https?:)?(//)?((pattern1)|(pattern2)|...)
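For illustration, here is a minimal sketch of how a list of patterns could be merged into the single-entry form described above. It is written in Python and is not warc2text's own code; the helper names `merge_url_patterns` and `is_filtered` and the two example patterns are hypothetical.

```python
import re


def merge_url_patterns(patterns):
    """Combine several URL patterns into one anchored alternation.

    Produces ^(https?:)?(//)?((pattern1)|(pattern2)|...), so the optional
    scheme prefix is matched once instead of once per entry.
    """
    alternation = "|".join(f"({p})" for p in patterns)
    return f"^(https?:)?(//)?({alternation})"


# Hypothetical filter entries, not taken from the filter list used above.
patterns = [r"example\.com/ads/", r"tracker\.example\.org"]
merged = re.compile(merge_url_patterns(patterns))


def is_filtered(url):
    """Return True if the URL matches any of the merged patterns."""
    return merged.match(url) is not None
```

With this form, each URL is tested against one compiled expression rather than looping over all 35 entries.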

I'm a bit sad that 35 entries vs 1 entry still makes such a difference. I had hoped that with the optimize flag they would collapse into similar state machines, but that does not seem to happen.