Open thsunkid opened 5 days ago
Hi @thsunkid, thanks for the detailed report and the example. We're talking about a web page which is very large (> 8 MB) and contains a lot of similar elements. This is not a representative example, as it is fairly exceptional, so it doesn't seem worth it to change the pipeline just for such cases.
You can adapt the function's arguments to your use case; see the "Optimizing for speed" section of the docs.
You can also have a look at the list of cleaned elements in `settings.py` if you want to speed things up a bit, but everything takes more time since the web page is unusually large.
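To illustrate the first suggestion, here is a minimal sketch of speed-oriented keyword arguments. The selection below is an assumption based on trafilatura's documented options and may differ across versions, so check the "Optimizing for speed" docs for your release:

```python
# Keyword arguments that typically trade recall for speed (hypothetical
# selection; verify the exact names against your trafilatura version).
SPEED_KWARGS = {
    "no_fallback": True,        # skip the slower fallback extractors
    "include_comments": False,  # don't extract comment sections
    "include_tables": False,    # skip table conversion
    "include_images": False,    # skip image handling
    "include_links": False,     # skip link extraction
}

def fast_extract(html: str):
    """Run bare_extraction with speed-oriented settings.

    The import is done lazily so the sketch can be read and tested even
    where trafilatura is not installed.
    """
    from trafilatura import bare_extraction  # assumed available at call time
    return bare_extraction(html, **SPEED_KWARGS)
```

Whether each flag is worth disabling depends on which parts of the output you actually need; dropping comments and tables usually removes the most work on large pages.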
When profiling the `trafilatura.bare_extraction` method for some pages that took us a while to parse, I found significant performance issues in the `extract_content` method.

**Root cause:** too many calls to `prune_unwanted_node`, with each call taking up to ~200 ms.

**Steps to reproduce:**

1. Call `bare_extraction` on the page.
2. Profile `prune_unwanted_node` timing.

**Questions**
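For reference, per-function call counts like the ones reported above can be gathered with a small `cProfile` harness. The pruning function below is a hypothetical stand-in so the sketch runs without trafilatura installed; in practice you would profile `bare_extraction` itself:

```python
import cProfile
import pstats

def prune_unwanted_node_stub(n=1000):
    # Stand-in for the expensive pruning step (placeholder workload,
    # not trafilatura's real prune_unwanted_node).
    return sum(i * i for i in range(n))

def extract_content_stub():
    # Mimics extract_content invoking the pruning helper once per node.
    for _ in range(200):
        prune_unwanted_node_stub()

def call_counts(func):
    """Profile func and map each profiled function name to its call count."""
    profiler = cProfile.Profile()
    profiler.runcall(func)
    stats = pstats.Stats(profiler)
    # stats.stats maps (file, line, name) -> (cc, nc, tt, ct, callers)
    return {
        name: nc
        for (_file, _line, name), (cc, nc, tt, ct, callers) in stats.stats.items()
    }

counts = call_counts(extract_content_stub)
# counts["prune_unwanted_node_stub"] reflects how often the pruning step ran.
```

Sorting the same `pstats.Stats` object with `sort_stats("cumulative").print_stats(10)` additionally shows per-call and cumulative times, which is how a ~200 ms per-call hotspot would surface.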