Performance bottleneck in `prune_unwanted_nodes` causing 200ms per call

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Apache License 2.0

3.67k stars 263 forks source link

When profiling trafilatura.bare_extraction method for some pages that took us a while to parse, I found that significant performance issues in extract_content method.

Root cause: Too many calls to prune_unwanted_node with each call taking upto ~200ms.

Steps to reproduce:

Run python with cProfile profiler on bare_extraction
Try to scrape and parse this page https://nlewo.github.io/nixos-manual-sphinx/generated/options-db.xml.html
Observe prune_unwanted_node timing

Questions

Are there existing configuration options or alternative methods in trafilatura to bypass/optimize this operation?
Could we leverage Rust/C++ based HTML parsing libraries (like lxml-rs or html5ever) to accelerate node pruning?
Could we leverage caching to reduce latency?

adbar / trafilatura

Performance bottleneck in `prune_unwanted_nodes` causing 200ms per call #750