adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

spider: restrict search to given URL pattern #672

Closed adbar closed 3 months ago

adbar commented 3 months ago

Both on the CLI and in Python, the spider component stores and retrieves URLs that may be out of scope when the input URL is restricted to a portion of a domain, e.g. https://www.example.org/news/en/.

This behavior should be further investigated, tested and/or improved.
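
To make the expected behavior concrete, here is a minimal sketch of a scope check based on plain path-prefix matching. It uses only the standard library; the helper names `is_in_scope` and `filter_in_scope` are hypothetical and not part of trafilatura, and the matching rule (same host, path under the input URL's path) is an assumption about what "in scope" should mean here.

```python
from urllib.parse import urlsplit


def is_in_scope(url: str, base_url: str) -> bool:
    """Hypothetical helper: True if url is on the same host as base_url
    and its path lies under the path of base_url. The scheme is ignored."""
    base, candidate = urlsplit(base_url), urlsplit(url)
    if candidate.netloc.lower() != base.netloc.lower():
        return False
    # Compare on "/"-terminated paths so that "/news/en" does not match "/news/english"
    base_path = base.path if base.path.endswith("/") else base.path + "/"
    cand_path = candidate.path if candidate.path.endswith("/") else candidate.path + "/"
    return cand_path.startswith(base_path)


def filter_in_scope(urls, base_url):
    """Keep only the URLs that fall under base_url, preserving order."""
    return [u for u in urls if is_in_scope(u, base_url)]


if __name__ == "__main__":
    found = [
        "https://www.example.org/news/en/article-1.html",
        "https://www.example.org/news/de/artikel-1.html",
        "https://www.example.org/about/",
    ]
    print(filter_in_scope(found, "https://www.example.org/news/en/"))
```

A check of this kind could be applied both when URLs are stored and when they are retrieved, so that out-of-scope links never enter the queue; whether the spider should work this way is exactly what remains to be investigated.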