adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Focused crawler returns 404 response for robots.txt and stops crawling #726

Closed Guthman closed 1 month ago

Guthman commented 1 month ago
import urllib.robotparser

from trafilatura.spider import focused_crawler

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass

crawl_start_url = 'https://nicegui.io/documentation'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1, rules=IgnoreRobotFileParser())
to_visit, known_links = focused_crawler(crawl_start_url, max_seen_urls=1000, max_known_urls=100000, todo=to_visit, known_links=known_links)

Result: ERROR:trafilatura.downloads:not a 200 response: 404 for URL https://nicegui.io/robots.txt

Not overriding the RobotFileParser class (i.e. passing nothing to the rules parameter of focused_crawler()) produces the same result.
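For reference, the stdlib behaviour the override works around can be checked locally without any network access: a urllib.robotparser.RobotFileParser that has never successfully read a robots.txt denies everything by default (its last_checked timestamp is unset), whereas the subclass from the report permits everything. A minimal sketch, independent of trafilatura:

```python
import urllib.robotparser

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    """Variant from the report: permit every URL and never fetch robots.txt."""
    def can_fetch(self, useragent, url):
        return True

    def read(self):
        pass

default = urllib.robotparser.RobotFileParser()
permissive = IgnoreRobotFileParser()

# A parser that has never read a robots.txt denies everything,
# while the override always permits.
print(default.can_fetch("mybot", "https://nicegui.io/documentation"))     # False
print(permissive.can_fetch("mybot", "https://nicegui.io/documentation"))  # True
```

Whether trafilatura's crawler actually consults these rules at the point of failure is a separate question; this only illustrates the default-deny behaviour of an unread parser.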

adbar commented 1 month ago

Hi @Guthman, the website you're trying to crawl simply doesn't have internal links: it's a single-page web app, so there is nothing to crawl. The 404 error is irrelevant here.
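One way to confirm this diagnosis for any page is to parse its HTML and count same-host anchors, since those are what a focused crawler would follow. A stdlib-only sketch (the sample markup below is illustrative, not the actual nicegui.io HTML):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(html, base_url):
    """Return absolute same-host links found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [u for u in absolute if urlparse(u).netloc == host]

# Illustrative single-page-app markup: no same-host anchors to follow.
sample = '<div id="app"></div><a href="https://github.com/zauberzeug/nicegui">GitHub</a>'
print(internal_links(sample, "https://nicegui.io/documentation"))  # []
```

An empty result means the crawler has no in-domain URLs to enqueue, regardless of how robots.txt is handled.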