adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Focused crawler returns 404 response for robots.txt and stops crawling #726

Closed Guthman closed 1 month ago

Guthman commented 1 month ago
import urllib.robotparser

from trafilatura.spider import focused_crawler

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass

crawl_start_url = 'https://nicegui.io/documentation'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1, rules=IgnoreRobotFileParser())
to_visit, known_links = focused_crawler(crawl_start_url, max_seen_urls=1000, max_known_urls=100000, todo=to_visit, known_links=known_links)

Result: ERROR:trafilatura.downloads:not a 200 response: 404 for URL https://nicegui.io/robots.txt

Not overriding the RobotFileParser class (i.e. passing nothing to the rules parameter of focused_crawler()) produces the same result.
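For reference, the stdlib behaviour the override works around can be checked locally without any network access: a urllib.robotparser.RobotFileParser that has never successfully read a robots.txt denies everything by default (its last_checked timestamp is unset), whereas the subclass from the report permits everything. A minimal sketch, independent of trafilatura:

```python
import urllib.robotparser

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    """Variant from the report: permit every URL and never fetch robots.txt."""
    def can_fetch(self, useragent, url):
        return True

    def read(self):
        pass

default = urllib.robotparser.RobotFileParser()
permissive = IgnoreRobotFileParser()

# A parser that has never read a robots.txt denies everything,
# while the override always permits.
print(default.can_fetch("mybot", "https://nicegui.io/documentation"))     # False
print(permissive.can_fetch("mybot", "https://nicegui.io/documentation"))  # True
```

Whether trafilatura's crawler actually consults these rules at the point of failure is a separate question; this only illustrates the default-deny behaviour of an unread parser.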

adbar commented 1 month ago

Hi @Guthman, the website you're trying to crawl simply doesn't have internal links: it's a single-page web app, so there is nothing to crawl. The 404 error is irrelevant here.
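One way to confirm this diagnosis for any page is to parse its HTML and count same-host anchors, since those are what a focused crawler would follow. A stdlib-only sketch (the sample markup below is illustrative, not the actual nicegui.io HTML):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkCollector(HTMLParser):
    """Collect href values from <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def internal_links(html, base_url):
    """Return absolute same-host links found in the HTML."""
    parser = LinkCollector()
    parser.feed(html)
    host = urlparse(base_url).netloc
    absolute = (urljoin(base_url, href) for href in parser.links)
    return [u for u in absolute if urlparse(u).netloc == host]

# Illustrative single-page-app markup: no same-host anchors to follow.
sample = '<div id="app"></div><a href="https://github.com/zauberzeug/nicegui">GitHub</a>'
print(internal_links(sample, "https://nicegui.io/documentation"))  # []
```

An empty result means the crawler has no in-domain URLs to enqueue, regardless of how robots.txt is handled.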