Closed Guthman closed 1 month ago
```python
import urllib.robotparser

from trafilatura.spider import focused_crawler

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read() to do nothing, avoiding fetching/parsing robots.txt
        pass

crawl_start_url = 'https://nicegui.io/documentation'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1,
                                        rules=IgnoreRobotFileParser())
to_visit, known_links = focused_crawler(crawl_start_url, max_seen_urls=1000,
                                        max_known_urls=100000, todo=to_visit,
                                        known_links=known_links)
```
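As a side note, the robots-ignoring subclass can be exercised on its own with just the standard library, no trafilatura or network access needed, to confirm the override behaves as intended:

```python
import urllib.robotparser

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    """A RobotFileParser that permits every URL, regardless of robots.txt."""

    def can_fetch(self, useragent, url):
        # Permit everything unconditionally.
        return True

    def read(self):
        # Skip fetching/parsing robots.txt entirely.
        pass

parser = IgnoreRobotFileParser()
parser.read()  # no network access happens here
print(parser.can_fetch("mybot", "https://example.com/private"))  # True
```

A stock `RobotFileParser` that never successfully read a robots.txt would not behave this way, which is why the subclass sidesteps the 404 on `https://nicegui.io/robots.txt`.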
Result: ERROR:trafilatura.downloads:not a 200 response: 404 for URL https://nicegui.io/robots.txt
Not overriding the RobotFileParser class (i.e. not passing anything to the `rules` parameter of `focused_crawler()`) has the same result.
Hi @Guthman, the website you're trying to crawl simply doesn't have internal links; it's a single-page web app, so there is nothing to crawl. The 404 error is irrelevant here.