adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Crawler doesn't extract any links from Google Cloud documentation website #680

Closed: Guthman closed this issue 1 week ago

Guthman commented 3 weeks ago
# Start a focused crawl from the Google Cloud documentation homepage
from trafilatura.spider import focused_crawler

crawl_start_url = 'https://cloud.google.com/docs'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1000, max_known_urls=1000)

to_visit comes back empty and known_links contains only the input URL.

Ignoring robots.txt (using the parser subclass below) doesn't seem to help either...

import urllib.robotparser


class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass
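
For completeness, here is roughly how such a parser could be handed to the crawler. This is a minimal sketch under the assumption that focused_crawler's optional rules argument (present in recent trafilatura versions) accepts a RobotFileParser-like object:

import urllib.robotparser
from trafilatura.spider import focused_crawler

# Assumption: a permissive parser passed via `rules` makes the spider skip
# robots.txt restrictions entirely.
rules = IgnoreRobotFileParser()
to_visit, known_links = focused_crawler(
    homepage='https://cloud.google.com/docs',
    max_seen_urls=1000,
    max_known_urls=1000,
    rules=rules,
)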
adbar commented 3 weeks ago

That's correct, there is something wrong with relative link processing here.

adbar commented 1 week ago

Google is blacklisted by the underlying courlan package; this can be bypassed by passing strict=False to the extract_links() call in the spider module.
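
For illustration, the effect of the strict flag can be observed directly with courlan's check_url function. This is only a sketch; the commented results are assumptions and may vary with the courlan version:

from courlan import check_url

url = 'https://cloud.google.com/docs'

# With strict filtering, URLs on blacklisted platforms such as Google
# are expected to be rejected (None).
print(check_url(url, strict=True))

# Without strict filtering, the URL should pass and be returned as a
# (cleaned URL, domain) tuple.
print(check_url(url, strict=False))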