adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Crawler doesn't extract any links from Google Cloud documentation website #680

Closed: Guthman closed this issue 2 months ago

Guthman commented 2 months ago
from trafilatura.spider import focused_crawler
crawl_start_url = 'https://cloud.google.com/docs'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1000, max_known_urls=1000)

to_visit is empty and known_links only contains the input URL.

Ignoring robots.txt (using the rule below) doesn't seem to help...

import urllib.robotparser

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read() to do nothing, skipping robots.txt parsing entirely
        pass
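
Presumably the parser is handed to the crawler through its rules parameter, along the lines of the sketch below (the parameter appears in a later comment in this thread; the call here is purely illustrative):

from trafilatura.spider import focused_crawler

# Hand the permissive parser to the crawler so robots.txt is effectively ignored.
rules = IgnoreRobotFileParser()
to_visit, known_links = focused_crawler(
    homepage='https://cloud.google.com/docs',
    max_seen_urls=1000,
    max_known_urls=1000,
    rules=rules,
)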
adbar commented 2 months ago

That's correct; there is something wrong with relative link processing here.

adbar commented 2 months ago

Google is blacklisted by the underlying courlan package. This can simply be bypassed by passing the strict=False parameter to the extract_links() call in the spider module.
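
To illustrate the blacklist behaviour, here is a minimal sketch using courlan's check_url helper and its strict parameter (the expected outputs are assumptions based on the comment above, not verified output):

from courlan import check_url

# In strict mode, URLs from blacklisted platforms such as Google are rejected
# and check_url returns None; with strict=False they pass the filter.
print(check_url('https://cloud.google.com/docs', strict=True))   # expected: None
print(check_url('https://cloud.google.com/docs', strict=False))  # expected: (url, domain) tuple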

cjgalvin commented 3 days ago

Sorry to comment on a closed issue, but I wanted to check whether this solution still works. I ran into a similar result to the original poster's on several different websites, which led me to this issue.

It looks like the PR set the default to strict=False for extract_links, so I would expect the Google Cloud docs URL from the original post to work. However, I get the same result as the original poster: to_visit is empty and known_links only contains the input website. That's the same result I see with the other websites.

To be clear, my other websites may have different issues, and this question is focused on why I cannot crawl https://cloud.google.com/docs. The scraper works for other websites designed to be scraped. I am also able to download https://cloud.google.com/docs using bare_extraction.
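
For reference, a minimal sketch of the download check mentioned above, assuming trafilatura's top-level fetch_url and bare_extraction functions (variable names are illustrative):

from trafilatura import bare_extraction, fetch_url

# Downloading and extracting the page itself works, which suggests the problem
# is limited to link discovery in the spider module rather than to fetching/extraction.
downloaded = fetch_url('https://cloud.google.com/docs')
document = bare_extraction(downloaded, url='https://cloud.google.com/docs') if downloaded else None
print(document is not None)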

I am on trafilatura v1.12.2. Here is my code (I tried with and without the original post's IgnoreRobotFileParser rules):

import urllib.robotparser

from trafilatura.spider import focused_crawler

class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read() to do nothing, skipping robots.txt parsing entirely
        pass

url = "https://cloud.google.com/docs"
to_visit, known_links = focused_crawler(url, max_seen_urls=10, max_known_urls=10, rules=IgnoreRobotFileParser())

Thank you in advance.

Guthman commented 12 hours ago

I've moved on from trafilatura, as my use case requires more capabilities than this library offers (such as JavaScript support), so I don't know, sorry.