adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Crawler doesn't extract any links from Google Cloud documentation website #680

Closed. Guthman closed this issue 3 months ago.

Guthman commented 3 months ago
from trafilatura.spider import focused_crawler
crawl_start_url = 'https://cloud.google.com/docs'
to_visit, known_links = focused_crawler(homepage=crawl_start_url, max_seen_urls=1000, max_known_urls=1000)

to_visit is empty and known_links contains only the input URL.

Ignoring robots.txt (using the rule below) doesn't seem to help...

import urllib.robotparser


class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read() to do nothing, so robots.txt is never fetched or parsed
        pass
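For context, a minimal sketch of how such a parser would be wired into the crawl via focused_crawler's rules argument (the original post does not show this step; the same pattern appears in a later comment):

from trafilatura.spider import focused_crawler

# Pass the permissive parser so the crawler skips robots.txt restrictions
to_visit, known_links = focused_crawler(
    homepage='https://cloud.google.com/docs',
    max_seen_urls=1000,
    max_known_urls=1000,
    rules=IgnoreRobotFileParser(),
)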
adbar commented 3 months ago

That's correct; there is something wrong with relative link processing here.

adbar commented 3 months ago

Google is blacklisted by the underlying courlan package; this can simply be bypassed by passing strict=False to the extract_links() function in the spider module.
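To illustrate the blacklist, here is a minimal sketch using courlan's check_url helper; the expected return values are assumptions based on courlan's documented filtering, where strict mode rejects major platform domains such as Google:

from courlan import check_url

url = 'https://cloud.google.com/docs'

# Strict filtering rejects blacklisted platform domains
print(check_url(url, strict=True))   # expected: None

# Without strict filtering the URL passes and a (url, domain) tuple is returned
print(check_url(url, strict=False))  # expected: ('https://cloud.google.com/docs', 'cloud.google.com')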

cjgalvin commented 4 weeks ago

Sorry to comment on a closed issue, but I wanted to check if this solution still works. I ran into a similar result as the original poster on different websites. That led me to this issue.

It looks like the PR set the default to strict=False for extract_links, so I would expect the Google Cloud docs from the original post to work. However, I get the same result as the original poster: to_visit is empty and known_links contains only the input URL. That is also the result I see with the other websites.

To be clear, my other websites may have different issues, and this question is focused on why I cannot crawl https://cloud.google.com/docs. The scraper works for other websites designed to be scraped. I am also able to download https://cloud.google.com/docs using bare_extraction.
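For reference, a sketch of that download check, assuming the usual fetch_url plus bare_extraction combination from the trafilatura docs:

from trafilatura import bare_extraction, fetch_url

downloaded = fetch_url('https://cloud.google.com/docs')
if downloaded is not None:
    # In trafilatura 1.x, bare_extraction returns the content and metadata as a dict
    result = bare_extraction(downloaded)
    print(result.get('title'))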

I am on trafilatura v1.12.2. Here is my code (I tried with and without the original post's IgnoreRobotFileParser rules):

import urllib.robotparser

from trafilatura.spider import focused_crawler


class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read() to do nothing, so robots.txt is never fetched or parsed
        pass


url = "https://cloud.google.com/docs"
to_visit, known_links = focused_crawler(url, max_seen_urls=10, max_known_urls=10, rules=IgnoreRobotFileParser())

Thank you in advance.

Guthman commented 3 weeks ago

I've moved on from trafilatura, as my use case requires more capabilities than this library offers (such as JavaScript support), so I don't know, sorry.

adbar commented 3 weeks ago

@cjgalvin There might be a problem with the urllib3 dependency on this page. Try installing the optional pycurl package (which Trafilatura supports seamlessly); it is often better and faster.
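A quick way to verify the switch, sketched under the assumption that Trafilatura picks up pycurl automatically once it is installed (pip install pycurl):

try:
    import pycurl
    # pycurl.version is a string describing the libcurl build in use
    print('pycurl available:', pycurl.version)
except ImportError:
    print('pycurl not installed; trafilatura falls back to urllib3')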

cjgalvin commented 3 weeks ago

@Guthman no worries, thank you for the response.

@adbar okay, will give it a test.