Closed: Guthman closed this issue 2 months ago.
That's correct, there is something wrong with relative link processing here. Google is blacklisted by the underlying courlan package; this can simply be bypassed by passing strict=False to the extract_links() function in the spider module.
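For illustration, you can see the filter directly with courlan (a quick sketch only; whether this particular URL is caught by the strict blacklist is an assumption, and check_url is simply a convenient way to observe the behaviour):

from courlan import check_url

url = "https://cloud.google.com/docs/overview"

# With strict filtering the blacklist applies, so the URL may be rejected (None)
print(check_url(url, strict=True))

# With strict=False the URL should pass and come back as a (cleaned URL, domain) tuple
print(check_url(url, strict=False))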
Sorry to comment on a closed issue, but I wanted to check whether this solution still works. I ran into the same result as the original poster on several different websites, which led me to this issue.
It looks like the PR set the default to strict=False for extract_links, so I would expect the Google Cloud docs from the original post to work. However, I get the same result as the original poster: to_visit is empty and known_links contains only the input URL. I see the same thing with my other websites.
To be clear, my other websites may have different issues, and this question is focused on why I cannot crawl https://cloud.google.com/docs. The scraper works for other websites designed to be scraped. I am also able to download https://cloud.google.com/docs using bare_extraction.
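For reference, that download check was roughly the following (a sketch; if I remember the v1.x behaviour correctly, bare_extraction returns a dict of metadata and text):

from trafilatura import fetch_url, bare_extraction

# Downloading and extracting the page itself works, so the problem
# seems limited to the crawling / link-discovery step.
downloaded = fetch_url("https://cloud.google.com/docs")
result = bare_extraction(downloaded)
print(result["title"] if result else "extraction failed")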
I am on trafilatura v1.12.2. Here is my code (I tried with and without the original post's IgnoreRobotFileParser rules):
import urllib.robotparser

from trafilatura.spider import focused_crawler


class IgnoreRobotFileParser(urllib.robotparser.RobotFileParser):
    def can_fetch(self, useragent, url):
        # Always return True to allow fetching any URL regardless of robots.txt
        return True

    def read(self):
        # Override read method to do nothing, avoiding parsing the robots.txt
        pass


url = "https://cloud.google.com/docs"
to_visit, known_links = focused_crawler(url, max_seen_urls=10, max_known_urls=10, rules=IgnoreRobotFileParser())
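In case it is useful, this is a rough way to check whether the URL filtering step is what leaves to_visit empty (a sketch only: check_url is courlan's public helper, and the urljoin handling of relative links is my own approximation of what the spider does internally):

from urllib.parse import urljoin

from lxml import html
from courlan import check_url
from trafilatura import fetch_url

base = "https://cloud.google.com/docs"
downloaded = fetch_url(base)
tree = html.fromstring(downloaded)

# Resolve relative hrefs against the base URL, then count how many
# candidates survive courlan's URL check in strict vs. lenient mode.
hrefs = [urljoin(base, h) for h in tree.xpath("//a/@href")]
kept_strict = [h for h in hrefs if check_url(h, strict=True)]
kept_lenient = [h for h in hrefs if check_url(h, strict=False)]
print(len(hrefs), len(kept_strict), len(kept_lenient))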
Thank you in advance.
I've moved on from trafilatura, as my use case requires more capabilities than this library can offer (such as JavaScript support), so I don't know, sorry.
to_visit is empty and known_links only contains the input URL.
Ignoring robots.txt (using the rule below) doesn't seem to help...