bezir closed this issue 6 months ago
The task is complex and the focused crawler integrated in Trafilatura does not solve all problems, so I cannot answer this question in general. Do you have a precise example that I can reproduce?
Here is my code with an example input.
Code:
```python
from trafilatura.spider import focused_crawler

def crawl_homepage(homepage_url, max_iteration, output_file):
    # First pass: seed the crawl frontier from the homepage
    to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=1)
    i = 0
    while i < max_iteration:
        # Resume the crawl with the state from the previous iteration
        to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=10, max_known_urls=300000, todo=to_visit, known_links=known_links)
        print("LEN", len(known_links))
        save_links_to_file(known_links, output_file + ".json")  # save links to file after every iteration (own helper)
        i += 1
    print(f"Finished crawling. Total iterations: {i}")

homepage_url = "https://tr.motorsport.com"
```
P.S.: this page has nothing to do with my actual task; I am just sharing a random URL that fails in the extraction process.
If you set the logging level to DEBUG, you'll see that the download fails with a 403 error, so there are no links to extract.
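For reference, a minimal way to enable that debug output (this only uses the standard logging module, nothing Trafilatura-specific):

```python
import logging

# Show debug messages, including failed downloads (e.g. HTTP 403)
logging.basicConfig(level=logging.DEBUG)
```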
How can I fix this? I want to be able to get the news links from the website.
You have to use a more complex download utility to make sure you get the full content, then you can use Trafilatura on the HTML.
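For instance, here is a minimal sketch of that idea using requests with a browser-like User-Agent; the header string and the link-collection step are illustrative assumptions, and some sites will still require a headless browser:

```python
import requests
import trafilatura
from lxml import html
from urllib.parse import urljoin

# Download the page yourself with browser-like headers (assumed to be enough here)
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/115.0"}
response = requests.get("https://tr.motorsport.com", headers=headers, timeout=30)
response.raise_for_status()

# Collect absolute links from the downloaded homepage
tree = html.fromstring(response.text)
links = {urljoin(response.url, href) for href in tree.xpath("//a/@href")}
print(len(links), "links found")

# The same HTML can then be passed to Trafilatura for content extraction
text = trafilatura.extract(response.text)
```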
Thank you Adrien!
Hello,
focused_crawler cannot harvest URLs from certain websites. Is there any parameter or method to overcome this problem?