adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

focused_crawl returns nothing #589

Closed by bezir 6 months ago

bezir commented 6 months ago

Hello,

focused_crawler fails to extract URLs from certain websites. Is there a parameter or method to work around this problem?

adbar commented 6 months ago

The task is complex, and the focused crawler integrated into Trafilatura does not solve every case. I cannot answer this question in general terms. Do you have a precise example I can reproduce?

bezir commented 6 months ago

Here is my code with an example input.

Code:

from trafilatura.spider import focused_crawler

def crawl_homepage(homepage_url, max_iteration, output_file):
    # Seed the crawl: visit the homepage once to collect initial links
    to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=1)

    i = 0
    while i < max_iteration:
        # Resume the crawl from the stored frontier and known-link set
        to_visit, known_links = focused_crawler(homepage_url, max_seen_urls=10, max_known_urls=300000, todo=to_visit, known_links=known_links)
        print("LEN", len(known_links))
        save_links_to_file(known_links, output_file + ".json")  # own helper: save links to file after every iteration
        i += 1
    print(f"Finished crawling. Total iterations: {i}")

homepage_url = "https://tr.motorsport.com"

P.S.: the page has nothing to do with my task; I'm sharing a random URL that fails during extraction.

adbar commented 6 months ago

If you set the logging level to DEBUG, you'll see that the download fails (403 error), so there are no links to extract.
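
A minimal way to surface those messages is to raise the root logging level with the standard logging module before starting the crawl:

import logging

# Print Trafilatura's download and crawl diagnostics, including HTTP errors
logging.basicConfig(level=logging.DEBUG)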

bezir commented 6 months ago

How can I fix it? I want to be able to get the news links from the website.

adbar commented 6 months ago

You have to use a more capable download utility to make sure you get the full content; then you can run Trafilatura on the HTML.
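
As a minimal sketch (assuming the site accepts a browser-like User-Agent; requests and lxml are external dependencies, and the header value is illustrative):

import requests
import trafilatura
from lxml import html

# Fetch the page with a browser-like User-Agent (illustrative value)
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"}
response = requests.get("https://tr.motorsport.com", headers=headers, timeout=30)

# Collect absolute links from the downloaded HTML
tree = html.fromstring(response.content)
tree.make_links_absolute("https://tr.motorsport.com")
links = [link for _, _, link, _ in tree.iterlinks()]

# The same HTML can be passed to Trafilatura for text extraction
text = trafilatura.extract(response.text)

Whether this succeeds depends on the site's bot protection; harder cases may require a headless browser.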

bezir commented 5 months ago

Thank you Adrien!