adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Restrictions on Web Crawling #358

Closed conceptofmind closed 1 year ago

conceptofmind commented 1 year ago

Hi all,

I was wondering if there were specific restrictions on web crawling certain sites?

For example if one tried to web crawl Medscape:

from trafilatura.spider import focused_crawler
from trafilatura.spider import is_still_navigation

homepage = 'https://www.medscape.com/'
# starting a crawl
to_visit, known_urls = focused_crawler(homepage, max_seen_urls=1000000, max_known_urls=1000000)
print(known_urls)
is_still_navigation(to_visit)

This returns only a single link: {'https://www.medscape.com/'}

Any input would be greatly appreciated.

Thank you,

Enrico

adbar commented 1 year ago

Hi @conceptofmind, thanks for your feedback! There is a bug somewhere in the link analysis: relative links are not being handled as they should be.
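For context, resolving relative links against the page they were found on is what the standard library's urljoin does; a minimal sketch with hypothetical hrefs (the paths below are illustrative, not taken from Medscape's actual markup):

```python
from urllib.parse import urljoin

base = "https://www.medscape.com/"
# Hypothetical relative and absolute hrefs as they might appear in the homepage markup.
hrefs = ["/today/news", "viewarticle/123", "https://www.medscape.com/index"]

# Resolve each link against the page it was found on before queueing it for crawling.
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
# → ['https://www.medscape.com/today/news',
#    'https://www.medscape.com/viewarticle/123',
#    'https://www.medscape.com/index']
```

If this resolution step is skipped or applied to the wrong base URL, relative links are silently dropped and a crawl can stall at the homepage, as observed above.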

adbar commented 1 year ago

It turns out the problem also involves the parsing of https://www.medscape.com/robots.txt.

The line Disallow: /*/noscan/ uses a wildcard pattern, which is ambiguous, so the standard-library module urllib.robotparser used here may not be at fault.
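A minimal sketch of the behavior in question, assuming urllib.robotparser's literal prefix matching (the user agent name and URL below are illustrative): robotparser does not expand `*` wildcards in rule paths, so a rule a wildcard-aware parser would apply to `/foo/noscan/...` is treated as the literal prefix `/*/noscan/` and does not match.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical minimal robots.txt mirroring the ambiguous Medscape rule.
rules = """User-agent: *
Disallow: /*/noscan/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# urllib.robotparser compares rule paths as literal prefixes; the '*' is not
# treated as a wildcard, so this URL is reported as fetchable even though a
# wildcard-aware parser (per the Google/RFC 9309 extensions) would block it.
allowed = rp.can_fetch("mybot", "https://www.medscape.com/foo/noscan/page")
print(allowed)
```

This divergence between literal and wildcard interpretations is exactly why such rules are ambiguous for crawlers built on the standard library.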