adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML
https://trafilatura.readthedocs.io
Apache License 2.0

Restrictions on Web Crawling #358

Closed conceptofmind closed 1 year ago

conceptofmind commented 1 year ago

Hi all,

I was wondering if there were specific restrictions on web crawling certain sites?

For example if one tried to web crawl Medscape:

from trafilatura.spider import focused_crawler
from trafilatura.spider import is_still_navigation

homepage = 'https://www.medscape.com/'
# starting a crawl
to_visit, known_urls = focused_crawler(homepage, max_seen_urls=1000000, max_known_urls=1000000)
print(known_urls)
is_still_navigation(to_visit)

This returns only a single link: {'https://www.medscape.com/'}

Any input would be greatly appreciated.

Thank you,

Enrico

adbar commented 1 year ago

Hi @conceptofmind, thanks for your feedback! There is a bug somewhere in the link analysis: relative links are not being handled as they should be.
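For context, resolving relative links against the page they were found on is what the standard library's urljoin does; a minimal sketch with hypothetical hrefs (the paths below are illustrative, not taken from Medscape's actual markup):

```python
from urllib.parse import urljoin

base = "https://www.medscape.com/"
# Hypothetical relative and absolute hrefs as they might appear in the homepage markup.
hrefs = ["/today/news", "viewarticle/123", "https://www.medscape.com/index"]

# Resolve each link against the page it was found on before queueing it for crawling.
absolute = [urljoin(base, h) for h in hrefs]
print(absolute)
# → ['https://www.medscape.com/today/news',
#    'https://www.medscape.com/viewarticle/123',
#    'https://www.medscape.com/index']
```

If this resolution step is skipped or applied to the wrong base URL, relative links are silently dropped and a crawl can stall at the homepage, as observed above.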

adbar commented 1 year ago

It turns out the problem also involves the parsing of https://www.medscape.com/robots.txt.

The line Disallow: /*/noscan/ uses a wildcard pattern, which is ambiguous, so the standard-library module urllib.robotparser used here may not be at fault.
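A minimal sketch of the behavior in question, assuming urllib.robotparser's literal prefix matching (the user agent name and URL below are illustrative): robotparser does not expand `*` wildcards in rule paths, so a rule a wildcard-aware parser would apply to `/foo/noscan/...` is treated as the literal prefix `/*/noscan/` and does not match.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical minimal robots.txt mirroring the ambiguous Medscape rule.
rules = """User-agent: *
Disallow: /*/noscan/
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

# urllib.robotparser compares rule paths as literal prefixes; the '*' is not
# treated as a wildcard, so this URL is reported as fetchable even though a
# wildcard-aware parser (per the Google/RFC 9309 extensions) would block it.
allowed = rp.can_fetch("mybot", "https://www.medscape.com/foo/noscan/page")
print(allowed)
```

This divergence between literal and wildcard interpretations is exactly why such rules are ambiguous for crawlers built on the standard library.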