conceptofmind closed this issue 1 year ago.
Hi @conceptofmind, thanks for your feedback! There is a bug somewhere in the link analysis and relative links are not being handled as they should.
It turns out the problem also involves the parsing of robots.txt at https://www.medscape.com/robots.txt. The line Disallow: /*/noscan/ is ambiguous, and the standard module urllib.robotparser used here may not be at fault.
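To make the ambiguity concrete, here is a minimal sketch that checks URLs against that single rule in two ways: once through urllib.robotparser, and once with a small wildcard-aware matcher. The helper names (`stdlib_allows`, `wildcard_disallows`) and the example path `/viewarticle/noscan/123` are made up for illustration, and the rule set is reduced to the one line quoted above rather than the full Medscape file. This is not the crawler's actual code.

```python
import re
from urllib.robotparser import RobotFileParser

# Assumption: only the rule discussed above; the real robots.txt has more entries.
RULES = ["User-agent: *", "Disallow: /*/noscan/"]


def stdlib_allows(url: str, agent: str = "*") -> bool:
    """Ask urllib.robotparser, feeding it the rules directly (no network access)."""
    parser = RobotFileParser()
    parser.parse(RULES)
    return parser.can_fetch(agent, url)


def wildcard_disallows(path: str, pattern: str = "/*/noscan/") -> bool:
    """Hypothetical helper: read '*' as 'any run of characters', anchored at the path start."""
    regex = ".*".join(re.escape(part) for part in pattern.split("*"))
    return re.match(regex, path) is not None


if __name__ == "__main__":
    # The second path is an invented example, not a known Medscape URL.
    for path in ("/", "/viewarticle/noscan/123"):
        url = "https://www.medscape.com" + path
        print(path, "| stdlib allows:", stdlib_allows(url),
              "| wildcard reading disallows:", wildcard_disallows(path))
```

If the two columns disagree for a path containing /noscan/, the rule is being read literally in one case and as a wildcard in the other, which would help narrow down where the crawl goes wrong.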
Hi all,
I was wondering if there were specific restrictions on web crawling certain sites?
For example, if one tries to crawl Medscape:
This returns only a single link:
{'https://www.medscape.com/'}
Any input would be greatly appreciated.
Thank you,
Enrico