filiph / linkcheck

Fast link checker
https://pub.dartlang.org/packages/linkcheck
MIT License

Docs: Does `linkcheck` process / respect `robots.txt`? #65

Closed: cebreus closed this issue 3 years ago

cebreus commented 3 years ago

Hi, I'm not sure whether linkcheck respects robots.txt. The sentence in the README.md isn't clear to me:

> It goes without saying that linkcheck honors robots.txt and throttles itself when accessing websites.

If it does, it would be nice to have a parameter to disable this behaviour. Example: some pages are disallowed in robots.txt but should still be checked.
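
For illustration, a hypothetical robots.txt like the one below hides a `/drafts/` section from crawlers; a checker that honors it would skip those pages even though their links may still need validation:

```
User-agent: *
Disallow: /drafts/
```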

filiph commented 3 years ago

Yes, linkcheck respects robots.txt. You're making a good point about this behavior not always being what you want. On the other hand:

  1. I try to be really careful about allowing people to crawl the web with no netiquette considerations. After all, linkcheck can't know whether it's crawling the developer's own site or someone else's.
  2. I was operating under the assumption that most sites, most of the time, mostly want to check public links (those accessible from search engines).

But I agree that there might be a large portion of sites that have subsets of pages under robots.txt that they still want to linkcheck.

I'd be interested in more input from more users, with concrete use cases.
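
For context, here is a minimal sketch of what honoring robots.txt typically means for a crawler. This is generic Python using the standard library's `urllib.robotparser`, not linkcheck's actual Dart implementation, and the robots.txt content is a hypothetical example:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one section is off-limits, and a
# crawl delay asks crawlers to throttle themselves.
robots_txt = """\
User-agent: *
Disallow: /drafts/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

# A polite crawler asks before fetching each URL...
print(parser.can_fetch("linkcheck", "https://example.com/about"))      # True
print(parser.can_fetch("linkcheck", "https://example.com/drafts/wip")) # False

# ...and respects the requested delay between requests.
print(parser.crawl_delay("linkcheck"))  # 2
```

An option to ignore robots.txt would essentially amount to bypassing the `can_fetch` gate above; the open question is whether offering that escape hatch is a good idea.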

cebreus commented 3 years ago

I fully agree with the default of respecting robots.txt. This should stay as is.

What kind of input are you looking for?

Could you please consider changing the text in README.md for more clarity?

Original: "It goes without saying that linkcheck honors robots.txt and throttles itself when accessing websites."

New: "linkcheck fully respects the rules defined in robots.txt"

filiph commented 3 years ago

Thanks! I clarified the language in the README. Please create a new issue (copy-pasting is okay!) if you want to pursue the "turn off robots.txt" option. I think that needs a bit more thought on my side.