Closed cebreus closed 3 years ago
Yes, linkcheck respects robots.txt. You're making a good point about this behavior not always being what you want. On the other hand: linkcheck can't know whether it's crawling the developer's own site or someone else's. But I agree that there might be a large portion of sites that have subsets of pages under robots.txt that they still want to linkcheck.
I'd be interested in more input from more users, with concrete use cases.
I fully agree with respecting robots.txt by default. This should stay as is.
What kind of response do you expect?
Something like use cases? E.g. robots.txt should be ignored in certain environments: Development and Integration environments shouldn't have any restrictions, while Test, Pre-prod and Production should respect robots.txt because they contain data from pre-production/production.
Another use case: I have semi-public URLs for internal purposes, or special landing pages for marketing campaigns. Our SEO expert doesn't want these pages indexed, but they must still be fully functional.
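For context, the default behavior both use cases run into can be sketched with Python's standard-library urllib.robotparser. This is only an illustration, not linkcheck's actual implementation, and the ignore_robots switch is a hypothetical version of the requested opt-out parameter:

```python
# Sketch of "honor robots.txt by default, with an opt-out" (illustrative
# only; not linkcheck's real code). The ignore_robots flag is hypothetical.
from urllib.robotparser import RobotFileParser

def is_crawlable(robots_lines, url, user_agent="*", ignore_robots=False):
    """Return True if `url` may be fetched under the given robots.txt rules."""
    if ignore_robots:
        # Hypothetical override, e.g. for Development/Integration environments.
        return True
    parser = RobotFileParser()
    parser.parse(robots_lines)  # robots.txt content as a list of lines
    return parser.can_fetch(user_agent, url)

robots = ["User-agent: *", "Disallow: /campaign/"]
print(is_crawlable(robots, "https://example.com/campaign/spring"))  # False: blocked by default
print(is_crawlable(robots, "https://example.com/campaign/spring",
                   ignore_robots=True))                             # True: override requested here
```

With such a switch, a marketing landing page under a Disallow rule would still get its links verified, while the default stays safe for crawling third-party sites.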
Could you please consider changing the text in README.md for more clarity?
Original: „It goes without saying that linkcheck honors robots.txt and throttles itself when accessing websites.“
New: „linkcheck fully respects the definitions in robots.txt.“
Thanks! I clarified the language in the README. Please create a new issue (copypasting is okay!) if you want to address the "turn off robots.txt" option. I think that needs a bit more thought on my side.
Hi, I'm not sure if linkcheck respects robots.txt. This sentence in the README.md isn't clear to me: „It goes without saying that linkcheck honors robots.txt and throttles itself when accessing websites.“
If it does, it would be nice to have a parameter to avoid this behaviour. Example: some pages are disabled in robots.txt, but should still be checked.
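To make the request concrete, here is a hypothetical robots.txt for that situation: the paths below are disallowed so search engines skip them, yet the links on those pages should still be verified.

```
# Hypothetical robots.txt: these pages are hidden from search
# engines but should still be link-checked.
User-agent: *
Disallow: /internal/
Disallow: /lp/spring-campaign/
```

Today, a robots.txt-respecting checker would skip everything under /internal/ and /lp/spring-campaign/; an opt-out parameter would let it check them anyway.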