civichackingagency / scangov

Government digital experience monitor
https://scangov.org

Test for robots #29

Closed lukefretwell closed 2 months ago

lukefretwell commented 6 months ago

Reference

mgifford commented 3 months ago

You got one. https://gov-metadata.civichackingagency.org/robots.txt

Or are you searching for it in your site scans?

lukefretwell commented 3 months ago

@mgifford will search for it in site scans.

adriancooke commented 3 months ago

Hi @lukefretwell, thanks for planning to remove robots metadata from the score. I’m adding some info adapted from bug #99 at your request, in case it’s helpful.

A while back, Google decided that the way to remove content from appearing in search results is to set a robots noindex for every URL you want removed. So a robots metadata value is needed when you want to block content from being indexed in public results, or when you want to fine-tune how search engines present it (e.g. nosnippet, which means index it but don’t excerpt a description or other details from the page). A functionally equivalent option is to return an HTTP header X-Robots-Tag: otherbot: noindex, nofollow in the server response. This is the only way you can tell a search engine not to index a non-HTML resource. Most (all?) major search engines are opt-out, so explicitly allowing follow and index is not necessary.
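For example, since the header is the only option for non-HTML resources, a scan could read it with something like the sketch below (Python standard library only; the PDF URL and user-agent string are placeholders, not real values):

import urllib.request

def x_robots_tag(url):
    # Only the response headers are needed, so a HEAD request is enough
    req = urllib.request.Request(url, method="HEAD", headers={"User-Agent": "scangov-sketch"})
    with urllib.request.urlopen(req) as resp:
        # e.g. "noindex, nofollow", or None if the header is absent
        return resp.headers.get("X-Robots-Tag")

print(x_robots_tag("https://example.gov/some-report.pdf"))  # placeholder URL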

Robots.txt is somewhat different. It’s a way to instruct crawlers where to go or not go. You can’t use it to remove something from the index, but you can use it to:

  1. Request crawlers ignore certain paths
  2. Request crawlers slow their crawl rate
  3. Request a specific crawler ignore the site (e.g. User-agent: GPTBot followed by Disallow: /)
  4. Link to XML sitemaps

Merkle’s robots.txt validator is worth a look if you’re not already familiar with it. Worth noting: if you want something de-indexed you need to allow it to be crawled, or else noindex will have no effect.
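To make the list above concrete, here’s a minimal sketch (Python standard library only, placeholder domain) of how a scan could pull each of those pieces out of a robots.txt:

import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://example.gov/robots.txt")  # placeholder domain
rp.read()  # fetch and parse the file

# 1. Are all crawlers asked to skip a given path?
print(rp.can_fetch("*", "/"))
# 2. Is a slower crawl rate requested?
print(rp.crawl_delay("*"))
# 3. Is a specific crawler told to stay away entirely?
print(rp.can_fetch("GPTBot", "/"))
# 4. Which XML sitemaps are linked? (None if there are no Sitemap lines)
print(rp.site_maps())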

For your item “check for robots.txt file”: based on M-23-22 (PDF), I think it makes sense to do this for a US gov site, to ensure that site-wide crawl is not blocked, and to ensure there is a link to an XML sitemap scoped to the same domain as the robots.txt file (i.e. the domain being scanned). If you find an XML sitemap for a different domain you could flag it, as it won’t be parsable.
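A rough sketch of that same-domain check, assuming the scanner already has the Sitemap URLs from robots.txt (the function name and example values are made up):

from urllib.parse import urlparse

def sitemap_findings(robots_url, sitemap_urls):
    # Flag sitemap links whose host differs from the robots.txt host
    robots_host = urlparse(robots_url).netloc.lower()
    out_of_scope = [s for s in (sitemap_urls or [])
                    if urlparse(s).netloc.lower() != robots_host]
    return bool(sitemap_urls), out_of_scope

# Made-up example values:
print(sitemap_findings("https://www.wv.gov/robots.txt",
                       ["https://www.wv.gov/sitemap.xml"]))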

For the item “validate that there’s either a robots.txt file or the metatag or robots response header”: those aren’t comparing the same thing. Robots.txt should exist, ideally, should contain a sitemap link, and shouldn’t block crawl from the homepage. If your goal is to catch accidentally wrong metadata, you could check that the homepage does not block indexing or crawl via robots metadata or X-Robots-Tag. Otherwise I’m not sure you can conclude anything from the absence of robots meta elements or X-Robots-Tag headers.
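If it’s useful, that homepage check could look roughly like this (a sketch only; standard library, and a very simple meta scan rather than a production HTML parser):

import urllib.request
from html.parser import HTMLParser

class RobotsMetaScan(HTMLParser):
    def __init__(self):
        super().__init__()
        self.directives = []
    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        name = (a.get("name") or "").lower()
        # <meta name="robots" ...> plus crawler-specific variants
        if tag == "meta" and name in ("robots", "googlebot", "bingbot"):
            self.directives.append((a.get("content") or "").lower())

def homepage_blocked(url):
    req = urllib.request.Request(url, headers={"User-Agent": "scangov-sketch"})
    with urllib.request.urlopen(req) as resp:  # follows redirects by default
        header = (resp.headers.get("X-Robots-Tag") or "").lower()
        scan = RobotsMetaScan()
        scan.feed(resp.read().decode("utf-8", errors="replace"))
    found = [header] + scan.directives
    return any("noindex" in d or "nofollow" in d for d in found)

print(homepage_blocked("https://www.wv.gov/"))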

Refs:

  1. Robots meta tag - Crawling directive
  2. Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
  3. robots.txt Validator and Testing Tool
  4. Delivering a Digital-First Public Experience (M-23-22, PDF)

Narlotl commented 3 months ago

"and shouldn’t block crawl from the homepage" I like this as a way to validate being public to search engines, but if we don't know the homepage, for example, West Virginia doesn't respond, but the homepage exists, so how should we handle making sure the homepage isn't blocked.

Should it just check to make sure Disallow: / and Disallow: * aren't in there?

adriancooke commented 3 months ago

@Narlotl that’s an interesting example. If you request /robots.txt and get a 404 then that site doesn’t have a valid robots.txt and won’t block crawl that way. It could still block if the server responds with X-Robots-Tag: nofollow. In the case above:

curl -IL "https://www.wv.gov/"                  

HTTP/1.1 302 Redirect
Location: https://www.wv.gov/Pages/default.aspx

HTTP/1.1 200 OK

contains no X-Robots-Tag header, so they aren’t blocking crawl in either of the standard ways. For a gov site I would recommend alerting that a 302 (temporary redirect) from the site root is not a great idea, but that’s a different issue. If your crawler gets to /Pages/default.aspx and then gets 200s leading to parsable pages on most of the internal links it finds there, then the site isn’t blocking crawl from the homepage. Here are more details about what Lighthouse looks for when checking that a site is crawlable.
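For what it’s worth, both findings (a temporary redirect at the root and any X-Robots-Tag on the final response) can be picked up in one pass; a sketch using the third-party requests library:

import requests

def root_findings(url):
    resp = requests.get(url, timeout=30)  # redirects are followed by default
    findings = []
    # resp.history holds the redirect chain; 302/307 are temporary redirects
    if any(r.status_code in (302, 307) for r in resp.history):
        findings.append("temporary redirect from site root")
    if "X-Robots-Tag" in resp.headers:
        findings.append("X-Robots-Tag: " + resp.headers["X-Robots-Tag"])
    return resp.url, findings

print(root_findings("https://www.wv.gov/"))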

As for what to check if you get a 200 for /robots.txt, the Lighthouse page has a good list. If the syntax is all correct, then to be crawlable this should not be there:

user-agent: *
disallow: /

or any variation that blocks a specific user agent like Google, Bing, etc.
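On the earlier question about just checking for Disallow: / and Disallow: *, a slightly more careful, group-aware check might look like this sketch (deliberately simplified: it ignores Allow rules and wildcard paths, and Disallow: * isn’t standard syntax but is included since it came up above):

def blocking_groups(robots_txt):
    # Returns the user-agents whose rule group contains a site-wide disallow
    blocked, group_agents, in_rules = [], [], False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        field, value = [p.strip() for p in line.split(":", 1)]
        field = field.lower()
        if field == "user-agent":
            if in_rules:
                group_agents, in_rules = [], False  # a new group starts
            group_agents.append(value)
        elif field in ("disallow", "allow"):
            in_rules = True
            if field == "disallow" and value in ("/", "*"):
                blocked.extend(a for a in group_agents if a not in blocked)
    return blocked

print(blocking_groups("user-agent: *\ndisallow: /"))  # ['*']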

Narlotl commented 3 months ago

@adriancooke I made the page for robots; let me know what you think of the parameters and scoring. https://gov-metadata.civichackingagency.org/?field=robots&level=1 https://gov-metadata.civichackingagency.org/profile/?domain=18f.gov#robots

adriancooke commented 2 months ago

@Narlotl I’m getting timeouts when I try to load those URLs.

Narlotl commented 2 months ago

We changed domains; it looks like we need to fix redirection. Here's the new site: https://scangov.org/?field=robots&level=1

lukefretwell commented 2 months ago

@adriancooke closing this but let us know if there's anything not working as you think it should. Thank you for contributing to the project!