Closed lukefretwell closed 2 months ago
You got one. https://gov-metadata.civichackingagency.org/robots.txt
Or are you searching for it in your site scans?
@mgifford will search for it in site scans.
Hi @lukefretwell, thanks for planning to remove robots metadata from the score. I’m adding some info adapted from bug #99 at your request, in case it’s helpful.
A while back, Google decided that the way to remove content from appearing in search results was to set a robots `noindex` for every URL you want removed. So a robots metadata value is needed when you want to block content from being indexed in public results, or when you want to fine-tune how search engines present it (e.g. `nosnippet`, which means index it but don’t excerpt a description or other details from the page). A functionally equivalent option is to return an HTTP header such as `X-Robots-Tag: otherbot: noindex, nofollow` in the server response. This is the only way to tell a search engine not to index a non-HTML resource. Most (all?) major search engines are opt-out, so explicitly allowing `follow` and `index` is not necessary.
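To illustrate the user-agent prefix form mentioned above, here is a minimal sketch (mine, not part of any scanner) of a function that decides whether an `X-Robots-Tag` value blocks indexing for a given crawler:

```python
# Sketch only: a simplified interpreter for X-Robots-Tag header values.
# Directive names follow the published spec, but edge cases such as
# unavailable_after dates or multiple combined headers are not handled.
KNOWN_DIRECTIVES = {
    "all", "noindex", "nofollow", "none", "noarchive", "nosnippet",
    "notranslate", "noimageindex", "indexifembedded", "unavailable_after",
    "max-snippet", "max-image-preview", "max-video-preview",
}

def x_robots_blocks_indexing(header_value: str, bot: str = "*") -> bool:
    """True if this X-Robots-Tag value blocks indexing for `bot`."""
    head, sep, tail = header_value.partition(":")
    if sep and head.strip().lower() not in KNOWN_DIRECTIVES:
        # Prefixed form like "otherbot: noindex, nofollow" -- these
        # directives apply only to the named crawler.
        if head.strip().lower() != bot.lower():
            return False
        directives = tail
    else:
        # Unprefixed form applies to every crawler.
        directives = header_value
    tokens = {t.strip().lower() for t in directives.split(",")}
    return bool(tokens & {"noindex", "none"})
```

So `x_robots_blocks_indexing("otherbot: noindex, nofollow")` is `False` for an unnamed crawler but `True` when called with `bot="otherbot"`, while a bare `"noindex"` blocks everyone.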
Robots.txt is somewhat different. It’s a way to instruct crawlers where to go or not go. You can’t use it to remove something from the index, but you can use it to keep a crawler out of part or all of a site, for example:

```
User-agent: GPTBot
Disallow: /
```
Merkle’s robots.txt validator is worth a look, if you’re not already familiar with it. Worth noting: if you want something de-indexed you need to allow it to be crawled, or else `noindex` will have no effect.
For your item “check for robots.txt file”: based on M-23-22 (PDF), I think it makes sense to do this, to ensure that site-wide crawl is not blocked for a US gov site, and to check that there is a link to an XML sitemap scoped to the same domain as the robots.txt file/the domain being scanned. If you find an XML sitemap for a different domain you could flag it, as it won’t be parsable.
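The sitemap-scoping part of that check could be sketched like this (the function name and the exact-hostname comparison are my assumptions for illustration, not ScanGov’s implementation; a real scanner might also accept subdomains):

```python
from urllib.parse import urlparse

def split_sitemaps_by_scope(robots_txt: str, robots_url: str):
    """Split Sitemap: entries in a robots.txt body into those on the
    same host as robots_url and those pointing at another domain."""
    host = urlparse(robots_url).hostname
    same, other = [], []
    for line in robots_txt.splitlines():
        # partition at the first colon keeps the colon inside the URL intact
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap":
            url = value.strip()
            (same if urlparse(url).hostname == host else other).append(url)
    return same, other
```

Entries in the second list would be the ones to flag as out of scope.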
For the item “validate that there's either a robots.txt file or the metatag or robots response header”: that won’t be comparing the same thing. Robots.txt should exist, ideally, should contain a sitemap, and shouldn’t block crawl from the homepage. If your goal is to catch accidentally wrong metadata, you could check that the homepage does not block indexing or crawl via robots metadata or the X-Robots-Tag header. Otherwise I’m not sure you can conclude anything from the absence of robots meta elements or X-Robots-Tag headers.
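A homepage metadata check like that could be sketched with the stdlib HTML parser; this is illustrative only (names are mine) and covers only the meta element, not the X-Robots-Tag header, which would need a separate check:

```python
from html.parser import HTMLParser

class RobotsMetaCollector(HTMLParser):
    """Collect directive tokens from <meta name="robots" content="..."> tags."""
    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "robots":
            for token in a.get("content", "").split(","):
                self.directives.add(token.strip().lower())

def homepage_blocks_indexing(html: str) -> bool:
    """True if the page's robots meta directives block indexing."""
    collector = RobotsMetaCollector()
    collector.feed(html)
    return bool(collector.directives & {"noindex", "none"})
```

A page with `<meta name="robots" content="nosnippet">` would pass, while `content="noindex, nofollow"` would be flagged.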
Refs:
- Robots meta tag - Crawling directive
- Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
- robots.txt Validator and Testing Tool
- Delivering a Digital-First Public Experience M-23-22 (PDF)
> and shouldn’t block crawl from the homepage

I like this as a way to validate being public to search engines, but what if we don’t know the homepage? For example, West Virginia’s site root doesn’t respond, but the homepage exists, so how should we handle making sure the homepage isn’t blocked?
Should it just check to make sure `Disallow: /` and `Disallow: *` aren’t in there?
@Narlotl that’s an interesting example. If you request `/robots.txt` and get a 404, then that site doesn’t have a valid robots.txt and won’t block crawl that way. It could still block if the server responds with `X-Robots-Tag: nofollow`. In the case above:
```
curl -IL "https://www.wv.gov/"
HTTP/1.1 302 Redirect
Location: https://www.wv.gov/Pages/default.aspx
HTTP/1.1 200 OK
```
the response contains no X-Robots-Tag, so they aren’t blocking crawl in either of the standard ways. For a gov site I would recommend flagging that a 302 (temporary redirect) from the site root is not a great idea, but that’s a different issue. If your crawler gets to `/Pages/default.aspx` and then gets 200s leading to parsable pages on most internal links it finds there, then the site’s not blocking crawl from the homepage. Here are more details about what Lighthouse looks for when checking that a site is crawlable.
As for what to check for if you get a 200 for `/robots.txt`, the Lighthouse page has a good list. If the syntax is all correct, then to be crawlable this should not be there:

```
user-agent: *
disallow: /
```
or any variation that blocks a specific user agent like Google, Bing, etc.
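Rather than raw string matching, Python’s stdlib `urllib.robotparser` can evaluate whether the root path is crawlable from a robots.txt body, which also handles the per-user-agent variations mentioned above. A rough sketch (the function name and agent list are my own choices):

```python
from urllib.robotparser import RobotFileParser

def root_crawl_allowed(robots_txt: str, agents=("*", "Googlebot", "bingbot")) -> dict:
    """Return, per user agent, whether the site root may be crawled
    according to the given robots.txt body."""
    rp = RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return {agent: rp.can_fetch(agent, "/") for agent in agents}
```

A blanket `User-agent: * / Disallow: /` blocks every agent, while a group naming only one bot blocks just that bot and leaves the rest crawlable.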
@adriancooke I made the page for robots, let me know what you think of the parameters and scoring. https://gov-metadata.civichackingagency.org/?field=robots&level=1 https://gov-metadata.civichackingagency.org/profile/?domain=18f.gov#robots
@Narlotl I’m getting timeouts when I try to load those URLs.
We changed domains, it looks like we need to fix redirection. Here's the new site: https://scangov.org/?field=robots&level=1
@adriancooke closing this but let us know if there's anything not working as you think it should. Thank you for contributing to the project!