civichackingagency / scangov

Government digital experience monitor
https://scangov.org

Absence of some meta elements lowers grade but their presence is not necessary #99

Closed adriancooke closed 3 months ago

adriancooke commented 3 months ago

Describe the bug Hi Luke, I learned about your tool and its report on new.nsf.gov today. Two of the elements your tool requires are absent, causing a lower grade, but it’s not clear how their presence would improve anything.

To Reproduce Steps to reproduce the behavior:

  1. Go to https://gov-metadata.civichackingagency.org/profile/?domain=new.nsf.gov
  2. Click on Metadata
  3. Scroll down to <meta name="robots"> and <meta property="og:locale">
  4. See error “Missing”

Expected behavior Neither element should be required and their absence should not lower the score.


Additional context

1. Robots meta The robots metadata element is needed when you want to block content from being indexed in public search results, or when you want to fine-tune how search engines present it (e.g. nosnippet). However, these use cases are so contextual that a blanket requirement does not seem to make sense.

For example, if a public homepage such as new.nsf.gov contained <meta name="robots" content="index, follow">, it would be treated by search engines exactly as it is now: followed and indexed, because that is already their default behavior.

But let’s say I did want to prevent the page from being indexed: it still doesn’t make sense to require the metadata element unless you’re also checking HTTP headers, because a functionally equivalent option is to return an HTTP header X-Robots-Tag: otherbot: noindex, nofollow in the server response. The header is also the only way to tell a search engine not to index a non-HTML resource. So gov-metadata needs more information before it can conclude a site’s metadata is insufficient (i.e. that the absence causes harm).
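To make that concrete, here is a rough sketch (mine, not scangov’s code; the function names and the requests dependency are assumptions) of what a check would need to look at before reporting robots directives as missing:

```python
# Sketch only: a page "lacks" robots directives only if neither the
# <meta name="robots"> element nor the X-Robots-Tag response header
# carries one. Class and function names here are hypothetical.
from html.parser import HTMLParser

import requests  # assumed available; any HTTP client would do


class RobotsMetaParser(HTMLParser):
    """Collect the content values of <meta name="robots"> elements."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta" and (attrs.get("name") or "").lower() == "robots":
            self.directives.append(attrs.get("content") or "")


def robots_directives(url):
    """Return robots directives found in both the HTML and the HTTP headers."""
    response = requests.get(url, timeout=10)
    parser = RobotsMetaParser()
    parser.feed(response.text)
    directives = list(parser.directives)
    header = response.headers.get("X-Robots-Tag")
    if header:
        directives.append(header)
    return directives


# An empty list means "no explicit directives", which search engines
# already treat as "index, follow" -- it is not evidence of a problem.
print(robots_directives("https://new.nsf.gov/"))
```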

For this check to be valid, it would need to be configurable against a specific intention, such as a checkbox for “Site should not be public and indexable” which, if checked, would return an error if the corresponding robots meta element or HTTP header was absent. What is the reason for requiring <meta name="robots" content="index, follow"> to be present when that is already what search engines assume (whether we like it or not)? They have effectively decided that indexing is opt-out.
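As a sketch of what I mean (the flag name and function below are invented for illustration, not taken from scangov), such a check would only flag a mismatch between the declared intention and the directives actually found:

```python
# Hypothetical intent-aware check: only report an error when the
# directives conflict with what the site owner says they want.
def check_robots(directives, should_be_private=False):
    combined = " ".join(d.lower() for d in directives)
    if should_be_private and "noindex" not in combined:
        return "Site is marked non-public, but nothing tells crawlers noindex."
    if not should_be_private and "noindex" in combined:
        return "Site is meant to be public, but it is actively blocking indexing."
    return None  # no directives at all is fine: the defaults already apply
```

For new.nsf.gov, which has no directives and is meant to be public, this returns None instead of lowering the grade.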

2. OG locale Similarly, og:locale is documented by the protocol authors as optional metadata with an assumed default of en_US, so what is the rationale for reducing the site’s grade if its locale is in fact en_US?
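In other words (again just a sketch, not scangov’s logic), a grader could apply the documented default rather than flag the absence:

```python
# Hypothetical: apply the Open Graph default instead of penalizing.
def effective_og_locale(meta):
    """og:locale is optional; the protocol tells consumers to assume en_US."""
    return meta.get("og:locale", "en_US")
```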

Refs:

- Robots meta tag - Crawling directive
- Robots meta tag, data-nosnippet, and X-Robots-Tag specifications
- The Open Graph protocol

lukefretwell commented 3 months ago

Thank you @adriancooke for this thoughtful input. @Narlotl and I discussed robots and we're going to revamp how we treat that (see #29). We want to ensure that sites are not actively blocking robots. Let us know if you have thoughts on this in #29. Feel free to copy/paste specific thoughts from above into the comments there.

Could you add a new issue for locale so we can discuss that separately?

Closing this issue, as part of it is a dupe and part of it will (hopefully) be filed as a new issue.

adriancooke commented 3 months ago

@lukefretwell, @Narlotl thanks for following up. I created #103 for og:locale and added a comment to #29.