mbjones opened 4 months ago
Manually deployed some robots.txt files, tracking in our deployments list here: https://docs.google.com/spreadsheets/d/1NtF8DAZCg6eGKGY66ca2nYi2ftmVVGR3-0IHlAZKZCI/edit#gid=0
Added to metacat helm chart in PR 1893.
This adds a robots.txt file to the Metacat installation, along with a rewrite rule that redirects /robots.txt to its location.
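As a rough sketch of what such a rewrite rule could look like (the actual value names, annotations, service name, and port in the chart from PR 1893 may differ), an nginx ingress rule could route /robots.txt to the location where Metacat serves the file:

```yaml
# Hypothetical sketch only -- the real Metacat helm chart (PR 1893)
# may use different names and annotations.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: metacat
  annotations:
    # nginx-ingress rewrite: requests matching the path below are
    # rewritten to the path where Metacat actually serves the file.
    nginx.ingress.kubernetes.io/rewrite-target: /metacat/robots.txt
spec:
  rules:
    - http:
        paths:
          - path: /robots.txt
            pathType: Exact
            backend:
              service:
                name: metacat   # assumed service name
                port:
                  number: 8080  # assumed service port
```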
If the metacat property `sitemap.enabled=false` (the default setting for k8s metacat deployments), then robots.txt defaults to:

```
User-agent: *
Disallow: /
```

If `sitemap.enabled=true`, then robots.txt defaults to:

```
User-agent: *
Disallow: /<metacat.application.context>/d1/mn/v2/packages/
Sitemap: /sitemap_index.xml
```

...but the values for `User-agent:` and `Disallow:` can be customized via Values.yaml.
**Describe the feature you'd like**
Add a `robots.txt` file that is easily configured for production and test deployments. For production, the file should generally restrict access to the package service, allow everything else, and provide the sitemap link:
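For example, based on the defaults shown in the comment above (with `<metacat.application.context>` standing in for the deployment's context path), a production robots.txt would look something like:

```
User-agent: *
Disallow: /<metacat.application.context>/d1/mn/v2/packages/
Sitemap: /sitemap_index.xml
```

Note that the sitemaps protocol expects the `Sitemap:` directive to be an absolute URL, so a deployment would likely want to expand this to the full site URL.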
For testing, the file should restrict access to everything:
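For example, a test deployment could serve a robots.txt that disallows all crawling:

```
User-agent: *
Disallow: /
```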
**Is your feature request related to a problem? Please describe.**
Duplicated test datasets show up in Google Dataset search and flood the results, making it hard to find the real production datasets.
MetacatUI provides a searchable web interface that, in combination with a Metacat-provided sitemap.xml document, enables harvesters like Googlebot and others to index the site and all of its datasets. Generally we do not want those harvesters to index any test deployments, which typically contain bogus content. An example can be seen here:
**Considerations**
The `robots.txt` file must be served from the root of the domain, which may not match the MetacatUI deployment path. For example, for ADC the MetacatUI is installed at https://arcticdata.io/catalog, and so the robots.txt needs to go at the root.
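One way this could be handled in Kubernetes (a sketch; the host, service names, and ports are assumptions, not the actual ADC configuration) is to route the root-level /robots.txt path to the Metacat backend on the same host that serves MetacatUI:

```yaml
# Hypothetical sketch: serve /robots.txt from the domain root even
# though MetacatUI is deployed under a subpath on the same host.
spec:
  rules:
    - host: arcticdata.io
      http:
        paths:
          - path: /robots.txt     # must resolve at the domain root
            pathType: Exact
            backend:
              service:
                name: metacat     # assumed service name
                port:
                  number: 8080    # assumed service port
          - path: /catalog        # MetacatUI lives under a subpath
            pathType: Prefix
            backend:
              service:
                name: metacatui   # assumed service name
                port:
                  number: 80      # assumed service port
```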