NCEAS / metacatui

MetacatUI: A client-side web interface for DataONE data repositories
Apache License 2.0

default robots.txt to control harvesting #2272

Open mbjones opened 4 months ago

mbjones commented 4 months ago

Describe the feature you'd like

Add a robots.txt file that is easily configured for production and test deployments.

For production, the file should generally restrict access to the package service, but allow everything else, and provide the sitemap link:

User-agent: *
Disallow: /metacat/d1/mn/v2/packages/

For testing, the file should restrict access to everything:

User-agent: *
Disallow: /
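
A quick way to sanity-check the two rule sets above is to run them through Python's standard-library `urllib.robotparser`. This is only a sketch: the helper name `is_allowed`, the `Googlebot` agent string, and the `/view/example-id` and `example-id` paths are illustrative assumptions, not part of MetacatUI or Metacat.

```python
from urllib.robotparser import RobotFileParser

# The two rule sets proposed in this issue, copied verbatim.
PRODUCTION = """\
User-agent: *
Disallow: /metacat/d1/mn/v2/packages/
"""

TEST = """\
User-agent: *
Disallow: /
"""


def is_allowed(robots_txt: str, path: str, agent: str = "Googlebot") -> bool:
    """Return True if `agent` may fetch `path` under the given robots.txt text.

    Hypothetical helper for illustration; the agent name is an assumption.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    # Both paths are made-up examples, not real identifiers.
    paths = ("/metacat/d1/mn/v2/packages/example-id", "/view/example-id")
    for name, rules in (("production", PRODUCTION), ("test", TEST)):
        for path in paths:
            print(name, path, is_allowed(rules, path))
```

With the production rules, only the package service path should be reported as blocked; with the test rules, every path should be blocked.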

Is your feature request related to a problem? Please describe.

Duplicated test datasets show up in Google Dataset Search and flood the results, making it hard to find the real production datasets.

MetacatUI provides a searchable web interface which, in combination with a Metacat-provided sitemap.xml document, enables harvesters such as Googlebot to index the site and all of its datasets. We generally do not want those harvesters to index test deployments, because they mostly contain bogus content. An example can be seen here:


mbjones commented 4 months ago

Manually deployed some robots.txt files, tracking in our deployments list here:

artntek commented 2 months ago

Added to metacat helm chart in PR 1893.

This adds a robots.txt to the metacat installation, and adds a rewrite rule to redirect /robots.txt to its location.

If the metacat property sitemap.enabled=false (the default setting for k8s metacat deployment), then robots.txt defaults to:

User-agent: *
Disallow: /

If sitemap.enabled=true, then robots.txt defaults to:

User-agent: *
Disallow: /<metacat.application.context>/d1/mn/v2/packages/
Sitemap: /sitemap_index.xml

...but values for User-agent: and Disallow: can be customized via Values.yaml.
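
The default selection described above can be sketched in Python as follows. This is not the helm chart's actual template, just an illustration of the described logic; the function name `default_robots_txt` and the default application context of `metacat` are assumptions, and any overrides from Values.yaml are not modeled.

```python
def default_robots_txt(sitemap_enabled: bool, application_context: str = "metacat") -> str:
    """Return the default robots.txt body described in this comment.

    Hypothetical helper: `application_context` stands in for the
    <metacat.application.context> placeholder, and "metacat" is an assumed default.
    """
    lines = ["User-agent: *"]
    if sitemap_enabled:
        # Sitemap enabled: block only the package service and advertise the sitemap.
        lines.append(f"Disallow: /{application_context}/d1/mn/v2/packages/")
        lines.append("Sitemap: /sitemap_index.xml")
    else:
        # Sitemap disabled (the k8s default): block everything.
        lines.append("Disallow: /")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(default_robots_txt(sitemap_enabled=False))
    print(default_robots_txt(sitemap_enabled=True))
```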

rushirajnenuji commented 2 months ago