Closed — donsizemore closed this issue 1 year ago
Not sure what you're proposing. The robots.txt in the war file just does Disallow: /, while the sample mentioned at https://guides.dataverse.org/en/latest/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines does allow /sitemap/. Are you suggesting using the sample as the default in the war file, or something else?
(That section of the guides does mention that some indexers will look for the sitemap at /sitemap.xml, which the sample does not allow. Either a tweak adding Allow: /sitemap.xml or a note in the guide that this path is blocked might be useful.)
(I also see that there is a Sitemap: directive that is supposed to indicate where your sitemap is; I'm not sure whether it is used widely enough to also include in an updated robots.txt sample.)
I included the Sitemap: directive and also submitted the sitemap via Google Search Console's sitemaps page. I also added Allow: /sitemap/ so that crawlers reading robots.txt can access the file. Omitting Allow: /sitemap/ results in /sitemap/sitemap.xml generating a general HTTP error on the sitemaps page in Google Search Console.
# Accessible to all the crawler bots
User-agent: *
Sitemap: https://<dataverse URL>/sitemap/sitemap.xml
# Allow homepage
Allow: /$
# Allow dataset pages
Allow: /dataset.xhtml
# Allow Dataverse pages
Allow: /dataverse/
# Allow sitemaps
Allow: /sitemap/
Allow: /sitemap.xml$
# The following lines are for the Facebook/Twitter/LinkedIn preview bots:
Allow: /api/datasets/:persistentId/thumbnail
Allow: /logos/*/*
Allow: /javax.faces.resource/images/
# Disallow pages with jsessionid
Disallow: /*;jsessionid
# Disallow crawling via search API
Disallow: /dataverse/*?q
Disallow: /dataverse/*/search
# Disallow all other pages
Disallow: /
Crawl-delay: 20
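For anyone who wants to sanity-check a robots.txt like the sample above before deploying it, Python's stdlib urllib.robotparser can simulate a crawler's allow/deny decisions. A caveat: robotparser implements the original robots.txt spec and ignores wildcard patterns such as Allow: /$ and Disallow: /*;jsessionid, so this sketch (using a hypothetical demo.dataverse.org host) only exercises the plain path-prefix rules:

```python
from urllib.robotparser import RobotFileParser

# Plain-prefix subset of the sample robots.txt above; urllib.robotparser
# does not understand wildcard rules like "Allow: /$", so those are omitted.
ROBOTS = """\
User-agent: *
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
Disallow: /
"""

rp = RobotFileParser()
rp.parse(ROBOTS.splitlines())

base = "https://demo.dataverse.org"  # hypothetical installation URL

# The sitemap and dataset pages should be fetchable; any other path
# falls through to the trailing "Disallow: /".
print(rp.can_fetch("*", base + "/sitemap/sitemap.xml"))  # True
print(rp.can_fetch("*", base + "/loginpage.xhtml"))      # False
```

Dropping the Allow: /sitemap/ line flips the first check to False, which mirrors the failure Google Search Console reports when the sitemap path is blocked by robots.txt.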
/sitemap/ should be included in the robots.txt that is served. However, I think it belongs in the sample robots.txt at https://guides.dataverse.org/en/latest/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines, not in the robots.txt in the .war file.
Referencing the related Google Groups post: https://groups.google.com/g/dataverse-community/c/tjByZYsLmss
I've read the earlier issues regarding user-agents and IQSS's preference that each installation handle robots.txt customization per the documentation; however, I don't think allowing /sitemap/ would catch any installation's sense of privacy by surprise. I'm currently waiting on one particular search engine to find our updated robots.txt and allow itself to pull our updated sitemap - or at least I think I am.
Unless IQSS disagrees, I'll submit a PR?