IQSS / dataverse

Open source research data repository software
http://dataverse.org

allow /sitemap/ in bundled robots.txt #8329

Closed: donsizemore closed this issue 1 year ago

donsizemore commented 2 years ago

I've read the earlier issues regarding user-agents and IQSS' preference that each installation handle robots.txt customization per the documentation, however:

I'm currently waiting on one particular search engine to find our updated robots.txt and allow itself to pull our updated sitemap - or at least I think I am.

Unless IQSS disagrees, I'll submit a PR?

qqmyers commented 2 years ago

Not sure what you're proposing. The robots.txt in the war file just does Disallow: / and the sample mentioned at https://guides.dataverse.org/en/latest/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines does allow /sitemap/. Are you suggesting using the sample as the default in the war? Or?

(That section of the guides does mention that some indexers will look for it at /sitemap.xml, which is not allowed in the sample - a tweak to Allow: /sitemap.xml, or a change to the guide noting that it is blocked, might be useful.)

(I also see that there is a Sitemap: directive that is supposed to indicate where your sitemap is - not sure if that is used enough to also include in an updated robots.txt sample?)
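For reference, the blanket-disallow robots.txt that ships in the war file amounts to just:

User-agent: *
Disallow: /

and the two tweaks floated above would look roughly like this (the URL is a placeholder for your installation):

Allow: /sitemap.xml
Sitemap: https://<your Dataverse URL>/sitemap/sitemap.xml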

eunices commented 2 years ago

Included the Sitemap: directive and also submitted the sitemap via Google Search Console's sitemap page.

Also added Allow: /sitemap/ so that crawlers can access the sitemap file. Leaving out Allow: /sitemap/ results in /sitemap/sitemap.xml generating a general HTTP error on the sitemap page in Google Search Console.

# Accessible to all the crawler bots
User-agent: *

Sitemap: https://<dataverse URL>/sitemap/sitemap.xml

# Allow homepage
Allow: /$
# Allow dataset pages
Allow: /dataset.xhtml
# Allow Dataverse pages
Allow: /dataverse/
# Allow sitemaps
Allow: /sitemap/
Allow: /sitemap.xml$

# The following lines are for the Facebook/Twitter/Linkedin preview bots:
Allow: /api/datasets/:persistentId/thumbnail
Allow: /logos/*/*
Allow: /javax.faces.resource/images/

# Disallow pages with jsessionid
Disallow: /*;jsessionid
# Disallow crawling via search API
Disallow: /dataverse/*?q
Disallow: /dataverse/*/search
# Disallow all other pages
Disallow: /

Crawl-delay: 20
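To sanity-check rules like these before deploying them, Python's standard-library urllib.robotparser is handy. A minimal sketch, with a placeholder sitemap URL; note two caveats: robotparser only does plain prefix matching, so the Google-style wildcards (* and $) in the fuller sample above are not interpreted, and it treats a blank line after User-agent as the end of that rule group, so the rules are kept contiguous here:

from urllib.robotparser import RobotFileParser

# Contiguous rule group: urllib.robotparser treats a blank line after
# User-agent as the end of that group, so no blank lines between rules.
rules = [
    "User-agent: *",
    "Allow: /sitemap/",
    "Disallow: /",
    "Crawl-delay: 20",
    "Sitemap: https://demo.dataverse.org/sitemap/sitemap.xml",  # placeholder URL
]

rp = RobotFileParser()
rp.parse(rules)

# robotparser returns the first matching rule in order, so the
# Allow: /sitemap/ line is checked before the catch-all Disallow: /
print(rp.can_fetch("mybot", "/sitemap/sitemap.xml"))  # True
print(rp.can_fetch("mybot", "/oai"))                  # False (falls through to Disallow: /)
print(rp.site_maps())       # ['https://demo.dataverse.org/sitemap/sitemap.xml'] (Python 3.8+)
print(rp.crawl_delay("mybot"))  # 20

This directly mirrors the observation above: without the Allow: /sitemap/ line, /sitemap/sitemap.xml falls through to Disallow: / and is blocked.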

(screenshot attached)

/sitemap/ should be allowed in the robots.txt that is served. However, I think it should go in the sample robots.txt at https://guides.dataverse.org/en/latest/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines, not in the robots.txt bundled in the .war file.
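Either way, it is easy to confirm which robots.txt an installation is actually serving (the customized one versus the blanket-disallow one from the war file). A quick fetch sketch; the hostname is a placeholder:

import urllib.request

# Placeholder hostname; substitute your installation's address.
url = "https://dataverse.example.edu/robots.txt"
with urllib.request.urlopen(url) as resp:
    print(resp.read().decode("utf-8"))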

For reference, the related Google Groups post: https://groups.google.com/g/dataverse-community/c/tjByZYsLmss