IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, tracked at https://github.com/orgs/IQSS/projects/34

robots.txt disallowing all, preventing crawling by Google etc. #227

Closed: pdurbin closed this issue 1 year ago

pdurbin commented 1 year ago

This seems like a problem...

```
$ curl https://dataverse.harvard.edu/robots.txt
User-agent: *
Disallow: /
```

... as it is probably leading to datasets not being indexed by Google, as explained at https://groups.google.com/g/dataverse-community/c/hL4nt-9GQBw/m/lK4twagWAgAJ and https://iqss.slack.com/archives/C010LA04BCG/p1689270974095009

We can use https://search.google.com/test/rich-results to test. Here's my dataset:

[Screenshot: Google Rich Results test result for the dataset, taken 2023-07-14]
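
Not part of the original comment, but a minimal sketch of the same check done programmatically, assuming Python's standard-library robots.txt parser; the DOI in the URL is a made-up placeholder.

```python
# Minimal sketch: ask Python's standard robots.txt parser whether Googlebot
# may fetch a dataset page under the live robots.txt.
# The persistentId below is a hypothetical placeholder, not a real dataset.
import urllib.robotparser

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://dataverse.harvard.edu/robots.txt")
rp.read()

dataset_url = ("https://dataverse.harvard.edu/dataset.xhtml"
               "?persistentId=doi:10.7910/DVN/EXAMPLE")
# With "User-agent: *" / "Disallow: /" in effect this prints False,
# i.e. crawlers are told not to fetch any dataset page.
print(rp.can_fetch("Googlebot", dataset_url))
```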

jggautier commented 1 year ago

Is https://github.com/IQSS/dataverse/issues/8936 related?

Since that issue was opened, I've been thinking that the problem it describes has also been preventing Google from indexing all published datasets in repositories with over 50k datasets, like Harvard's.

pdurbin commented 1 year ago

Yes, but if you check any dataset from the sitemap with the tool above, it will also show as blocked.
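
A minimal sketch (not from the original comment) of the same check run against the sitemap, assuming the sitemap lives at the location referenced in the robots.txt quoted later in this thread and is a flat list of page URLs:

```python
# Minimal sketch: check a sample of sitemap URLs against the live robots.txt.
# The sitemap location is an assumption based on the commented-out "sitemap:"
# line in the robots.txt below; adjust it for other installations.
import urllib.request
import urllib.robotparser
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://dataverse.harvard.edu/sitemap/sitemap.xml"
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://dataverse.harvard.edu/robots.txt")
rp.read()

with urllib.request.urlopen(SITEMAP_URL) as resp:
    tree = ET.parse(resp)

urls = [loc.text for loc in tree.findall(".//sm:loc", NS)]
sample = urls[:100]
blocked = [u for u in sample if not rp.can_fetch("Googlebot", u)]
print(f"{len(blocked)} of the first {len(sample)} sitemap URLs are blocked for Googlebot")
```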

pdurbin commented 1 year ago

In https://guides.dataverse.org/en/5.13/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines we recommend not letting Payara serve the robots.txt we ship in the war file (which blocks everything).

Instead, we ask installations to serve up this robots.txt: https://guides.dataverse.org/en/5.13/_downloads/3a5cd7a283eecd5e93289e30af713554/robots.txt

```
User-agent: *
# Note: In its current form, this sample robots.txt makes the site
# accessible to all the crawler bots (specified as "User-agent: *")
# It further instructs the bots to access and index the dataverse and dataset pages;
# it also tells them to stay away from all other pages (the "Disallow: /" line);
# and also not to follow any search links on a dataverse page.
# It is possible to specify different access rules for different bots.
# For example, if you only want to make the site accessed by Googlebot, but
# keep all the other bots away, un-comment out the following two lines:
#Disallow: /
#User-agent: Googlebot
Allow: /$
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
# The following lines are for the facebook, twitter and linkedin preview bots:
Allow: /api/datasets/:persistentId/thumbnail
Allow: /javax.faces.resource/images/
# Comment out the following TWO lines if you DON'T MIND the bots crawling the search API links on dataverse pages:
Disallow: /dataverse/*?q
Disallow: /dataverse/*/search
Disallow: /
# Crawl-delay specification *may* be honored by *some* bots.
# It is *definitely* ignored by Googlebot (they never promise to
# recognize it either - it's never mentioned in their documentation)
Crawl-delay: 20
```
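
A minimal sketch (not part of the original comment) for checking whether the robots.txt a site actually serves matches the sample from the guides; both URLs are the ones quoted above.

```python
# Minimal sketch: compare the robots.txt the site serves with the sample
# published in the installation guides.
import urllib.request

def fetch(url: str) -> str:
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")

live = fetch("https://dataverse.harvard.edu/robots.txt")
recommended = fetch(
    "https://guides.dataverse.org/en/5.13/_downloads/"
    "3a5cd7a283eecd5e93289e30af713554/robots.txt"
)

if live.strip() == recommended.strip():
    print("Live robots.txt matches the recommended sample.")
else:
    print("Live robots.txt differs from the recommended sample:")
    print(live)
```
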
kcondon commented 1 year ago

Will need to check with Leonid, but I think this sometimes gets shut off due to unconstrained crawling impacting our site (throttling, anyone?).

> Crawl-delay specification *may* be honored by *some* bots.
> It is *definitely* ignored by Googlebot (they never promise to
> recognize it either - it's never mentioned in their documentation)
> Crawl-delay: 20

This could be somewhat related to the crawling of facets and to the intermittent slow performance that slows down the homepage and requires a Solr restart.

kcondon commented 1 year ago

These were changed to block everything on 6/25/23, possibly due to service instability during the community meeting. I've restored the robots.txt.PRESERVED file from that time, which has a longer crawl-delay:

```
User-agent: *
Disallow: /
User-agent: Googlebot
User-agent: soscan (+https://dataone.org/)
Allow: /$
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
Allow: /api/datasets/:persistentId/thumbnail
Allow: /javax.faces.resource/images/
Disallow: /dataverse/*?q
Disallow: /dataverse/*/search
Disallow: /dataset.xhtml?*&version=&q=
Disallow: /
Crawl-delay: 100
#sitemap: https://dataverse.harvard.edu/sitemap/sitemap.xml
# Created initially using: http://www.mcanerin.com/EN/search-engine/robots-txt.asp
# Verified using: http://tool.motoricerca.info/robots-checker.phtml
```
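
Not part of the original comment, but a minimal sketch of how this restored file treats different crawlers, using an abbreviated copy of the rules above and Python's standard parser. That parser uses prefix matching and first-match ordering rather than Google's wildcard/longest-match extensions, so this is only a rough check; the DOI is a made-up placeholder.

```python
# Minimal sketch: under these rules Googlebot may still reach dataset pages,
# while every other crawler is told to stay away entirely.
# The rules below are an abbreviated copy of the file quoted above.
import urllib.robotparser

restored = """\
User-agent: *
Disallow: /
User-agent: Googlebot
User-agent: soscan (+https://dataone.org/)
Allow: /$
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
Disallow: /
Crawl-delay: 100
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(restored.splitlines())

dataset_url = ("https://dataverse.harvard.edu/dataset.xhtml"
               "?persistentId=doi:10.7910/DVN/EXAMPLE")
print(rp.can_fetch("Googlebot", dataset_url))     # True: dataset pages stay crawlable
print(rp.can_fetch("SomeOtherBot", dataset_url))  # False: falls back to "Disallow: /"
```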