pdurbin closed this issue 1 year ago
Is https://github.com/IQSS/dataverse/issues/8936 related?
Since that issue was opened, I've been thinking that the problem it describes is also preventing Google from indexing all published datasets in repositories with over 50k datasets, like Harvard's.
Yes, but if you check any dataset from the sitemap with the tool above, it will also show as blocked.
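As a quick stand-in for that kind of check, here is a minimal sketch using Python's standard library robots.txt parser; the dataset URL is illustrative, not one taken from the sitemap:

# Minimal sketch: under a blanket "Disallow: /", any dataset page reads as blocked.
from urllib import robotparser

blocking_rules = """\
User-agent: *
Disallow: /
"""

rp = robotparser.RobotFileParser()
rp.parse(blocking_rules.splitlines())

dataset_url = "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EXAMPLE"
print(rp.can_fetch("Googlebot", dataset_url))  # False: the page is reported as blocked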
In https://guides.dataverse.org/en/5.13/installation/config.html#ensure-robots-txt-is-not-blocking-search-engines we recommend not letting Payara serve the robots.txt we ship in the war file (which blocks everything).
Instead, we ask installations to serve this robots.txt (a sketch of the web server configuration follows the file below): https://guides.dataverse.org/en/5.13/_downloads/3a5cd7a283eecd5e93289e30af713554/robots.txt
User-agent: *
# Note: In its current form, this sample robots.txt makes the site
# accessible to all the crawler bots (specified as "User-agent: *")
# It further instructs the bots to access and index the dataverse and dataset pages;
# it also tells them to stay away from all other pages (the "Disallow: /" line);
# and also not to follow any search links on a dataverse page.
# It is possible to specify different access rules for different bots.
# For example, if you only want to make the site accessible to Googlebot, but
# keep all the other bots away, uncomment the following two lines:
#Disallow: /
#User-agent: Googlebot
Allow: /$
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
# The following lines are for the facebook, twitter and linkedin preview bots:
Allow: /api/datasets/:persistentId/thumbnail
Allow: /javax.faces.resource/images/
# Comment out the following TWO lines if you DON'T MIND the bots crawling the search API links on dataverse pages:
Disallow: /dataverse/*?q
Disallow: /dataverse/*/search
Disallow: /
# Crawl-delay specification *may* be honored by *some* bots.
# It is *definitely* ignored by Googlebot (they never promise to
# recognize it either - it's never mentioned in their documentation)
Crawl-delay: 20
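For context on how installations typically serve the custom robots.txt above instead of the one bundled in the war file, here is a minimal sketch of an Apache httpd configuration, assuming Apache is fronting Payara; the filesystem path is illustrative, and the proxy exclusion is only needed if requests are otherwise proxied to Payara (see the installation guide linked above for the exact recommendation):

# Serve the custom robots.txt from disk rather than letting Payara answer.
Alias "/robots.txt" "/var/www/dataverse/robots.txt"
# Keep this request out of the proxy to Payara (must appear before the general ProxyPass):
ProxyPass "/robots.txt" !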
Will need to check with Leonid, but I think this sometimes gets shut off due to unconstrained crawling impacting our site (throttling, anyone?).
Crawl-delay: 20
This could be somewhat related to the crawling of facets and their intermittent slow performance, which slows down the homepage and requires a Solr restart.
The rules were changed to block everything on 6/25/23, possibly due to service instability during the community meeting. I've restored the robots.txt.PRESERVED file from that time, which has a longer crawl-delay:
User-agent: *
Disallow: /
User-agent: Googlebot
User-agent: soscan (+https://dataone.org/)
Allow: /$
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
Allow: /api/datasets/:persistentId/thumbnail
Allow: /javax.faces.resource/images/
Disallow: /dataverse/*?q
Disallow: /dataverse/*/search
Disallow: /dataset.xhtml?*&version=&q=
Disallow: /
Crawl-delay: 100
#sitemap: https://dataverse.harvard.edu/sitemap/sitemap.xml
# Created initially using: http://www.mcanerin.com/EN/search-engine/robots-txt.asp
# Verified using: http://tool.motoricerca.info/robots-checker.phtml
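As a quick sanity check on the restored rules above, here is a sketch using Python's standard library parser (note that Google itself uses longest-match rather than first-match rule selection, though both agree in this case); the dataset URL is illustrative and the rules string is an abridged copy of the file above:

# Sanity check: under the restored rules, Googlebot may fetch dataset pages,
# while bots not listed in the Googlebot/soscan group are still disallowed.
from urllib import robotparser

restored_rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
User-agent: soscan (+https://dataone.org/)
Allow: /$
Allow: /dataset.xhtml
Allow: /dataverse/
Allow: /sitemap/
Disallow: /
Crawl-delay: 100
"""

rp = robotparser.RobotFileParser()
rp.parse(restored_rules.splitlines())

dataset_url = "https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/EXAMPLE"
print(rp.can_fetch("Googlebot", dataset_url))      # True: dataset pages are allowed for Googlebot
print(rp.can_fetch("SomeOtherBot", dataset_url))   # False: unlisted bots are still told to stay away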
This seems like a problem...
... as it is probably leading to datasets not being indexed by Google, as explained at https://groups.google.com/g/dataverse-community/c/hL4nt-9GQBw/m/lK4twagWAgAJ and https://iqss.slack.com/archives/C010LA04BCG/p1689270974095009
We can use https://search.google.com/test/rich-results to test. Here's my dataset: