IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Spike: Get the prod. archive fully reindexed by Google, while mitigating the load on the servers from crawling by the bot #228

Open landreev opened 1 year ago

landreev commented 1 year ago

Googlebot crawling has lately been generating a surprising amount of load and causing real problems. To mitigate this load, we've been experimenting with limiting or stopping bot access to the holdings while we look for more efficient ways of feeding the metadata to them. This is now causing problems of its own: Google appears to have started dropping some previously indexed datasets from searches (not just taking longer to index newly published content, as was intended). So it is somewhat urgent to get everything indexed again while keeping the servers alive.

There's some overlap with #222, as I'm specifically trying to feed the schema.org metadata exports to Google.
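For reference, a minimal sketch of how the schema.org export for a single dataset could be addressed, assuming the standard Dataverse native API export endpoint (`/api/datasets/export` with `exporter=schema.org`). The helper name `schema_org_export_url` is hypothetical, and this only builds the URL; it is not necessarily how the metadata ends up being fed to Google.

```python
from urllib.parse import urlencode

# Assumption: the installation exposes the standard Dataverse export API.
BASE_URL = "https://dataverse.harvard.edu"


def schema_org_export_url(persistent_id: str, base_url: str = BASE_URL) -> str:
    """Build the URL of the schema.org (JSON-LD) export for one dataset.

    Hypothetical helper for illustration; it does not perform any request.
    """
    query = urlencode({"exporter": "schema.org", "persistentId": persistent_id})
    return f"{base_url}/api/datasets/export?{query}"


if __name__ == "__main__":
    print(schema_org_export_url("doi:10.7910/DVN/D24VWO"))
```

Fetching that URL should return the JSON-LD document for a published dataset, which is the same metadata a crawler would otherwise have to extract from the dataset page.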

landreev commented 1 year ago

Google is in the process of reindexing the prod. archive. I'm going to keep an eye on the datasets that were specifically reported; if they don't get reindexed in the next couple of days, I'll force-request the bot to come crawl them.

landreev commented 1 year ago

I got the dataset https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/D24VWO re-crawled by Googlebot repeatedly during the last couple of days. Unfortunately, it's still not showing up prominently in the Google search results when I search for the title. Getting a page reindexed in their search engine can take time (they don't give any guarantees about how long), but I am a little bit worried about it. If this and a few other datasets like it don't start showing up in searches in a few days, I will read their documentation some more and try to figure out how to address this.

cmbz commented 1 year ago

2023/09/21: @landreev I added to sprint ready with tentative size of 3. Please resize as needed for this sprint. Also, I changed the title to indicate that it's a spike/investigation.

pdurbin commented 1 year ago

@landreev is it possible we're suffering from this bug for some of our datasets?

landreev commented 1 year ago

Hmm. That's another issue I was not aware of (thank you for mentioning it). But it doesn't look like the sitemap is the issue in our case: the bots appear to be reading it, and they appear to be responsive to what's in it. If I change the date for a dataset in it, they appear to come and get it, not instantly, but fairly quickly. The datasets I'm keeping an eye on have been recrawled, but are still not appearing in searches.

(It would be weird if they kept using sitemaps with >50k entries for crawling but did not index the crawled content. Anyway, I clearly need to keep reading up on it.)
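For context on the >50k question: the sitemap protocol caps a single sitemap file at 50,000 URLs, and larger sites are expected to split the list across several files tied together by a sitemap index. Below is a rough sketch of that split, with a `lastmod` date per entry (the date the bots seem to react to). This is an illustration only, not the Dataverse sitemap code; the function names and example URLs are made up.

```python
from xml.etree import ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"
MAX_URLS_PER_FILE = 50_000  # per-file limit in the sitemap protocol


def chunk(entries, size=MAX_URLS_PER_FILE):
    """Split a list of (loc, lastmod) pairs into lists of at most `size`."""
    return [entries[i:i + size] for i in range(0, len(entries), size)]


def build_sitemap(entries):
    """Render one <urlset> document from (loc, lastmod) pairs."""
    urlset = ET.Element("urlset", xmlns=SITEMAP_NS)
    for loc, lastmod in entries:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = loc
        ET.SubElement(url, "lastmod").text = lastmod  # W3C date, e.g. 2023-10-24
    return ET.tostring(urlset, encoding="unicode")


def build_sitemap_index(sitemap_urls):
    """Render a <sitemapindex> document pointing at the per-chunk files."""
    index = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for loc in sitemap_urls:
        sitemap = ET.SubElement(index, "sitemap")
        ET.SubElement(sitemap, "loc").text = loc
    return ET.tostring(index, encoding="unicode")


if __name__ == "__main__":
    # Hypothetical dataset URLs, for illustration only.
    entries = [(f"https://example.org/dataset/{i}", "2023-10-24")
               for i in range(120_001)]
    parts = chunk(entries)
    index = build_sitemap_index(
        [f"https://example.org/sitemap/sitemap{n}.xml"
         for n in range(1, len(parts) + 1)]
    )
    print(len(parts), "sitemap files;", len(index), "bytes of index XML")
```

Bumping the `lastmod` value for one entry and regenerating the corresponding file is the kind of change described above that gets a dataset page re-crawled.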

landreev commented 1 year ago

This is the dataset mentioned earlier: the one somebody complained about specifically, which in turn prompted opening this dedicated issue.
[Screenshot: Screen Shot 2023-10-24 at 2 51 23 PM]

cmbz commented 10 months ago

2024/01/03: Moved to waiting status during kickoff; need to wait for a while to review.

cmbz commented 4 months ago

2024/07/10