IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Re-harvest from Borealis Repository #172

Open jggautier opened 2 years ago

jggautier commented 2 years ago

The harvesting client that harvested metadata from Scholars Portal isn't listed on the Manage Harvesting Clients page and isn't in the clients list returned by the API endpoint for listing clients (https://dataverse.harvard.edu/api/harvest/clients).

So I'm not able to manage that client, for example to re-harvest from Borealis so that the links behind the dataset titles lead users to the datasets instead of an error page.
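
For reference, the missing client can be double-checked against the native API with a short script like this (a minimal sketch assuming the `requests` package; the field names in the response are assumptions about the API's JSON shape):

```python
# List the harvesting clients the native API knows about, to confirm
# that the Scholars Portal client is really missing.
import requests

resp = requests.get("https://dataverse.harvard.edu/api/harvest/clients")
resp.raise_for_status()
# "harvestingClients" and "nickName" are assumed field names in the response.
for client in resp.json()["data"]["harvestingClients"]:
    print(client.get("nickName"), client.get("harvestUrl"))
```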

The client was listed until the following steps caused this bug:

The OAI set that needs to be harvested by the Harvard Dataverse Repository contains 8,237 records as of this writing, but https://dataverse.harvard.edu/dataverse/borealis_harvested includes only 7,401 records, which I think was the same number of records that the Dataverse collection had before I tried to delete the harvesting client.
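
One way to verify a set's record count independently is to page through ListIdentifiers and tally the headers. A standard-library sketch follows; the set name is hypothetical since the comment doesn't name it, and many OAI servers also report a completeListSize attribute on the resumption token, which would avoid paging entirely:

```python
# Count records in an OAI set by following resumption tokens.
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.openarchives.org/OAI/2.0/}"
BASE = "https://borealisdata.ca/oai"

url = BASE + "?verb=ListIdentifiers&metadataPrefix=oai_dc&set=SOME_SET"  # hypothetical set name
count = 0
while url:
    with urllib.request.urlopen(url) as r:
        root = ET.fromstring(r.read())
    count += len(root.findall(f".//{NS}header"))
    token = root.find(f".//{NS}resumptionToken")
    url = (BASE + "?verb=ListIdentifiers&resumptionToken=" + urllib.parse.quote(token.text)
           if token is not None and token.text else None)
print(count)
```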

sbarbosadataverse commented 1 year ago

What's the likelihood this issue will be fixed by the harvesting updates in progress? @mreekie @siacus We don't want to add this to the Dataverse Backlog for Harvard Dataverse if it may be fixed by those updates.

Thanks

cmbz commented 10 months ago

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

cmbz commented 10 months ago

2023/12/19: @jggautier and @landreev will follow up after the meeting. Sizing at a 10 initially.

landreev commented 10 months ago

After a brief review, we may have an idea of what caused the weird behavior described above during the attempt to delete the client, and with the collection view after that. But regardless of the exact details, the Scholars Portal client, and all the harvested datasets associated with it, were indeed deleted in the end. So the next step should be to create a brand new client to harvest from their current OAI server and give it a try.

jggautier commented 10 months ago

I tried harvesting from Borealis into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, but no records are showing up.

I was able to create a client at https://dataverse.harvard.edu/harvestclients.xhtml?dataverseId=1, so there's a row on the page's clients table, and when I press that row's button to start the harvest, the page shows the blue banner message telling me that the harvest started.

But the Last Run and Last Results columns for that row are empty when I expected to see "IN PROGRESS", even after I refresh my browser and check the clients table on other computers.

Maybe the harvesting job is failing for some reason. Could you tell what's happening @landreev?

landreev commented 10 months ago

OK, I'll take a look. But if it's not something I can figure out right away, it'll have to wait till next year.

landreev commented 10 months ago

I was able to start it from the Harvesting Clients page (showing "IN PROGRESS" now). Expired session, or something like that maybe? Seeing some results in the collection now. Let's see how it goes, how many failures we get, etc. (If there are too many, we could maybe try ddi instead?)

landreev commented 10 months ago

OK to change the title of the issue to "Re-Harvest from Borealis Repository", or something along these lines?

jggautier commented 10 months ago

Changing the title makes sense to me. Just changed it.

landreev commented 10 months ago

That's a nasty success-to-failure ratio. ☹️ Maybe some controlled vocabulary or metadata block mismatch between our sites makes harvesting in native json impossible? I'll try to take a closer look before the new year.

Deleting this client (and all the harvested content with it), then re-creating it with the DDI as the format, just to see how that goes could be an interesting experiment too.

landreev commented 10 months ago

About 700 of these errors are in the harvest log: `incorrect multiple for field productionPlace`. I'm assuming this means they are running a version of the citation block where this is still a single value-only field. (Makes sense, since they are on 5.13.) There must be more problems like this with other fields and/or blocks.
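
A harvest log can be tallied for messages like this to see which fields account for the failures (a sketch; the log file name and exact message format are assumptions):

```python
# Count "incorrect multiple for field X" errors per field in a harvest log.
import re
from collections import Counter

errors = Counter()
with open("harvest_borealis.log") as log:  # hypothetical log file name
    for line in log:
        m = re.search(r"incorrect multiple for field (\w+)", line)
        if m:
            errors[m.group(1)] += 1
for field, n in errors.most_common():
    print(f"{field}: {n}")
```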

jggautier commented 10 months ago

Bikramjit Singh, a system admin from Scholars Portal, wrote in the related email thread that Borealis plans to upgrade its Dataverse software early next year.

landreev commented 10 months ago

Good to know. It's still important to keep in mind that harvesting in native json is always subject to problems like this. For example, the problem I mentioned above - the specific field that we recognize as a multiple but the source instance does not - would not be an issue if we were harvesting DDI.

bikramj commented 10 months ago

Thank you. Tagging Borealis developers @JayanthyChengan and @lubitchv, if they can help with this.

jggautier commented 10 months ago

Harvard Dataverse was able to harvest 5,866 records into https://dataverse.harvard.edu/dataverse/borealis_harvested, and the table on the "Manage Harvesting Clients" page shows that 13,856 records failed to be harvested.

I'm tempted to delete this client and try harvesting Borealis's DDI-C metadata instead. @landreev, should I try that?

amberleahey commented 8 months ago

> Thank you. Tagging Borealis developers @JayanthyChengan and @lubitchv, if they can help with this.

Hi all -- we have several sets for OAI; the main ones should work:

- https://borealisdata.ca/oai?verb=ListRecords&metadataPrefix=oai_dc
- https://borealisdata.ca/oai?verb=ListIdentifiers&metadataPrefix=oai_dc
- https://borealisdata.ca/oai?verb=ListSets

Is this what is needed for the Harvard Dataverse harvesters?
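
For completeness, the available sets can also be enumerated programmatically with the ListSets verb linked above (a standard-library sketch; large set lists may be paged with a resumption token, which this skips):

```python
# Print the setSpec and setName of each OAI set Borealis exposes.
import urllib.request
import xml.etree.ElementTree as ET

NS = "{http://www.openarchives.org/OAI/2.0/}"
with urllib.request.urlopen("https://borealisdata.ca/oai?verb=ListSets") as r:
    root = ET.fromstring(r.read())
for s in root.findall(f".//{NS}set"):
    print(s.find(f"{NS}setSpec").text, "-", s.find(f"{NS}setName").text)
```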

landreev commented 8 months ago

@jggautier It looks like I missed your question here back in January - but yes, if you want to experiment with deleting the existing client + records and re-harvesting from scratch, please go ahead. Please note that it'll take some time, possibly some hours, to delete 6K records. Also, please run large harvests like this on dvn-cloud-app-2 specifically.

jggautier commented 8 months ago

Thanks @amberleahey. Yes, I'm going to try harvesting Borealis metadata using the DDI metadata format instead of the Dataverse native json format that I usually use.

I just told Harvard Dataverse to delete the client that was trying each week to harvest metadata from Borealis, and I'll check tomorrow since it'll take a while like @landreev wrote.

Then, if the client and all harvested metadata have been deleted, I'll make sure that I'm on dvn-cloud-app-2, create a new client without specifying a set so that metadata from all datasets published in Borealis is harvested into the collection at https://dataverse.harvard.edu/dataverse/borealis_harvested, tell Harvard Dataverse to harvest all metadata from Borealis, and see if it gets the metadata of all 20k datasets.
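
That delete-and-recreate sequence could look roughly like this via the harvesting clients API (a sketch; the client nickname, token placeholder, and JSON field names are assumptions based on the Dataverse guides, and a superuser API token is required):

```python
# Delete the old Borealis client, then re-create it with DDI as the format.
import requests

URL = "https://dataverse.harvard.edu/api/harvest/clients/borealis"  # hypothetical nickname
HEADERS = {"X-Dataverse-key": "SUPERUSER_API_TOKEN"}

# Deleting the client also deletes its harvested records; this can take hours.
requests.delete(URL, headers=HEADERS).raise_for_status()

new_client = {
    "dataverseAlias": "borealis_harvested",
    "type": "oai",
    "harvestUrl": "https://borealisdata.ca/oai",
    "metadataFormat": "oai_ddi",
    "set": "",          # no set: harvest everything
    "style": "dataverse",
}
requests.post(URL, headers=HEADERS, json=new_client).raise_for_status()
```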

jggautier commented 8 months ago

The client isn't listed in the table on the Manage Harvesting Clients page anymore, and all records were deleted when I checked this morning.

I made sure I was on dvn-cloud-app-2.lib.harvard.edu, created a new client without specifying a set, using the oai_ddi format and the "Dataverse v4+" "Archive Type", and told Harvard Dataverse to start harvesting.

Records are being added to https://dataverse.harvard.edu/dataverse/borealis_harvested. So far so good!

I'll check tomorrow to see how many of the 20,093 datasets were harvested.

jggautier commented 8 months ago

Hmmm, the Manage Harvesting Clients page says that 19,207 records were harvested and 453 records failed (screenshot taken 2024-03-01 at 5:56 PM).

But there are only 4,777 records in https://dataverse.harvard.edu/dataverse/borealis_harvested. I'm not sure where the clients page gets the 19,207 number from. @landreev, any ideas?

I haven't tried to see why most records failed to be harvested, which I might do by comparing the oai_ddi metadata of records that were harvested to the oai_ddi metadata of records that weren't.

Maybe there's another way to get more info about the failures?
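
One way to do that comparison is to pull the oai_ddi record for individual identifiers and diff them (a sketch; the identifier is made up):

```python
# Fetch the DDI metadata for a single record via the OAI GetRecord verb.
import urllib.request

url = ("https://borealisdata.ca/oai?verb=GetRecord&metadataPrefix=oai_ddi"
       "&identifier=doi:10.5683/SP3/EXAMPLE")  # hypothetical identifier
print(urllib.request.urlopen(url).read().decode("utf-8"))
```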

landreev commented 7 months ago

I wasn't able to tell what was up with the mismatched counts above right away. Going to take another, closer look.

landreev commented 7 months ago

As always, the simplest explanation ends up being the correct one. The extra 15K records were successfully harvested and are in the database, but they didn't get indexed in solr. The reason they didn't get indexed was that solr apparently was overwhelmed and started dropping requests. I am not sure yet whether it was overwhelmed by this very harvest - by having to index so many records in a row in such a short period of time - or if it was having trouble for unrelated reasons (like bots crawling collection pages). I have some suspicions that it may indeed be the former.

One way or another, we do need an extra mechanism for monitoring the database for unindexed datasets (harvested or local) and reindexing them after the fact. That will be handled outside of this issue.

I am reindexing these extra Borealis records now, so the numbers showing on the collection page should be growing. (but I'm going through them slowly/sleeping between datasets, to be gentle on solr)
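
That slow pass might look something like this (a sketch assuming an admin index endpoint that re-indexes a dataset by persistent ID; the DOI list and sleep interval are illustrative):

```python
# Re-index unindexed harvested datasets one at a time, pausing between
# calls so solr isn't flooded with indexing requests.
import time
import requests

unindexed_dois = ["doi:10.5683/SP3/AAAA", "doi:10.5683/SP3/BBBB"]  # from the database
for doi in unindexed_dois:
    requests.get("http://localhost:8080/api/admin/index/dataset",
                 params={"persistentId": doi})
    time.sleep(2)  # be gentle on solr
```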

landreev commented 7 months ago

I stopped it after 4K datasets. But we will get it reindexed eventually.

landreev commented 7 months ago

The status of this effort: the actual harvesting of their records is working OK now, with a fairly decent success-to-failure ratio using the ddi format. The problem area is indexing - the harvested records do not all get indexed in real time, and therefore do not all show up in the collection. This issue is outside of the harvesting framework and has to do with the general solr/indexing issues we are having in production (new investigation issue IQSS/dataverse#10469), but its effect is especially noticeable in this scenario, an initial harvest of a very large collection (thousands of datasets, 20K+ in this case) - it is simply no longer possible to quickly index that many datasets in a row.

I got a few more Ks of the Borealis datasets indexed, but, again, had to stop since it was clearly causing solr issues.

jggautier commented 6 months ago

Thanks! I think this might be the case with at least one other Dataverse installation that Harvard Dataverse harvests from.

There are 270 unique records in the "all_items_dataverse_mel" set of the client named icarda (https://data.mel.cgiar.org/oai). And the copy of the database I use says that there are 270 datasets in the collection that we harvest those records into, although that copy hasn't been updated in about a month.

But only 224 records appear in that collection in the UI and from Search API results.
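
The indexed count for a collection can be checked directly against the Search API's total_count (a sketch; the collection alias here is a guess, substitute the real one):

```python
# Ask the Search API how many datasets are actually indexed in a collection.
import requests

resp = requests.get(
    "https://dataverse.harvard.edu/api/search",
    params={"q": "*", "subtree": "mel_harvested",  # hypothetical alias
            "type": "dataset", "per_page": 1},
)
print(resp.json()["data"]["total_count"])
```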

cmbz commented 4 months ago

2024/07/10

jggautier commented 3 months ago

We haven't told Harvard Dataverse to update the records it has from Borealis Repository. I think @landreev and others are continuing work on indexing improvements that will help Harvard Dataverse harvest more records, or more specifically ensure that those records appear on search pages. So harvesting, and updating the records that Harvard Dataverse has already harvested from any repository, has been put on hold.

I think the same is true for all GitHub issues listed in https://github.com/IQSS/dataverse-pm/issues/171 that are about ensuring that Harvard Dataverse is able to harvest all records from the repositories it harvests from and would like to start harvesting from.

@landreev I hope you don't mind that I defer to you about the status of that indexing work.

cmbz commented 3 months ago

Assigning you both @landreev @jggautier to monitor this issue. Thanks.

jggautier commented 1 month ago

Like other GitHub issues about Harvard Dataverse harvesting from other repositories, this issue is on hold pending work being done to improve how Dataverse harvests.