IQSS / dataverse.harvard.edu

Custom code for dataverse.harvard.edu and an issue tracker for the IQSS Dataverse team's operational work, for better tracking on https://github.com/orgs/IQSS/projects/34

Re-harvesting datasets from Roper Center #204

Open jggautier opened 1 year ago

jggautier commented 1 year ago

Managers of the Roper Center for Public Opinion Research emailed to let us know that they now make their dataset metadata available over OAI-PMH. See https://help.hmdc.harvard.edu/Ticket/Display.html?id=330637.

But when I tested harvesting the records, Demo Dataverse wasn't able to harvest any:

[Screenshot: Screen Shot 2022-12-03 at 3 15 48 PM]

When I created the harvesting client, I selected "Generic OAI Archive" for the "Archive Type".
[Screenshot: Screen Shot 2022-12-02 at 12 53 06 PM]

I tried the "Roper Archive" option, too, but that didn't work either.
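
If it helps when we retry: the same client could also be created without the UI, via the Harvesting Clients API. The sketch below is just my rough reading of the API guide, so the field names (and whatever the API-side equivalent of the UI's "Archive Type" is) should be double-checked there; the nickname, collection alias, API token, and OAI base URL are placeholders.

```python
# Rough sketch only: creating the harvesting client via the Harvesting Clients
# API instead of the UI. Field names are my reading of the Dataverse API guide
# and should be verified there; the nickname, collection alias, API token, and
# OAI base URL below are placeholders to adjust.
import requests

SERVER = "https://demo.dataverse.org"            # or the Harvard repository
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx"   # superuser API token

client = {
    "nickName": "roper",                         # client nickname (placeholder)
    "dataverseAlias": "roper",                   # collection to harvest into
    "harvestUrl": "https://api.ropercenter.org/prod/api/oai2",  # Roper's OAI-PMH base URL
    "archiveUrl": "https://ropercenter.cornell.edu",
    "archiveDescription": "Harvested from the Roper Center for Public Opinion Research.",
    "metadataFormat": "oai_dc",
}

r = requests.post(
    f"{SERVER}/api/harvest/clients/{client['nickName']}",
    headers={"X-Dataverse-key": API_TOKEN},
    json=client,
)
print(r.status_code, r.text)
```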

I let the folks at Roper know that the Dataverse development team is working on improving how Dataverse harvests using OAI-PMH, and that once those improvements made it to the Harvard repository and Demo Dataverse, I would try to harvest again.

I also asked them what we should do about the stale records in the Harvard repository (https://dataverse.harvard.edu/dataverse/roper) whose links lead to error pages. As with the stale ICPSR records (https://github.com/IQSS/dataverse.harvard.edu/issues/63), people who find these Roper datasets and realize that the links don't work could still go to Roper's website (or even a general search engine) and search by the dataset's title.

So maybe we could leave them there until we're able to re-harvest them using OAI-PMH.

Someone at Roper asked in the email thread if, in the meantime, we're able to make the links redirect to the dataset pages:

> It looks like the links were directed at our legacy server which has been replaced. From what I can see, the links are going through a resolver on your end, so for example https://dataverse.harvard.edu/dataset.xhtml?persistentId=hdl:1902.4/GBSSLT62-CQ266 will end up at https://ropercenter.cornell.edu//CFIDE/cf/action/catalog/abstract.cfm?archno=GBSSLT62-CQ266 (old URL)
>
> Can your resolver point to a different URL prefix? https://ropercenter.cornell.edu/ipoll/study/GBSSLT62-CQ266 (new URL)
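
For what it's worth, the mapping they're describing looks like a straight prefix swap keyed on the archive number. Purely as an illustration (the two prefixes are taken from their example above):

```python
# Illustration only: rewriting Roper's old catalog URLs to the new ipoll
# pattern, using the prefixes from the example above. The archive number
# (e.g. GBSSLT62-CQ266) is the part that carries over unchanged.
OLD_PREFIX = "https://ropercenter.cornell.edu//CFIDE/cf/action/catalog/abstract.cfm?archno="
NEW_PREFIX = "https://ropercenter.cornell.edu/ipoll/study/"

def rewrite(old_url: str) -> str:
    if not old_url.startswith(OLD_PREFIX):
        raise ValueError(f"unexpected URL: {old_url}")
    return NEW_PREFIX + old_url[len(OLD_PREFIX):]

print(rewrite(OLD_PREFIX + "GBSSLT62-CQ266"))
# https://ropercenter.cornell.edu/ipoll/study/GBSSLT62-CQ266
```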

Or maybe we could remove them sooner (e.g. using the destroy dataset API endpoint)?
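
If we did go the destroy route, my understanding is that it's the superuser-only destroy endpoint in the native API, called per dataset by persistent ID. A sketch (the handle is just the example from this thread, and the token is a placeholder):

```python
# Sketch only: destroying a single harvested dataset record by persistent ID
# via the native API's destroy endpoint (superuser API token required).
# The handle below is just the example PID from this thread.
import requests

SERVER = "https://dataverse.harvard.edu"
API_TOKEN = "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxx"   # superuser API token
pid = "hdl:1902.4/GBSSLT62-CQ266"

r = requests.delete(
    f"{SERVER}/api/datasets/:persistentId/destroy/",
    params={"persistentId": pid},
    headers={"X-Dataverse-key": API_TOKEN},
)
print(r.status_code, r.text)
```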

So for now I plan to:

Definition of done: We're able to harvest the metadata from all datasets in Roper's OAI-PMH feed, and the stale records in https://dataverse.harvard.edu/dataverse/roper have been removed.

jggautier commented 1 year ago

Folks at the Roper Center let me know that they changed things on their end so that the links Harvard Dataverse has for Roper datasets (in the collection at https://dataverse.harvard.edu/dataverse/roper) now lead to the datasets, instead of to error pages like they did earlier this year.

I haven't tried again to use Roper Center's OAI-PMH feed to harvest. I'll try again today or next week and report here and in the email thread with the folks from Roper.

landreev commented 1 year ago

This is cool! TBH, I gave up on the Roper records we have in the database a while ago; I just assumed they were useless. They are most likely way out of date, even if some are resolving now. Ideally, we do want to drop that collection and reharvest everything, if they have a functioning OAI interface. But there is no guarantee we will be able to process their records on the first try (so there's a chance that if we try that, we'll end up with fewer useful records than we have now...). So what we should do is probably start a harvest of their holdings on one of the test boxes - dataverse-internal maybe? - and see how that works.

jggautier commented 1 year ago

Ah okay. I'm not able to start a harvest on one of the test boxes. I was going to use Demo Dataverse, but I won't now.

It sounds better to me if someone else uses a test box. Whoever can do that will also be better able to figure out what went wrong if something does go wrong.

And thinking more about it, it's probably better that anyone who continues to work on this waits until @sbarbosadataverse can prioritize it in the "Harvard Dataverse Repository Instance" column on the Dataverse Global backlog.

landreev commented 1 year ago

OK, I'll do that.

jggautier commented 1 year ago

Gene Wang from the Roper Center has been following up regularly about this. I can let him know we haven't looked into it any further yet. But I think it would be helpful if we could say when we might try harvesting from them, even if it's not right away. Is that possible?

landreev commented 1 year ago

OK, I'll do it (an experimental harvest) this week, maybe even today. dataverse-internal is really not a good server for that (it's being used for testing PRs and needs to be restarted constantly), but I'm thinking of trying it on the perf cluster. Will post any updates here.

landreev commented 1 year ago

Just deleting all the old, stale Roper records from the prod. database is going to be a little non-trivial. I've experimented with that a bit this week on the perf cluster (using a copy of the prod. db there). If you do it the supported way, through the harvesting clients panel, our application attempts to delete all the records at once, and that's a bulky transaction with 20+K datasets. I'd like to avoid having to delete them one by one, so I'm figuring that part out. Their OAI server is not working properly as of today; I'm talking to an engineer at Roper via the linked RT.
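
Either way, a useful first step is enumerating what's actually sitting in the harvested Roper collection. A sketch using the Search API scoped to the collection (assuming the collection alias is `roper`):

```python
# Sketch: list the persistent IDs of the datasets currently in the harvested
# Roper collection, via the Search API scoped with subtree=roper, paging
# through the results 100 at a time.
import requests

SERVER = "https://dataverse.harvard.edu"
pids, start = [], 0
while True:
    r = requests.get(
        f"{SERVER}/api/search",
        params={"q": "*", "type": "dataset", "subtree": "roper",
                "per_page": 100, "start": start},
    )
    r.raise_for_status()
    items = r.json()["data"]["items"]
    if not items:
        break
    pids.extend(item["global_id"] for item in items)
    start += len(items)

print(f"{len(pids)} dataset records in the roper collection")
```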

cmbz commented 8 months ago

2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.

cmbz commented 8 months ago

2023/12/19: Roper's OAI server does not implement OAI Dublin Core as expected. Their approach is unclear. @landreev will contact them to follow up and determine next steps.

landreev commented 6 months ago

I don't have much in the way of a status update. I haven't been able to re-test their OAI server because it's been down or broken for the past few days; all of these calls are returning a 500:

- https://api.ropercenter.org/prod/api/oai2?verb=ListIdentifiers&metadataPrefix=oai_dc
- https://api.ropercenter.org/prod/api/oai2?verb=ListRecords&metadataPrefix=oai_dc
- https://api.ropercenter.org/prod/api/oai2?verb=GetRecord&identifier=10.25940/ROPER-31095120&metadataPrefix=oai_dc

On the other hand,

- https://api.ropercenter.org/prod/api/oai2?verb=ListSets
- https://api.ropercenter.org/prod/api/oai2?verb=ListMetadataFormats

are working, so their OAI server is still there - just not behaving properly. I really wanted to retest the output of their GetRecord and ListRecords implementations before I reach out to them again. The main problem with their records, the last time I tried harvesting from them, was that instead of the standard `<oai_dc:dc ...> ... </oai_dc:dc>` wrapper required by the OAI-PMH standard, their records were formatted with `<ns2>...</ns2>` (?). I would prefer to check whether that is still the case before reaching out and asking about it.
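
To make that spot-check easy to repeat, here's roughly what I mean (standard library only): fetch one GetRecord response and look at which element sits directly under `<metadata>`; for a compliant oai_dc record it should be `dc` in the `http://www.openarchives.org/OAI/2.0/oai_dc/` namespace.

```python
# Sketch: check whether a GetRecord response wraps its metadata in the
# standard <oai_dc:dc> element (namespace
# http://www.openarchives.org/OAI/2.0/oai_dc/), as OAI-PMH requires for
# oai_dc, or in something else (e.g. the <ns2> wrapper we saw last time).
import urllib.request
import xml.etree.ElementTree as ET

OAI = "http://www.openarchives.org/OAI/2.0/"
URL = ("https://api.ropercenter.org/prod/api/oai2"
       "?verb=GetRecord&identifier=10.25940/ROPER-31095120&metadataPrefix=oai_dc")

with urllib.request.urlopen(URL) as resp:
    root = ET.parse(resp).getroot()

metadata = root.find(f"{{{OAI}}}GetRecord/{{{OAI}}}record/{{{OAI}}}metadata")
if metadata is None or len(metadata) == 0:
    print("No metadata element found (OAI error response, or empty record?)")
else:
    # e.g. '{http://www.openarchives.org/OAI/2.0/oai_dc/}dc' for a compliant record
    print("Metadata wrapper element:", metadata[0].tag)
```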

I'm a little self-conscious following up on that RT ticket (330637), since it's so old and since we (I) have dropped the ball on it before. But if their OAI doesn't come back to life miraculously in the next couple of days, I'll reach out and ask.

jggautier commented 6 months ago

Email Debt Forgiveness Day is on Feb 29 😛

(I'm kidding of course!)

cmbz commented 6 months ago

Resized to 3 during sprint kickoff

landreev commented 6 months ago

Their OAI service was not showing any intent to "fix itself", so I finally emailed them via RT - hoping that the people on the other end of the ticket are still employed, and willing to talk to me.

landreev commented 6 months ago

Resumed communication with the developer(s) on the Roper side. Hopefully we'll nail it this time around.

Happy International Email Debt Forgiveness Day!

cmbz commented 1 month ago

2024/07/10