Closed jggautier closed 10 months ago
2023/12/19: Prioritized during meeting on 2023/12/18. Added to Needs Sizing.
2023/12/19: @jggautier and @landreev will retest to see if problem still exists, then determine next steps afterwards. Also, issue should be moved to dataverse.harvard.edu.
So, we have a configured SRDA harvesting client in prod. (harvesting one specific set, GESIS.SRDA). [Edit: this is us harvesting SRDA content from DataCite's OAI-PMH feed; this is mentioned in Julian's opening comment; the problem is being able to harvest from them directly]
There appears to be some content successfully harvested via this client relatively recently.
Their current working OAI endpoint appears to be https://srda.sinica.edu.tw/oai_pmh/oai. (not "/oai_pmh/oai2.php").
Creating a client with the url above appears to work. If you choose the single available set from the list (srda), however, an attempt to harvest fails with the noSetHierarchy response from their server. This is a problem on their end, for sure.
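For anyone retesting this, here is a minimal sketch of how the error code could be pulled out of an OAI-PMH response. The sample XML is a hand-written illustration of what a noSetHierarchy error looks like under the OAI-PMH 2.0 schema, not a response captured from SRDA's server, and the function name `oai_error_codes` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace used by OAI-PMH 2.0 response documents.
OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"

# Hand-written example of an error response (illustrative only,
# not copied from the SRDA server).
SAMPLE_RESPONSE = """
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <responseDate>2023-12-19T00:00:00Z</responseDate>
  <request verb="ListRecords" metadataPrefix="oai_dc" set="srda">https://srda.sinica.edu.tw/oai_pmh/oai</request>
  <error code="noSetHierarchy">The repository does not support sets.</error>
</OAI-PMH>
"""


def oai_error_codes(response_xml: str) -> list[str]:
    """Return the 'code' attribute of every <error> element in the response."""
    root = ET.fromstring(response_xml.strip())
    return [err.get("code", "") for err in root.iter(f"{OAI_NS}error")]


if __name__ == "__main__":
    print(oai_error_codes(SAMPLE_RESPONSE))
```

An empty list means the server reported no protocol-level error for that request.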
A second attempt, to create a client without selecting a set: this appears to work, I'm seeing some records being harvested: https://demo.dataverse.org/dataverse/srda/
However, you will notice that redirects to the remote locations are NOT working. ☹️
The following appears to be a problem: their OAI server is supplying the record identifiers like this: 10.6141/TW-SRDA-AN010012-1 - i.e., without the doi: prefix. This is a valid doi, and resolving it, as in https://doi.org/10.6141/TW-SRDA-AN010012-1, works. However our code appears to default to hdl: (!) - and that doesn't work of course. We just need to make this configurable on the client level, which protocol to default to when the prefix is not supplied.
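To illustrate the idea of a per-client default protocol, here is a rough sketch in Python. It is not the Dataverse (Java) code, and the names `to_persistent_id`, `resolver_url`, and `default_protocol` are made up for this example; the only assumption about behavior is the one described above, that an identifier without a prefix needs some protocol chosen for it.

```python
# Resolver base URLs for the two protocols discussed in this thread.
RESOLVERS = {
    "doi": "https://doi.org/",
    "hdl": "https://hdl.handle.net/",
}


def to_persistent_id(identifier: str, default_protocol: str = "doi") -> str:
    """Add a protocol prefix when the OAI server supplied a bare identifier.

    default_protocol stands in for the hypothetical per-client setting.
    """
    ident = identifier.strip()
    # Identifiers that already carry a known prefix are left untouched.
    if ident.lower().startswith(("doi:", "hdl:")):
        return ident
    return f"{default_protocol}:{ident}"


def resolver_url(persistent_id: str) -> str:
    """Build the redirect URL for a prefixed identifier."""
    protocol, _, value = persistent_id.partition(":")
    return RESOLVERS[protocol.lower()] + value


if __name__ == "__main__":
    pid = to_persistent_id("10.6141/TW-SRDA-AN010012-1")
    print(pid)
    print(resolver_url(pid))
```

With the default set to doi, the bare SRDA identifier resolves through doi.org; with the default left at hdl, the same identifier would be sent to the Handle resolver, which is the failure described above.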
I will open a new dev. issue for this. But creating a client for this repo is working just fine now, after all these years.
(to be precise, there's not one, but 2 different problems that prevent the redirects from working)
Opened the dev. issue for the redirect issues (linked above).
(this one itself is a non-dev. issue, no PR associated with it, so I dragged it into "In Review" directly, asking @jggautier to take a look before we close it)
Should this issue still be moved to the Harvard Dataverse repo?
It sounds like we should let the SRDA folks know that when we try to harvest their srda set, it fails with "the noSetHierarchy response from their server". Is that right? I'd be happy to email them to let them know that this is preventing us from harvesting from that set and ask if they can look into it.
Since we're able to harvest from them when we don't specify, I wonder if we can do that instead. I can also ask them if we can do that.
Since we are closing this issue, idk if it's worth moving it to the local repo - but, up to you.
Yes, we should just harvest from them without specifying the set. That's what their server supports. The only, minor problem on their end is that their server is for whatever reason advertising this unsupported set under ListSets. I mentioned it just to warn you not to select it when configuring the client. You may want to let them know. But no, it isn't preventing us from harvesting from them.
BTW, why was it important to harvest from them directly - as opposed to harvesting their records from Datacite, as set up in prod.? Unfortunately, the records in prod. harvested via that client are not properly redirecting at the moment - but that's because of the bug that I opened #10254 for (and I'm really hoping to fix it asap).
Is the content expected to be different, between what we get from their own OAI vs. Datacite? (it looks like there are different numbers of records served between the two).
I assumed that they created their own harvesting server and emailed us to avoid the issue(s) that Dataverse used to have with harvesting sets from DataCite. In our emails with them I pointed out that issue, but I didn't ask them explicitly why they want us to harvest from their own OAI. I also wondered if they wanted us to harvest from their own OAI because they wanted more control over what we harvested.
I've been planning to email them again with our progress. Want me to ask them why exactly they'd like us to harvest from their OAI instead of from DataCite?
I was just curious, really. It doesn't really matter. We can harvest either way, direct or via Datacite. And, once #10254 is addressed, the harvested records will even be useful/usable (as in, our users will be able to get to their site by clicking on the search records).
It's up to you - but maybe we should wait for it to be fixed before contacting them? So that we could show them harvested and working records, even if it's on one of our test servers - otherwise it just doesn't feel like "progress", when the harvested records are all broken - ?
Yeah I agree. I could email them to let them know that Harvard Dataverse isn't able to harvest from the set they asked us to harvest from and ask them why they'd like us to harvest from that set, as opposed to harvesting everything in the repository by not specifying a set, which will work once https://github.com/IQSS/dataverse/issues/10254 is addressed.
I emailed the folks at SRDA to let them know that a problem on their side prevents Harvard Dataverse from harvesting from their srda set, to let them know that Harvard Dataverse is able to harvest all of their metadata without specifying a set, and to ask if they would like Harvard Dataverse to harvest from that srda set or harvest without specifying the srda set.
I'd like to close this issue (for accounting purposes; I will also resize to 10, since we have put more work into it this week). Would you mind creating a new issue in the local repo, something like "Harvest metadata from SRDA", to keep track of the remaining effort? (or I can create it there)
Great, closing this issue sounds okay to me since SRDA folks let us know yesterday that it's fine for Harvard Dataverse to harvest from them without specifying a set and you wrote that Harvard Dataverse is able to do that.
I'll close this issue with this comment, adjust the harvesting client for SRDA, and start the re-harvesting today.
There's more info in our email thread with the folks at SRDA about what's going on with that srda set and why they're recommending using their OAI instead of harvesting from DataCite.
Great, thanks. As for starting a new harvest, I would wait until 6.1 is in prod.
Ah okay. Although I already edited the harvesting client and told Harvard Dataverse to re-harvest.
Why would you wait until 6.1 is in prod? Is it because until https://github.com/IQSS/dataverse/issues/10254 is addressed, clicking on the dataset titles won't lead users to the dataset, and that fix won't be applied to metadata that was harvested before https://github.com/IQSS/dataverse/issues/10254 is addressed?
I'm going to include a quick patch into 6.1 as deployed here that will fix the redirects, yes. It will fix all the existing records with broken redirects, so that's not a problem. I was just suggesting to wait until the redirects are fixed.
Ah okay. So I can create a new issue in the Harvard Dataverse repo like you suggested, to keep track of things to do to harvest SRDA's metadata.
I'm unable to create harvesting clients in the Harvard Dataverse Repository and Demo Dataverse repository using SRDA's own OAI-PMH feed. Its base URL is https://srda.sinica.edu.tw/oai_pmh/oai2.php.
Identifying it works - https://srda.sinica.edu.tw/oai_pmh/oai2.php?verb=Identify - and so does listing records - https://srda.sinica.edu.tw/oai_pmh/oai2.php?verb=ListRecords&metadataPrefix=oai_dc.
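For retesting after a base URL change, the two checks above can be rebuilt from whatever the current base URL is. This is just a small sketch using the standard library; `oai_request_url` is a hypothetical helper name, and the base URL shown is the one from this comment, which may no longer be current.

```python
from urllib.parse import urlencode


def oai_request_url(base_url: str, verb: str, **params: str) -> str:
    """Build an OAI-PMH request URL, e.g. for Identify or ListRecords.

    Extra keyword arguments (metadataPrefix, set, ...) become query
    parameters; leaving out 'set' requests records without a set filter.
    """
    query = {"verb": verb, **params}
    return f"{base_url}?{urlencode(query)}"


if __name__ == "__main__":
    base = "https://srda.sinica.edu.tw/oai_pmh/oai2.php"
    print(oai_request_url(base, "Identify"))
    print(oai_request_url(base, "ListRecords", metadataPrefix="oai_dc"))
```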
But when trying to create a client, Harvard Dataverse Repository and Demo Dataverse show errors about the base URL being an "Invalid URL. Failed to establish connection and receive a valid server response."
Harvard Dataverse Repository is harvesting SRDA's records into https://dataverse.harvard.edu/dataverse/srda_harvested, using DataCite's OAI-PMH feed. The SRDA admins created their own feed and emailed the repository's support to ask that Harvard Dataverse Repository use that feed instead of the records that DataCite has.
The SRDA repository's admins are troubleshooting and leaving updates in the support email thread at https://help.hmdc.harvard.edu/Ticket/Display.html?id=287243. They have already changed the base URL and may change it again, so when this is investigated, that email thread should be checked for the latest info.