Closed cessda-bitbucket-importer closed 3 years ago
Original comment by John Shepherdson (GitHub: john-shepherdson).
When I run this command https://dataverse.arch.be/oai?verb=ListIdentifiers&metadataPrefix=oai_ddi
all the record headers that are returned are marked as deleted.
Also, there are a couple of errors and 1 recommendation flagged by the OVAL BASE OAI-PMH validator
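For anyone wanting to reproduce the observation above, here is a minimal sketch of how the record headers in a ListIdentifiers response could be tallied by status. It assumes the response has already been fetched and is available as a string; the function name `count_header_statuses` and the identifiers in the sample are hypothetical, not taken from the SODHA endpoint. In OAI-PMH, a header without a `status` attribute is treated here as active.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def count_header_statuses(xml_text: str) -> Counter:
    """Count record headers by status ("deleted" or, if absent, "active")."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for header in root.iterfind(".//oai:header", OAI_NS):
        counts[header.get("status", "active")] += 1
    return counts


# Hypothetical sample response (identifiers are placeholders).
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListIdentifiers>
    <header status="deleted">
      <identifier>doi:10.0000/EXAMPLE/AAA</identifier>
      <datestamp>2020-06-17</datestamp>
    </header>
    <header status="deleted">
      <identifier>doi:10.0000/EXAMPLE/BBB</identifier>
      <datestamp>2020-06-17</datestamp>
    </header>
    <header>
      <identifier>doi:10.0000/EXAMPLE/CCC</identifier>
      <datestamp>2020-06-17</datestamp>
    </header>
  </ListIdentifiers>
</OAI-PMH>"""

if __name__ == "__main__":
    print(count_header_statuses(SAMPLE))
```

If every header comes back with `status="deleted"`, the counter will contain only the `deleted` key, which matches the symptom described above.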
Original comment by John Shepherdson (GitHub: john-shepherdson).
Hello John,
Thanks for your quick feedback.
Since our goal really is to be the Belgian CESSDA service provider, we configured our server endpoint so that, in principle, all published records can be harvested without exception. We did so by entering our DOI prefix as the harvesting criterion (“Definition Query” in Dataverse) in the PersistentId field:
Regarding the very first error that you received, we believe this is intrinsic to Dataverse. We mentioned it to IQSS, the developers of Dataverse, and they acknowledged it, though it seems like it is not a major hindrance: https://github.com/IQSS/dataverse/issues/4597
As for the Dublin Core-related errors, I am not sure what to make of them. Doesn’t your client endpoint look for DDI records instead of Dublin Core?
Best regards,
Benjamin
Original comment by John Shepherdson (GitHub: john-shepherdson).
Benjamin,
I am not familiar with the configuration of Dataverse OAI-PMH endpoints, but when I run this query https://dataverse.arch.be/oai?verb=ListSets I get
<request verb="ListSets">
https://dataverse.arch.be/oai
</request>
<error code="noSetHierarchy">
This repository does not support sets
</error>
So I don't see how to make a request that uses a set name or set spec.
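As an illustration of the response above, the following sketch shows one way a client could detect OAI-PMH error conditions such as `noSetHierarchy` before attempting a set-based request. The helper name `find_oai_errors` is hypothetical, and the sample simply wraps the response quoted above in the standard OAI-PMH envelope; it is not a capture of the live endpoint.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def find_oai_errors(xml_text: str) -> list[tuple[str, str]]:
    """Return (code, message) pairs for each <error> element in a response."""
    root = ET.fromstring(xml_text)
    return [
        (error.get("code", ""), (error.text or "").strip())
        for error in root.iterfind("oai:error", OAI_NS)
    ]


# The ListSets response quoted above, wrapped in the OAI-PMH envelope.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <request verb="ListSets">https://dataverse.arch.be/oai</request>
  <error code="noSetHierarchy">
    This repository does not support sets
  </error>
</OAI-PMH>"""

if __name__ == "__main__":
    for code, message in find_oai_errors(SAMPLE):
        print(f"{code}: {message}")
```

A harvester could use a check like this to fall back to harvesting without a `set` argument when the repository reports that it has no set hierarchy.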
Re the two format errors, yes that does seem to be a feature of Dataverse. I've run the validator against another Dataverse endpoint and got the same errors. I wasn't sure if it was something that could be fixed by changing the default configuration. Possibly not, from what you are saying.
Re the DC errors, you are correct that we are only interested in harvesting DDI. I guess the validator mentions it because the OAI-PMH standard says that an endpoint must serve valid DC; anything else is a bonus. So that point is for your information only.
Regards,
John
Original comment by John Shepherdson (GitHub: john-shepherdson).
From: Shepherdson, John W
Sent: 17 June 2020 12:20
To: Peuch Benjamin
Cc: Ouahalou Youssef; support AT cessda
Subject: Re: OAI-PMH - SODA (Belgium)
Benjamin,
Congratulations, you are in the CESSDA Catalogue.
You can see the 1 available SODA record in the test version of the CESSDA Catalogue here:
username: xxxxx
password: XXXXXXX
However, it appears in the English records section, because you either have not set the language in the record header, or you have set it incorrectly. As per the feedback from the BASE validator, you must use the two-letter language code, not "French". English is the default if no other language setting is detected.
The 'Go to publisher' button is not present, as the StudyUrl value is not set.
Regards,
John
Original comment by John Shepherdson (GitHub: john-shepherdson).
From: John Shepherdson
Date: Tue, 14 Jul 2020 at 14:43
Subject: Re: OAI-PMH - SODA (Belgium)
To: Peuch Benjamin
Cc: Ouahalou Youssef
Benjamin,
For development purposes, we have a workaround whereby we set the default language of your records from English to French.
Not a perfect solution, as some of your records are partially in English (Study Title, Creator, Study Persistent Identifier, Abstract) as can be seen in the screenshot.
Besides that, what we are seeing is that you have 42 records, of which 38 are marked as deleted,
3 are marked as active:
doi:10.34934/DVN/6DXRZR
doi:10.34934/DVN/8H1PTW
doi:10.34934/DVN/ZOSBZG
and one marked as inactive:
doi:10.18419/darus-444
Is this what you were expecting us to see?
Regards,
John
Original comment by John Shepherdson (GitHub: john-shepherdson).
Hello John,
Sorry for not replying to your last mail.
We did some research into the problems we encountered and for now we have only run into dead ends:
· IQSS confirmed to us that the DDI output of Dataverse cannot be altered within the software;
· DANS have not yet configured their OAI-PMH for the CDC;
· And we haven’t heard from AUSSDA in a while (I poked them just today).
I’m afraid we’re going to have to configure middleware that reprocesses Dataverse’s DDI output to make it CDC-compliant.
Thank you for the language workaround. Indeed, we foresee a lot of multilingualism in our Dataverse on account of Belgium having three official languages.
To your question: Is this what you were expecting us to see?
The answer is: Yes, for now, everything that should filter through, dataset-wise, successfully did.
I reckon that, since there are several problems with our current metadata in the light of the CDC rules, we had better hold off on the OAI harvesting until we have developed a solution to adapt our DDI output so that it is fully CDC-compliant. We would especially not want to have early, non-conforming records that linger behind the later, compliant ones.
Does this sound like a good idea to you?
Best regards,
Benjamin
Original comment by John Shepherdson (GitHub: john-shepherdson).
Benjamin,
When it comes to multilingual records, you will certainly need some way of modifying your output. Our workaround is monolingual.
I will hold off on making your endpoint available in production until you are ready.
Continuing to harvest it in dev and staging instances will provide a feedback mechanism for you.
Regards,
John
Original comment by John Shepherdson (GitHub: john-shepherdson).
Hello John,
Yes indeed. We are glad that we can run these tests and ensure that we put out clean records for the CDC to harvest.
We will come back to you once we make progress on this front. Thanks for your help.
Best regards,
Benjamin
Original comment by John Shepherdson (GitHub: john-shepherdson).
Benjamin,
Your endpoint isn't compliant with the Deleted Records part of the OAI-PMH spec, which states that deleted records must return a header when a GetRecord request is made.
This has caused the CDC Harvester to throw errors, which is how we noticed the problem.
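To make the requirement concrete, here is a rough sketch of a check on a GetRecord response, assuming the response body is already available as a string. It relies on the OAI-PMH rule that a deleted record is returned as a header with `status="deleted"` and without a metadata part. The function name `check_get_record`, the result labels, and the sample identifier are hypothetical; this is not the CDC Harvester's actual logic.

```python
import xml.etree.ElementTree as ET

# Namespace used by all OAI-PMH 2.0 responses.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}


def check_get_record(xml_text: str) -> str:
    """Classify a GetRecord response with a short, hypothetical label."""
    root = ET.fromstring(xml_text)
    record = root.find("oai:GetRecord/oai:record", OAI_NS)
    if record is None:
        return "no-record"
    header = record.find("oai:header", OAI_NS)
    if header is None:
        return "missing-header"
    if header.get("status") == "deleted":
        # A deleted record must not carry a metadata part.
        if record.find("oai:metadata", OAI_NS) is not None:
            return "deleted-with-metadata"
        return "deleted-ok"
    return "active"


# Hypothetical compliant response for a deleted record.
DELETED_OK = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record>
      <header status="deleted">
        <identifier>doi:10.0000/EXAMPLE/AAA</identifier>
        <datestamp>2020-07-14</datestamp>
      </header>
    </record>
  </GetRecord>
</OAI-PMH>"""

# Hypothetical non-compliant response: the record has no header at all.
MISSING_HEADER = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <GetRecord>
    <record/>
  </GetRecord>
</OAI-PMH>"""

if __name__ == "__main__":
    print(check_get_record(DELETED_OK))
    print(check_get_record(MISSING_HEADER))
```

Running a check like this against each identifier listed as deleted would show which GetRecord responses lack the expected header.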
Regards,
John
Original comment by John Shepherdson (GitHub: john-shepherdson).
Hello John,
Thank you for your message. We are going to look into this.
I should also mention that, because of the problems we ran into, we are not going to make our metadata harvestable for the CDC right away. We are launching our data archive in October of this year so we are currently focussing on this, but as soon as the machine is up and running and we have more time on our hands, we will build the middleware we mentioned to process the Dataverse JSON outputs and recreate CMM- and CDC-compliant DDI.
I’ll keep you posted in any case.
Thanks again and best regards,
Benjamin
Original comment by John Shepherdson (GitHub: john-shepherdson).
Benjamin,
Thanks for the update. We are aiming to release a version of CDC in January 2021 which features some new endpoints. Hopefully we will be able to include yours.
Regards,
John
Original comment by John Shepherdson (GitHub: john-shepherdson).
Check whether or not to include SODHA endpoint before releasing v2.3
Original comment by John Shepherdson (GitHub: john-shepherdson).
Benjamin,
Belated congratulations, this is very good news.
I'm following up on our previous correspondence regarding whether or not your OAI-PMH endpoint should be included in the next reload of the Data Catalogue, which is due next week.
I tested your endpoint URL today (https://dataverse.arch.be/oai?verb=Identify) and was redirected to https://www.sodha.be. From that, and your earlier statement ('We plan to develop middleware to augment our output and make it CDC-compliant'), I conclude that it is not yet ready for inclusion. Is that correct?
Regards,
John
Original comment by John Shepherdson (GitHub: john-shepherdson).
Hello John,
Unfortunately yes, that is correct: the middleware is not ready yet.
It is however one of our priorities. I will keep you posted.
Regarding URLs, we have indeed changed that of our endpoint. It is now https://www.sodha.be. This should not change anytime soon.
Best regards,
Benjamin
Original comment by John Shepherdson (GitHub: john-shepherdson).
@matthew-morris-cessda Please remove SODHA from list of endpoints to harvest
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
SODHA will not be harvested in 2.3.0
Original comment by Matthew Morris (GitHub: matthew-morris-cessda).
The indices will need to be cleared to remove any remaining SODHA studies.
Original report on BitBucket by John Shepherdson (GitHub: john-shepherdson).
From: Peuch Benjamin
Sent: 17 June 2020 10:06
To: Shepherdson, John W
Cc: Ouahalou Youssef
Subject: RE: OAI-PMH - SODA (Belgium)
Hello John,
I believe this is it: https://dataverse.arch.be/oai?verb=Identify
There is only one published dataset in our Dataverse at the moment, but we could successfully harvest it during internal testing.
Best regards,
Benjamin