cessda / cessda.cdc.versions

Issue tracker and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Add DANS Endpoint #667

Open john-shepherdson opened 5 months ago

john-shepherdson commented 5 months ago

Hi everyone, I would like to solve this so that our metadata can be included in the CDC. My colleagues told me that we have two sets (an English and a Dutch one), and I believe these were previously used to get our metadata into the CDC:

https://ssh.datastations.nl/oai?verb=ListRecords&metadataPrefix=oai_dc&set=CESSDA-EN
https://ssh.datastations.nl/oai?verb=ListRecords&metadataPrefix=oai_dc&set=CESSDA-NL

Do I understand correctly that having these sets is not sufficient? Our metadata does include a "language of metadata" attribute, which has now even been made mandatory, but it might not be included in all of the metadata exports Dataverse provides.

If these sets are insufficient, could you give us more details about what would be required and what is harvested from Dataverse? Since other SPs using Dataverse are included in the CDC I hope we could adjust our exports to comply with your requirements.

Many thanks. Ricarda

Originally posted by @RicardaBraukmann in https://github.com/cessda/cessda.cdc.versions/issues/662#issuecomment-2180919866

We could use the sets as above, but would have to treat them as 2 different endpoints with different names and different default languages. It might be confusing for users to see publishers called (for example) 'DANS-KNAW (English)' and 'DANS-KNAW (Dutch)' - also, the names would not comply with the Publisher names CV (https://vocabularies.cessda.eu/vocabulary/CdcPublisherNames?lang=en)

Originally posted by @john-shepherdson in https://github.com/cessda/cessda.cdc.versions/issues/662#issuecomment-2181069028

Thanks @john-shepherdson for looking into it. We would prefer to be included in some way as soon as possible, so if what you say is possible that would be great. Alternatively, you could for now harvest the English records only, as I believe those will be most relevant for CDC users, and that set is also the larger of the two.

Of course we want to be included in full as soon as possible, so it would be great if we can discuss how that can be achieved.

Can you specify what we need to do in order to be harvested through our regular endpoint?

We have language of metadata information in our metadata in a custom block so the information is available for most datasets. I am not sure how you harvest the Dataverse instances (i.e. what metadata schema do you use), and what adjustments we would need to make to comply with the requirements? I am happy to connect you with our technical team as well as they know better how things are currently implemented.

Originally posted by @RicardaBraukmann in https://github.com/cessda/cessda.cdc.versions/issues/662#issuecomment-2185835141

See also #662

john-shepherdson commented 5 months ago

We could use the existing sets:

https://ssh.datastations.nl/oai?verb=ListRecords&metadataPrefix=oai_dc&set=CESSDA-EN
https://ssh.datastations.nl/oai?verb=ListRecords&metadataPrefix=oai_dc&set=CESSDA-NL

but would have to treat them as 2 different endpoints with different names and different default languages. It might be confusing for users to see publishers called (for example) 'DANS-KNAW (English)' and 'DANS-KNAW (Dutch)' - also, the names would not comply with the Publisher names CV (https://vocabularies.cessda.eu/vocabulary/CdcPublisherNames?lang=en)
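As a rough illustration of the two-endpoint approach described above, the two requests differ only in their `set` argument. The sketch below builds both OAI-PMH `ListRecords` URLs from the base URL quoted in this thread and pairs each set with a default language. The helper `list_records_url` and the `SET_DEFAULT_LANGUAGE` mapping are hypothetical names used only for illustration; they are not taken from the CDC harvester configuration.

```python
from urllib.parse import urlencode

# Base URL and set names as quoted in this thread.
BASE_URL = "https://ssh.datastations.nl/oai"

# Hypothetical mapping: each set treated as its own endpoint
# with its own default language.
SET_DEFAULT_LANGUAGE = {"CESSDA-EN": "en", "CESSDA-NL": "nl"}


def list_records_url(base_url, metadata_prefix, set_spec=None):
    """Build an OAI-PMH ListRecords request URL (illustrative helper)."""
    params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    if set_spec is not None:
        params["set"] = set_spec
    return f"{base_url}?{urlencode(params)}"


urls = {
    set_spec: list_records_url(BASE_URL, "oai_dc", set_spec)
    for set_spec in SET_DEFAULT_LANGUAGE
}

for set_spec, url in urls.items():
    print(SET_DEFAULT_LANGUAGE[set_spec], url)
```

Treating each set as a separate endpoint is what leads to the two publisher names mentioned above; harvesting only `CESSDA-EN` would avoid that at the cost of omitting the Dutch records.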

john-shepherdson commented 5 months ago

Ricarda Braukmann wrote: "Thanks @john-shepherdson for looking into it. We would prefer to be included in some way as soon as possible, so if what you say is possible that would be great. Alternatively, you could for now harvest the English records only, as I believe those will be most relevant for CDC users, and that set is also the larger of the two.

Of course we want to be included in full as soon as possible, so it would be great if we can discuss how that can be achieved.

Can you specify what we need to do in order to be harvested through our regular endpoint?

We have language of metadata information in our metadata in a custom block so the information is available for most datasets. I am not sure how you harvest the Dataverse instances (i.e. what metadata schema do you use), and what adjustments we would need to make to comply with the requirements? I am happy to connect you with our technical team as well as they know better how things are currently implemented."

john-shepherdson commented 5 months ago

@KristinaS4 @MortenSikt Your thoughts please.

matthew-morris-cessda commented 5 months ago

Added quotes from @RicardaBraukmann in the issue description

KristinaS4 commented 5 months ago

@john-shepherdson Do you know the status of the language tags for their Dataverse endpoint? It seems they are willing to adjust their exports to comply with the CDC's requirements. This would of course be the optimal solution. Can we do more to support this?

I agree that it will be confusing for users if there are two publishers called DANS-KNAW and in that case I think we should only include the English records as suggested by Ricarda to limit it to one publisher.

john-shepherdson commented 5 months ago

I am not aware that the language tags have been added (but maybe Matthew could have a look at some recently harvested DANS XML files to confirm) in which case the short term fix is only to include the records from the English endpoint.

LauraHuisintveld commented 4 months ago

Dear all, At DANS we do not have the language tags added to our DDI export via OAI-PMH. There is a Dataverse setting for it, but we have not enabled it. I will discuss with Ricarda what is the best way to solve the issue.

alen-vodopijevec-cessda commented 2 months ago

Dear @LauraHuisintveld @RicardaBraukmann

Following up on the e-mail exchange: the CESSDA Metadata Validator is available at https://cmv.cessda.eu/#!validation

Please use it to validate your outputs and get back to us when successful, or if you run into any issues in the meantime.

From previous discussions:

Please check the following example for a valid language specification.

Language can be specified at the document level within the element, or at the individual element level within a specific tag.

More examples can be found here
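The example referred to above did not survive in this copy of the thread. As a stand-in, here is a minimal sketch of how language tags could be inspected, assuming the export is DDI Codebook 2.5 (namespace `ddi:codebook:2_5`) and that language is declared with the standard `xml:lang` attribute, either on the root element or on individual elements. The sample document and the function name `effective_languages` are made up for illustration and are not taken from the DANS export or the CDC code.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the fixed XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Made-up sample: document-level language on the root element,
# overridden at element level for a Dutch parallel title.
SAMPLE = """<codeBook xmlns="ddi:codebook:2_5" xml:lang="en">
  <stdyDscr>
    <citation>
      <titlStmt>
        <titl>Example study</titl>
        <parTitl xml:lang="nl">Voorbeeldstudie</parTitl>
      </titlStmt>
    </citation>
  </stdyDscr>
</codeBook>"""


def effective_languages(root):
    """Return (local tag, effective xml:lang, text) for each element with text.

    An element without its own xml:lang inherits the nearest ancestor's value.
    """
    results = []

    def walk(elem, inherited):
        lang = elem.get(XML_LANG, inherited)
        text = (elem.text or "").strip()
        if text:
            results.append((elem.tag.split("}")[-1], lang, text))
        for child in elem:
            walk(child, lang)

    walk(root, None)
    return results


langs = effective_languages(ET.fromstring(SAMPLE))
for tag, lang, text in langs:
    print(tag, lang, text)
```

An element reported with a language of `None` would have no language declared at either level, which is the situation the validator is expected to flag.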

LauraHuisintveld commented 1 month ago

@alen-vodopijevec-cessda Yes, we will test our output with the validator, a very useful tool. One question, should we test with DDI Profile 2.0, or with 1.04 which is also still available within the validator?

alen-vodopijevec-cessda commented 1 month ago

You should use 1.04; passing this validation will guarantee compliance with the CDC. You can also try the 2.0 profile - I am just curious about the results, as it is stricter.

LauraHuisintveld commented 1 month ago

Thanks, we can let you know our results once we are ready. I already played around with the validator a bit, and I have another question. Sometimes our users have used HTML code within the . In one case, this resulted in a schema violation: a <b> and a <ul> element were used. Is this specific to the validator tool, or will the Data Catalogue also reject these records?

matthew-morris-cessda commented 1 month ago

The Data Catalogue will accept these records

LauraHuisintveld commented 1 month ago

Dear all, We have tested our new OAI-PMH output, and we think it is ready now. Could you please test this link and let us know if it works without problems? https://oai-service.labs.dans.knaw.nl/ss/oai?verb=ListRecords&set=social_sciences&metadataPrefix=oai_ddi

MortenSikt commented 1 month ago

I can see in staging that 6749 records are found from DANS-KNAW. Some metadata is presented in the results list, but clicking on any record does not lead to a valid page (it's just blank for me).

Not sure if this is before or after configuration @matthew-morris-cessda ?

matthew-morris-cessda commented 1 month ago

I've only just implemented this configuration change so this isn't available yet. I'll update the issue when it is.

LauraHuisintveld commented 3 weeks ago

@matthew-morris-cessda I was wondering if there is any news? Are there still some problems we need to solve at our side?

matthew-morris-cessda commented 2 weeks ago

@LauraHuisintveld Dutch language content is still tagged as English on this endpoint
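One way to narrow down a mismatch like this is to tally the declared `xml:lang` values per record in a `ListRecords` response and compare them against the actual text. The sketch below is only an illustration under stated assumptions: the response uses the standard OAI-PMH 2.0 namespace, the payload is DDI Codebook 2.5, and the sample record (identifier `example:1`, Dutch text tagged `en`) is invented. The function name `declared_language_counts` is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"

# Invented sample: Dutch abstract text that is tagged as English.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>example:1</identifier></header>
      <metadata>
        <codeBook xmlns="ddi:codebook:2_5" xml:lang="en">
          <stdyDscr>
            <stdyInfo>
              <abstract xml:lang="en">Een voorbeeldsamenvatting</abstract>
            </stdyInfo>
          </stdyDscr>
        </codeBook>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>"""


def declared_language_counts(oai_xml):
    """Map each record identifier to a Counter of explicitly declared xml:lang values."""
    root = ET.fromstring(oai_xml)
    counts = {}
    for record in root.iter(f"{OAI_NS}record"):
        identifier = record.findtext(f"{OAI_NS}header/{OAI_NS}identifier")
        counts[identifier] = Counter(
            elem.get(XML_LANG) for elem in record.iter() if XML_LANG in elem.attrib
        )
    return counts


counts = declared_language_counts(SAMPLE_RESPONSE)
for identifier, tally in counts.items():
    print(identifier, dict(tally))
```

The tally only reports what the export declares; deciding whether the text is really Dutch still requires a manual look or a separate language check.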

LauraHuisintveld commented 1 week ago

@matthew-morris-cessda Hmmm, I can see it too now. Maybe something went wrong, I will ask my colleague to take a look and will let you know when to try again.

LauraHuisintveld commented 1 week ago

Hi @matthew-morris-cessda We have found the problem and did a new deploy. Could you please try again? The URL remains the same: https://oai-service.labs.dans.knaw.nl/ss/oai?verb=ListRecords&set=social_sciences&metadataPrefix=oai_ddi

matthew-morris-cessda commented 1 week ago

The problem still persists
