cessda / cessda.cdc.versions

Issue tracker and wiki for the CESSDA Data Catalogue
https://datacatalogue.cessda.eu/
Apache License 2.0

Add support for harvesting single URL with multiple metadata prefixes #96

Closed cessda-bitbucket-importer closed 3 years ago

cessda-bitbucket-importer commented 5 years ago

Original report on BitBucket by John Shepherdson (GitHub: john-shepherdson).


Some SPs (such as DANS) have a single URL that supports multiple metadata prefixes, in order to distinguish between metadata records in different languages.

cessda-bitbucket-importer commented 5 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).


Questions

  1. Can this be fixed at source?

If not, questions to consider

  1. What are the rules or standards for this prefix, so that I can write an algorithm that matches what I currently see for DANS [ oai_ddi25_nl | oai_ddi25_en ] and for other repos in future? If there are none, I can work around this, but the configuration for adding repos will be a little more verbose.
  2. When a metadataPrefix configuration also holds the lang, should this lang override the record's @xml:lang found in the <codebook> and any other sub @xml:lang found in the given XML record?

Proposed solution

Meanwhile, I am working on a solution that simply enables harvesting multiple metadataPrefixes for the same repo URL, without changing any of the code paths that make decisions based on the @xml:lang. We can analyse the processed records and iterate on this solution. So far, all the DANS records I have looked at seem well formed, with @xml:lang in the right places regardless of the metadataPrefix, so I am confident this would work just fine. Let me know if this is good enough, @john-shepherdson, and then test it.
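
To illustrate the idea, here is a minimal Python sketch, not the CDC code itself. It assumes a hypothetical configuration shape (`REPO`, with a `metadata_prefixes` list) and a hypothetical helper `list_records_urls`; only the OAI-PMH `ListRecords` verb and `metadataPrefix` parameter come from the protocol. It shows one base URL expanding into one harvest request per prefix, leaving all language handling untouched.

```python
from urllib.parse import urlencode

# Hypothetical configuration shape: one base URL, several metadata prefixes.
REPO = {
    "url": "https://easy.dans.knaw.nl/oai/",
    "metadata_prefixes": ["oai_ddi25_nl", "oai_ddi25_en"],
}


def list_records_urls(repo):
    """Return one OAI-PMH ListRecords request URL per configured prefix."""
    return [
        repo["url"] + "?" + urlencode({"verb": "ListRecords", "metadataPrefix": prefix})
        for prefix in repo["metadata_prefixes"]
    ]


for url in list_records_urls(REPO):
    print(url)
```

Each resulting URL would then be harvested exactly as a separate single-prefix repo is today, which is why the rest of the logic should not need to change.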

Side notes

The metadata standards recommend that the record itself declares the language of its content, so that applications processing it can determine this efficiently at run time. See the examples below from DANS, which declare this.

<ddi:codebook ... version="2.5" xml:lang="nl">

https://easy.dans.knaw.nl/oai/?verb=GetRecord&metadataPrefix=oai_ddi25_nl&identifier=oai:easy.dans.knaw.nl:easy-dataset:5181

<ddi:codebook ... version="2.5" xml:lang="en">

https://easy.dans.knaw.nl/oai/?verb=GetRecord&metadataPrefix=oai_ddi25_en&identifier=oai:easy.dans.knaw.nl:easy-dataset:65449
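
As a rough sketch of how an application can read that self-declared language, the Python below uses the standard library's `xml.etree.ElementTree`. The helper name `declared_language`, the trimmed sample record and its namespace URI are assumptions for illustration and are not taken from the CDC code base.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:lang under the reserved XML namespace.
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def declared_language(record_xml):
    """Return the xml:lang on the record's root element, or None if absent."""
    root = ET.fromstring(record_xml)
    return root.get(XML_LANG)


# Trimmed, made-up sample record; the namespace URI is an assumption.
sample = '<ddi:codebook xmlns:ddi="ddi:codebook:2_5" version="2.5" xml:lang="nl"/>'
print(declared_language(sample))
```

The lookup does not depend on which metadataPrefix was used to fetch the record, which is the property the proposal relies on.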

Reasons for the proposed solution above:

There is already quite a lot of logic in the code base that works on the premise that the XML record self-declares its content language, and that takes other routes if it is not declared. Another part of the code base then figures out which other languages a particular study is available in. I would not be surprised to find more business rules around this expectation that would need to be reworked.

cessda-bitbucket-importer commented 5 years ago

Original comment by Cessda Techframe (GitHub: cessda).


Proposed solution looks good - the DANS situation is a one off, so I don’t want to change everything else because of this corner case. If necessary, I can ask them to either use a single metadata prefix, or a reverse proxy to provide 2 distinct URLs.

cessda-bitbucket-importer commented 5 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Answers to questions:

  1. No rules, this should be done via the URL rather than the metadata prefix.
  2. No, the lang in the file takes precedence.
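
The precedence rule in answer 2 could be sketched in Python as follows. The `PREFIX_LANG` mapping and the `effective_language` helper are hypothetical, and falling back to a configured hint when the record declares no language is an assumption rather than something agreed in this thread.

```python
# Hypothetical, explicitly configured language hint per metadataPrefix.
# It is a lookup table rather than a parsing rule, since answer 1 says
# there are no rules for how prefixes are named.
PREFIX_LANG = {"oai_ddi25_nl": "nl", "oai_ddi25_en": "en"}


def effective_language(record_lang, metadata_prefix):
    """The language declared in the record wins over any prefix-based hint."""
    if record_lang:
        return record_lang
    # Assumed fallback: use the configured hint, or None if there is none.
    return PREFIX_LANG.get(metadata_prefix)
```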

cessda-bitbucket-importer commented 5 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).


@john-shepherdson

RE: If necessary, I can ask them to either use a single metadata prefix, or a reverse proxy to provide 2 distinct URLs.

Yes, this is a better solution than mine.

I will pause making the proposed change and recommend that you make this request to DANS. I will tackle the next priority items instead.

I will leave the current WIP branches as is for the next few days and eventually tag and delete the branches.

cessda-bitbucket-importer commented 5 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).


Tags:

  1. https://github.com/cessda/cessda.pasc.osmh-indexer.cmm/blob/d820f5acfb64eb7c45bffc4b2a00b016537e90a8/?at=%2396_harvesting-single-url-multiple-metadataformat
  2. https://github.com/cessda/cessda.pasc.osmh-repository-handler.oai-pmh/blob/e3105750f874928850bac2caedc7f48082cc0a7c/?at=%2396_harvesting-single-url-multiple-metadataformat
  3. https://github.com/cessda/cessda.pasc.osmh-repository-handler.nesstar/blob/4e57796573a0f225d65df0ac5b707171a4acc4a3/?at=%2396_harvesting-single-url-multiple-metadataformat

cessda-bitbucket-importer commented 5 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Awaiting testing and sign-off

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Estimate of effort required to diagnose and fix: 0.5 day (CONTRACTOR)

cessda-bitbucket-importer commented 4 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).


Before I make a start on this, @john-shepherdson, did you have any luck getting DANS to provide a second URL?

Cessda Techframe

Proposed solution looks good - the DANS situation is a one off, so I don’t want to change everything else because of this corner case. If necessary, I can ask them to either use a single metadata prefix, or a reverse proxy to provide 2 distinct URLs.

I would strongly recommend this over adding custom code/rules for DANS, who have gone against the standards and CDC expectations. Adding a reverse proxy to handle multiple metadata prefixes via the same endpoint is far cleaner, cheaper and more maintainable.

Last option: we can still do it in CDC, but it would definitely be more than your estimated 0.5 days' worth of work, as it would break some of the core patterns/logic assumptions made between the CDC OSMH Consumer Indexer and the OSMH Handlers. All current configurations for SPs in the property files across the CDC applications would most likely have to change too.

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


No luck so far, will recontact them.

cessda-bitbucket-importer commented 4 years ago

Original comment by Moses Mansaray (GitHub: doraVentures).


I can’t progress with this until I hear back from you @john-shepherdson re: SP to provide 2 distinct URLs for harvesting.

Assigning to you.

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Unlikely to get a response from DANS until January 2020.

cessda-bitbucket-importer commented 4 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Add to backlog of next version

cessda-bitbucket-importer commented 3 years ago

Original comment by John Shepherdson (GitHub: john-shepherdson).


Solved by issue #280