cessda / cessda.cvs.two

Duplicates in SKOS file output #604

Closed john-shepherdson closed 1 year ago

john-shepherdson commented 1 year ago

Sanda Ionescu, on behalf of the DDI Alliance reported that:

"SKOS file output by CESSDA duplicates every label and description line for each language."

Maja will follow up with Sanda to try to get some concrete examples, as an initial check was unable to reproduce the problem.

darrenbell2 commented 1 year ago

@OliverHopt could you provide an example?

MajaDolinar commented 1 year ago

Message from Sanda (DDI): "I have looked at the exports from the CESSDA tool (webpage) and there is no duplication of entries, neither in the individual language exports nor in the multiple language exports. The duplication appears only in the RDF exports on the test site for the DDI Alliance CVs (Controlled Vocabularies - Overview Table of Latest Versions | Data Documentation Initiative, ddialliance.org). It seems to me that this is an issue that appears in the process of updating the DDI Alliance page, and should be a 'bug' in the 'pipeline' that was built to copy over the CVs from the CESSDA page to the DDI Alliance page."

OliverHopt commented 1 year ago

The frontend download access is not machine-actionable (as far as I can see). The API I use is the following: https://vocabularies.cessda.eu/v2/vocabularies// along with the desired content type in the header. This access point still delivers the doubled language entries.

If there is a change in the API, it would be nice to be informed about it.

Joshocan commented 1 year ago

@OliverHopt Could you please provide your query and the corresponding response? I did not get duplicates when I performed the test using Swagger at https://api.tech.cessda.eu/#/vocabulary-resource-v-2/

OliverHopt commented 1 year ago

@Joshocan I reproduced the duplication in Swagger using the following values.

Curl is: curl -X 'GET' 'https://vocabularies.cessda.eu/v2/vocabularies/GeneralDataFormat/2.0.3' -H 'accept: application/xml'

Using the old two-digit versioning, the API responds without duplicates.
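For scripted checks, the curl call above can be expressed in Python. This is only a minimal sketch using the standard library: the helper names `build_request` and `fetch` are hypothetical, and the URL pattern is taken from the curl command in this thread rather than from any API documentation.

```python
import urllib.request

# Base URL as used in the curl command in this thread.
BASE_URL = "https://vocabularies.cessda.eu/v2/vocabularies"


def build_request(vocabulary: str, version: str,
                  accept: str = "application/xml") -> urllib.request.Request:
    """Build a GET request for one vocabulary version (hypothetical helper)."""
    url = f"{BASE_URL}/{vocabulary}/{version}"
    # The desired representation is selected via the Accept header.
    return urllib.request.Request(url, headers={"Accept": accept}, method="GET")


def fetch(vocabulary: str, version: str,
          accept: str = "application/xml") -> bytes:
    """Send the request and return the raw response body (hypothetical helper)."""
    with urllib.request.urlopen(build_request(vocabulary, version, accept),
                                timeout=30) as response:
        return response.read()


if __name__ == "__main__":
    # Same vocabulary and version as in the curl example.
    body = fetch("GeneralDataFormat", "2.0.3")
    print(body[:300].decode("utf-8", errors="replace"))
```

Passing a different `accept` value (for example `application/json`) should allow the same call to be compared across representations, assuming the endpoint supports that content type.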

matthew-morris-cessda commented 1 year ago

Reproduction successful

...
    <rdf:Description rdf:about="http://rdf-vocabulary.ddialliance.org/cv/GeneralDataFormat/2.0.3/">
        <rdf:type rdf:resource="http://www.w3.org/2004/02/skos/core#ConceptScheme"/>
        <dcterms:isVersionOf rdf:resource="http://rdf-vocabulary.ddialliance.org/cv/GeneralDataFormat"/>
        <skos:notation>GeneralDataFormat</skos:notation>
        <dcterms:title xml:lang="en">General Data Format</dcterms:title>
        <dcterms:description xml:lang="en">Describes the physical format(s) of the data documented in the logical product(s) of a study unit.</dcterms:description>
        <dcterms:title xml:lang="en">General Data Format</dcterms:title>
        <dcterms:description xml:lang="en">Describes the physical format(s) of the data documented in the logical product(s) of a study unit.</dcterms:description>
        <dcterms:title xml:lang="da">Generelt dataformat</dcterms:title>
        <dcterms:description xml:lang="da">Beskriver de(t) fysiske format(er) af den data, der er dokumenteret i de(t) logiske produkt(er) i en studieenhed.</dcterms:description>
        <dcterms:title xml:lang="da">Generelt dataformat</dcterms:title>
        <dcterms:description xml:lang="da">Beskriver de(t) fysiske format(er) af den data, der er dokumenteret i de(t) logiske produkt(er) i en studieenhed.</dcterms:description>
        <dcterms:title xml:lang="da">Type af data format</dcterms:title>
        <dcterms:description xml:lang="da">Beskriver de(t) fysiske format(er) af den data, der er dokumenteret i de(t) logiske produkt(er) i en studieenhed.</dcterms:description>
        <dcterms:title xml:lang="de">Art des Datenformats</dcterms:title>
        <dcterms:description xml:lang="de">Beschreibt das Format der Daten, die in den logischen Produkten einer Studie dokumentiert sind.</dcterms:description>
        <dcterms:title xml:lang="de">Art des Datenformats</dcterms:title>
        <dcterms:description xml:lang="de">Beschreibt das Format der Daten, die in den logischen Produkten einer Studie dokumentiert sind.</dcterms:description>
...

The dcterms:title and dcterms:description elements are listed at least twice for each language. These duplicates are not present in the JSON representation.
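A check like the one above could be automated roughly as follows. This is a sketch, not the project's actual test: `find_duplicates` and `SAMPLE` are hypothetical names, the sample is a shortened version of the excerpt above, and a child element is treated as a duplicate when its tag, `xml:lang`, `rdf:resource` and text all match an earlier sibling.

```python
import xml.etree.ElementTree as ET
from collections import Counter

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def find_duplicates(rdf_xml: str):
    """Return (about, tag, lang, text, count) for every repeated child element."""
    root = ET.fromstring(rdf_xml)
    duplicates = []
    for description in root.iter(f"{{{RDF_NS}}}Description"):
        about = description.get(f"{{{RDF_NS}}}about")
        # Count identical children: same tag, language, resource and text.
        counts = Counter(
            (
                child.tag,
                child.get(XML_LANG),
                child.get(f"{{{RDF_NS}}}resource"),
                (child.text or "").strip(),
            )
            for child in description
        )
        for (tag, lang, _resource, text), count in counts.items():
            if count > 1:
                duplicates.append((about, tag, lang, text, count))
    return duplicates


# Shortened sample based on the excerpt in this issue (assumption: the real
# response wraps the descriptions in an rdf:RDF root element).
SAMPLE = """<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dcterms="http://purl.org/dc/terms/">
  <rdf:Description rdf:about="http://rdf-vocabulary.ddialliance.org/cv/GeneralDataFormat/2.0.3/">
    <dcterms:title xml:lang="en">General Data Format</dcterms:title>
    <dcterms:title xml:lang="en">General Data Format</dcterms:title>
    <dcterms:title xml:lang="da">Generelt dataformat</dcterms:title>
  </rdf:Description>
</rdf:RDF>"""

if __name__ == "__main__":
    for about, tag, lang, text, count in find_duplicates(SAMPLE):
        print(f"{count}x {tag} [{lang}] {text!r} in {about}")
```

An empty result for a fixed response would indicate that no element is repeated verbatim within a single `rdf:Description`.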

john-shepherdson commented 1 year ago

Moved to milestone 3.3.0, as the scope of 3.2.0 was reduced at today's Sprint meeting, where the release date for the latter was set to 09/05/2023.

john-shepherdson commented 1 year ago

Assigned to MO Tech in case they have the bandwidth to fix this in time for the 3.2.0 release; otherwise it will be reassigned to the Technical Maintainer after that date.

Joshocan commented 1 year ago

@OliverHopt A fix has been pushed. Please check in dev or staging whether it is resolved: curl -X 'GET' 'https://vocabularies-dev.cessda.eu/v2/vocabularies/GeneralDataFormat/2.0.3' -H 'accept: application/xml'

OliverHopt commented 1 year ago

Resolved by the fix. No more duplicates :-)

Thanks