Closed cessda-bitbucket-importer closed 11 months ago
DDI issue: “the fundamental problem has always been that the RDF URIs in CVS (see attached example) are all natural language URLs e.g. http://rdf-vocabulary.ddialliance.org/cv/AggregationMethod/1.1.2/#Sum . Critically, these URIs don’t resolve to anything and never have done, so we have a CESSDA CV system that is not interoperable in any meaningful sense. Additionally, these concept URIs should use persistent IDs, not natural language labels. I guess it’s tolerable for the CV itself to have a natural language URI e.g. http://rdf-vocabulary.ddialliance.org/cv/AggregationMethod/1.1.2/ but it’s still not to be encouraged. This is what I communicated to Carsten back in October 2021 but I think it’s still not made its way back properly to the implementers. We have the URIs with appropriate content negotiation working now at https://testrdf-vocabulary.ddialliance.org/cv/AggregationMethod/1.1/d35e61 (referencing the above example) but concept IDs are not currently persisted in the CVS database in an optimal way (Oliver cc’d) to support this seamlessly.”
See related materials in https://drive.google.com/drive/folders/1RpIU1L9E56N-aYGfTVjSVlnTy5uMp_PZ?usp=share_link
From the meeting on May 24: Each concept has a unique ID in the database. We want a consistent ID across versions; currently a new ID is assigned to each concept with every new version. Could we have the same ID for a concept across versions? Maybe use the ID the concept received when it was first created and carry it over when a new version is created. Stefan needs to look into the DB and find the best solution. No need to change the UI, only the DB; the change would then be carried into the export.
If the code of a concept changes but there is a decision to keep the ID, then we need to somehow go into the source and readjust that. Such a case is unlikely, so it is not clear this is really necessary. Storing this in the DB works the same way: when you change the concept code, you need to decide whether to keep or change the ID, and then intervene in the DB. Deprecating the concept would be a better solution in this case. It is a decision that needs to be made by the curator of the CV.
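The carry-over idea from the meeting can be sketched roughly as follows. This is only an illustration, not the CVS implementation: the function name `assign_ids`, the `kept_renames` parameter, and the use of a random 7-character hex string for new IDs are all assumptions made for the example.

```python
import uuid


def assign_ids(previous, notations, kept_renames=None):
    """Return {notation: concept_id} for a new vocabulary version.

    previous      -- {notation: concept_id} from the prior version (hypothetical shape)
    notations     -- concept codes present in the new version
    kept_renames  -- {new_notation: old_notation} for the rare case where a
                     curator decides a re-coded concept keeps its old ID
    """
    kept_renames = kept_renames or {}
    ids = {}
    for notation in notations:
        # Look the concept up under its old code if the curator kept the ID.
        lookup = kept_renames.get(notation, notation)
        if lookup in previous:
            ids[notation] = previous[lookup]       # reuse the first-assigned ID
        else:
            ids[notation] = uuid.uuid4().hex[:7]   # assumption: any fresh 7-char ID
    return ids


v1 = assign_ids({}, ["Sum", "Mean"])
v2 = assign_ids(v1, ["Sum", "Average", "Count"], kept_renames={"Average": "Mean"})
```

Here `Sum` keeps its ID in `v2`, `Average` inherits the ID of `Mean` by curator decision, and `Count` receives a new ID.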
notes:
@darrenbell2 what algorithm do you use to generate IDs for exported concepts at present? I assume you replace the #{notation} strings (e.g. #Sum) in URIs with an alphanumeric ID of length 7 on the DDI side.
The current method of assigning UUIDs to concepts is:
From my perspective, a hash of the code / notation string would be sufficient, although this would make "re-coding" a concept impossible. We discussed in the meeting last week that re-coding would not just be rare but effectively non-existent, because it would imply deprecating the existing concept and creating a new one.
@darrenbell2 what would be your point of view?
And we might ask the "clients" :-)
Hi Stefan – thanks for getting back to us on this. I have cc’d in Oliver Hopt at GESIS re the identifier question below, as he extracts the information before publishing into BitBucket, at which point we ingest it into Apache Fuseki. Thanks, Darren
From: Stefan Dlugolinsky @.> Sent: 01 June 2023 07:51 To: cessda/cessda.cvs.two @.> Cc: Bell, Darren S @.>; Mention @.> Subject: Re: [cessda/cessda.cvs.two] Concept-level URIs (Issue #455)
Sorry all, I have seen that Oliver has already replied.. Thanks, Darren
From: Bell, Darren S Sent: 01 June 2023 11:58 To: cessda/cessda.cvs.two @.>; cessda/cessda.cvs.two @.>; Hopt, Oliver @.> Cc: Mention @.> Subject: RE: [cessda/cessda.cvs.two] Concept-level URIs (Issue #455)
@OliverHopt @darrenbell2 thanks, but I was asking about the algorithm that generates the IDs. I assume DDI would rather keep using the current one, so that the IDs stay consistent once they are generated inside the CVS SKOS export routine. I've already prepared some code based on the MD5 hashing function, e.g. MD5(code.notation)[:7]. A separate export-ddi_rdf.xml template has been added, which converts all the concept.notation attributes to UIDs.
@Stifo actually we would need to look into the source code of Saxon HE 9 to fully answer the question about the algorithm being used. According to https://www.oreilly.com/library/view/xslt/0596000537/re48.html the method is defined in the XSL standard only to guarantee that the generated ID remains the same for a given node within a single run of the transformation.
Therefore I already made sure that I reuse the IDs from previous transformations, if available. If CVS uses its own generating mechanism in the future, this will definitely change the IDs being assigned. In that case, we would need to wait before going into production with the DDI Alliance service.
From my perspective, the choice is either to make the change in CVS quickly, so that the ID scheme is your choice, or to extend the DB schema of CVS to store the IDs assigned by the transformation.
If exporting SKOS and the name of the vocabulary's alliance is "DDI Alliance", then a separate DDI template, export-ddi_rdf.xml, is chosen. This template replaces all the code.notation occurrences with generated hash codes; e.g. code.generateHash('md5', code.notation, 7). The MD5 algorithm is used, and the result is truncated to the first 7 characters to match the requested ID length. If the length is set to 0, then no truncation is performed. There are also other hashing algorithms available: md2, sha1, sha2, sha256. New hashing algorithms can be added here. Also, the Agency's URI code can be set to generate an ID in place of code.notation. To set it up, edit the URI code in CVS App under Agency → Edit (dev).
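For readers who want to see what the truncated hash looks like in practice, here is a minimal Python sketch of the behaviour described above. It is not the CVS code: the function name `generate_hash` mirrors the template call only loosely, the hex encoding of the digest and the UTF-8 encoding of the notation are assumptions, and only the algorithms that Python's `hashlib` reliably provides are included.

```python
import hashlib

# Assumption: a subset of the algorithms listed in the thread; md2 is left out
# because hashlib does not guarantee it on every platform.
SUPPORTED = {"md5", "sha1", "sha256"}


def generate_hash(algorithm: str, notation: str, length: int = 7) -> str:
    """Hash a concept notation and truncate the hex digest to `length` chars.

    A length of 0 means no truncation, matching the behaviour described above.
    """
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    digest = hashlib.new(algorithm, notation.encode("utf-8")).hexdigest()
    return digest if length == 0 else digest[:length]


concept_id = generate_hash("md5", "Sum", 7)   # 7-character ID for the code "Sum"
full_digest = generate_hash("md5", "Sum", 0)  # untruncated 32-character digest
```

Because the ID depends only on the notation, the same code always maps to the same ID across versions, and a changed code necessarily yields a different ID, which is the trade-off discussed earlier in the thread.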
Hi all - I have checked some example SKOS output for https://vocabularies-staging.cessda.eu/vocabulary/AggregationMethod?lang=en for version 1.2.2. in Danish. A small number of minor things still to correct but I think we're nearly there:
(1) ConceptScheme URI should have trailing slash i.e.
should read
(2) ConceptScheme has a dcterms:title but does not have a skos:prefLabel, which is required by the W3C spec.
(3)
I'll run some additional SKOS validators later, but otherwise it's looking a lot better now.
Many thanks, Darren
@Stifo please have a look at Darren's comment.
Resolved (1), (2), and (3). Changes will be available in dev and staging soon.
@Stifo, just looked at staging. Apologies if you haven't deployed yet.
(1) ConceptScheme URIs
Still inconsistent e.g. examples selected randomly:
https://vocabularies-staging.cessda.eu/vocabulary/AggregationMethod?lang=en has export with
(2) ConceptScheme does not have a skos:prefLabel. I'm still not seeing a skos:prefLabel for the ConceptScheme in multiple examples.
(3) skos:notation should, strictly speaking, have an xml:lang attribute. I'm still not seeing an xml:lang attribute for skos:notation in multiple examples.
@darrenbell2 the deployment is stuck for both dev and staging. Dev does not even work and gives a 503 error response.
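The three checks above could be automated along these lines. This is a rough sketch using only the Python standard library, and it assumes the export is RDF/XML in which the scheme appears as a typed `skos:ConceptScheme` element (not as `rdf:Description` with an `rdf:type`); the function name `check_skos` is made up for the example.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML_LANG = "{http://www.w3.org/XML/1998/namespace}lang"


def check_skos(rdf_xml: str) -> list:
    """Return a list of human-readable problems found in an RDF/XML export."""
    problems = []
    root = ET.fromstring(rdf_xml)

    for scheme in root.iter(f"{{{SKOS}}}ConceptScheme"):
        uri = scheme.get(f"{{{RDF}}}about", "")
        # (1) ConceptScheme URI should end with a trailing slash.
        if not uri.endswith("/"):
            problems.append(f"ConceptScheme URI lacks trailing slash: {uri}")
        # (2) ConceptScheme should carry a skos:prefLabel.
        if scheme.find(f"{{{SKOS}}}prefLabel") is None:
            problems.append(f"ConceptScheme has no skos:prefLabel: {uri}")

    # (3) Every skos:notation should carry an xml:lang attribute.
    for notation in root.iter(f"{{{SKOS}}}notation"):
        if notation.get(XML_LANG) is None:
            problems.append(f"skos:notation without xml:lang: {notation.text}")

    return problems
```

Calling `check_skos` on the text of a downloaded export returns an empty list when all three conditions hold.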
@matthew-morris-cessda would you please take a look at the dev/staging deployment?
Fixed. We're in the middle of migrating Docker repositories and an invalid configuration snuck in.
thanks @matthew-morris-cessda! @darrenbell2, it's ready in dev/staging, please, take a look, thanks.
Checked a few random SKOS exports in different languages on Staging
1 ConceptScheme URIs appear to be showing trailing slashes now - consider resolved
2 ConceptScheme now showing skos:prefLabel - consider resolved
3 skos:notation has xml:lang attribute - consider resolved
So, looking good so far. Many thanks, Darren @OliverHopt tagged
Issue resolved. Ready for release.
Original report on BitBucket by Taina Jääskeläinen.
Concept level URIs: These are 7-character alpha-numeric strings and will be produced for the SKOS/RDF exports for DDI vocabularies in the DDI Alliance website.
DDI 2.6 will have vocabInstanceURI, i.e. the concept level URI.
The vocabulary and each concept/code have a persistent identifier. In practical terms in SKOS, this means that the ConceptScheme object and its child Concept objects will both have a Linked Data URI.
The Concept URI pattern is: http://rdf-vocabulary.ddialliance.org/cv/<CV_SHORT_NAME>/<VERSION_NUMBER>/<7-character-alphanumeric_id> e.g. http://trdf-vocabulary.ddialliance.org/cv/AnalysisUnit/2.1/d56e194 (Concept ID d56e194 with label “OrganizationOrInstitution”).
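As an illustration of the pattern, the following sketch assembles scheme and concept URIs from their parts. The helper names `scheme_uri` and `concept_uri` are hypothetical, and the base URL is taken from the pattern as stated above; the trailing slash on the scheme URI follows the convention requested in the review comments.

```python
BASE = "http://rdf-vocabulary.ddialliance.org/cv"


def scheme_uri(cv_short_name: str, version: str, base: str = BASE) -> str:
    """URI of the ConceptScheme, ending with a trailing slash."""
    return f"{base}/{cv_short_name}/{version}/"


def concept_uri(cv_short_name: str, version: str, concept_id: str,
                base: str = BASE) -> str:
    """URI of a single Concept: the scheme URI followed by the concept ID."""
    return scheme_uri(cv_short_name, version, base) + concept_id


uri = concept_uri("AnalysisUnit", "2.1", "d56e194")
```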
Also check what is available/visible in the DDI controlled vocabularies pages.
These concept-level URIs will be produced only for the DDI Alliance SKOS/RDF in the CVS-DDI pipeline; they are not in the CVS SKOS/RDF.
If there is a need for concept-level URIs in metadata, this is to be discussed.
Will leave this issue for Maja to consider (in 2023 or later) if and when concept-level URIs should be visible in CVS.
CVS API will not have them even for DDI vocabularies unless there is a backward pipeline.