cessda / cessda.cvs.two

Concept-level URIs #455

Closed cessda-bitbucket-importer closed 11 months ago

cessda-bitbucket-importer commented 1 year ago

Original report on BitBucket by Taina Jääskeläinen.


Concept level URIs: These are 7-character alpha-numeric strings and will be produced for the SKOS/RDF exports for DDI vocabularies in the DDI Alliance website.

DDI 2.6 will have vocabInstanceURI, i.e. the concept level URI.

The Concept URI pattern is: http://rdf-vocabulary.ddialliance.org/cv/<CV_SHORT_NAME>/<VERSION_NUMBER>/<7-character-alphanumeric_id>, e.g. http://trdf-vocabulary.ddialliance.org/cv/AnalysisUnit/2.1/d56e194 (Concept ID d56e194 with label “OrganizationOrInstitution”).
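As an illustration of the pattern above, here is a minimal Python sketch that assembles a concept-level URI from its three parts. The function name `concept_uri` and the ID check are assumptions made for this example, not part of CVS or the DDI pipeline; the check only enforces the "7-character alphanumeric" rule stated in this report.

```python
import re

# Base taken from the URI pattern quoted in the report.
BASE = "http://rdf-vocabulary.ddialliance.org/cv"

# Assumption: "7-character alphanumeric" means exactly 7 ASCII letters or digits.
ID_PATTERN = re.compile(r"[0-9A-Za-z]{7}")


def concept_uri(cv_short_name: str, version: str, concept_id: str) -> str:
    """Build <BASE>/<CV_SHORT_NAME>/<VERSION_NUMBER>/<concept_id>."""
    if not ID_PATTERN.fullmatch(concept_id):
        raise ValueError(f"not a 7-character alphanumeric ID: {concept_id!r}")
    return f"{BASE}/{cv_short_name}/{version}/{concept_id}"


print(concept_uri("AnalysisUnit", "2.1", "d56e194"))
```

A natural-language fragment such as `#Sum` is rejected by the check, which is the distinction between the two URI styles discussed in this issue.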

Also check what is available/visible in the DDI controlled vocabularies pages.

These concept-level URIs will be produced only for the DDI Alliance SKOS/RDF in the CVS-DDI pipeline; they are not in the CVS SKOS/RDF.

When concept-level URIs will be needed in metadata is to be discussed.

Will leave this issue for Maja to consider (in 2023 or later) if and when concept-level URIs should be visible in CVS.

CVS API will not have them even for DDI vocabularies unless there is a backward pipeline.

MajaDolinar commented 1 year ago

DDI issue: “the fundamental problem has always been that the RDF URIs in CVS (see attached example) are all natural language URLs e.g. http://rdf-vocabulary.ddialliance.org/cv/AggregationMethod/1.1.2/#Sum . Critically, these URIs don’t resolve to anything and never have done, so we have a CESSDA CV system that is not interoperable in any meaningful sense. Additionally, these concept URIs should use persistent IDs, not natural language labels. I guess it’s tolerable for the CV itself to have a natural language URI e.g. http://rdf-vocabulary.ddialliance.org/cv/AggregationMethod/1.1.2/ but it’s still not to be encouraged. This is what I communicated to Carsten back in October 2021 but I think it’s still not made its way back properly to the implementers. We have the URIs with appropriate content negotiation working now at https://testrdf-vocabulary.ddialliance.org/cv/AggregationMethod/1.1/d35e61 (referencing the above example) but concept IDs are not currently persisted in the CVS database in an optimal way (Oliver cc’d) to support this seamlessly.”

See related materials in https://drive.google.com/drive/folders/1RpIU1L9E56N-aYGfTVjSVlnTy5uMp_PZ?usp=share_link

MajaDolinar commented 1 year ago

From the meeting on May 24: Each concept has a unique ID in the database. We want a consistent ID across versions, but currently a new ID is assigned to each concept with every new version. Could we keep the same ID for a concept across versions? One option is to reuse the ID the concept received when it was first created whenever a new version is created. Stefan needs to look into the DB and find the best solution. No UI change is needed, only a DB change, which would then be carried into the export.

If the code of a concept changes but there is a decision to keep the ID, then we need to go into the source and readjust it. Such a case is unlikely, so it is not clear that this is really necessary. The same applies to storing this in the DB: when you change the concept code, you need to decide whether to keep or change the ID, and then intervene in the DB. Deprecating the concept would be a better solution in this case. It is a decision that needs to be made by the curator of the CV.

Stifo commented 1 year ago

notes:

@darrenbell2 what algorithm do you use to generate IDs for exported concepts at present? I assume you replace the #{notation} strings (e.g. #Sum) in URIs with an alphanumeric ID of length 7 on the DDI side.

OliverHopt commented 1 year ago

The current method of assigning UUIDs to concepts is performed on the concept node during an XSL transformation. It is only run for a new concept. If the concept already exists in a previous version of the concept scheme / vocabulary, its ID is reused.

From my perspective, a hash of the code / notation string would be sufficient, although this would make "re-coding" a concept impossible. We discussed in the meeting last week that this would be not just rare but non-existent, because a re-coding would imply deprecating the existing concept and creating a new one.
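The reuse rule described above (keep the ID from a previous version, generate one only for new concepts) can be sketched roughly as follows. This is not the XSL transformation itself; `assign_concept_ids` and `hash_id` are hypothetical names, and the truncated MD5 digest merely stands in for "a hash of the notation string".

```python
import hashlib
from typing import Dict, Iterable


def hash_id(notation: str) -> str:
    # Assumption: a 7-character ID derived from a hash of the notation.
    return hashlib.md5(notation.encode("utf-8")).hexdigest()[:7]


def assign_concept_ids(notations: Iterable[str],
                       previous_ids: Dict[str, str]) -> Dict[str, str]:
    """Map each notation to an ID, reusing the previous version's ID if any."""
    ids = {}
    for notation in notations:
        if notation in previous_ids:
            ids[notation] = previous_ids[notation]  # existing concept: keep ID
        else:
            ids[notation] = hash_id(notation)       # new concept: generate ID
    return ids


# "Sum" existed in the previous version, "Mean" is new in this one.
print(assign_concept_ids(["Sum", "Mean"], {"Sum": "d35e61"}))
```

Because the lookup key is the notation, a re-coded concept would be treated as new, which matches the point that re-coding amounts to deprecating the old concept and creating another.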

@darrenbell2 what would be your point of view?

And we might ask the "clients" :-)

darrenbell2 commented 1 year ago

Hi Stefan – thanks for getting back to us on this. I have cc’d in Oliver Hopt at GESIS re the identifier question below, as he extracts the information before publishing into BitBucket, at which point we ingest it into Apache Fuseki. Thanks, Darren

darrenbell2 commented 1 year ago

Sorry all, I have seen that Oliver has already replied. Thanks, Darren

Stifo commented 1 year ago

@OliverHopt @darrenbell2 thanks, but I was asking about the algorithm that generates the IDs. I assume DDI would rather keep using the current one, so that the IDs stay consistent once the UIDs are generated inside the CVS SKOS export routine. I've already prepared some code based on the MD5 hashing function, e.g. MD5(code.notation)[:7]. A separate export-ddi_rdf.xml template has been added, which converts all the concept.notation attributes to UIDs.

OliverHopt commented 1 year ago

@Stifo actually we would need to look into the source code of Saxon HE 9 to fully answer the question about the algorithm being used. According to https://www.oreilly.com/library/view/xslt/0596000537/re48.html the method is defined in the XSL standard only to guarantee that the generated ID remains the same for a given node within a single run of the transformation.

Therefore I already made sure that I reuse the IDs from previous transformations, if available. If CVS uses its own generating mechanism in the future, this will definitely change the IDs being assigned. In that case, we would need to wait before going into production with the DDI Alliance service.

From my perspective, the choice is either to make the change in CVS quickly, so that the ID scheme is your choice, or to extend the DB schema of CVS to store the ID assigned by the transformation.

Stifo commented 1 year ago

Initial implementation of generating code IDs in URIs of SKOS exports for DDI Alliance

When exporting SKOS, if the name of the vocabulary's alliance is "DDI Alliance", a separate DDI template, export-ddi_rdf.xml, is chosen. This template replaces all code.notation occurrences with generated hash codes, e.g. code.generateHash('md5', code.notation, 7). The MD5 algorithm is used and the digest is truncated to the first 7 characters to match the requested ID length. If the length is set to 0, no truncation is performed. Other hashing algorithms are also available: md2, sha1, sha2, sha256. New hashing algorithms can be added here. Also, an Agency's URI codes can be set to use the generated ID in place of code.notation. To set this up, edit the URI code in the CVS App under Agency → Edit (dev).
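To illustrate the behaviour described above, here is a rough Python equivalent of a call like code.generateHash('md5', code.notation, 7). It is a sketch, not the CVS implementation: `generate_hash` and `find_collisions` are hypothetical names, only algorithms known to Python's `hashlib` are accepted, and the hex digest is assumed to be what gets truncated.

```python
import hashlib
from typing import Iterable, List, Tuple


def generate_hash(algorithm: str, value: str, length: int = 7) -> str:
    """Hex digest of value, truncated to `length` characters (0 = no truncation)."""
    digest = hashlib.new(algorithm, value.encode("utf-8")).hexdigest()
    return digest if length == 0 else digest[:length]


def find_collisions(notations: Iterable[str], algorithm: str = "md5",
                    length: int = 7) -> List[Tuple[str, str, str]]:
    """Return (first, second, id) for distinct notations sharing a truncated ID."""
    seen = {}
    collisions = []
    for notation in notations:
        short_id = generate_hash(algorithm, notation, length)
        first = seen.setdefault(short_id, notation)
        if first != notation:
            collisions.append((first, notation, short_id))
    return collisions


print(generate_hash("md5", "Sum", 7))          # 7-character ID
print(len(generate_hash("sha256", "Sum", 0)))  # full digest length
```

Seven hexadecimal characters carry 28 bits, so two different notations can in principle map to the same ID; a check like `find_collisions` over the codes of one vocabulary is a cheap safeguard before publishing.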

darrenbell2 commented 1 year ago

Hi all - I have checked some example SKOS output for https://vocabularies-staging.cessda.eu/vocabulary/AggregationMethod?lang=en for version 1.2.2. in Danish. A small number of minor things still to correct but I think we're nearly there:

(1) ConceptScheme URI should have trailing slash i.e.

should read

(2) ConceptScheme has a dcterms:title but does not have a skos:prefLabel - which is required by the W3C spec.

(3) skos:notation should, strictly speaking, have an xml:lang attribute, even though it will always be "en".

I'll run some additional SKOS validators later but otherwise, it's looking a lot better now,

Many thanks, Darren

MajaDolinar commented 1 year ago

@Stifo please have a look at Darren's comment.

Stifo commented 1 year ago

Resolved (1), (2), and (3). changes will be available in the dev and staging soon.

darrenbell2 commented 1 year ago

@Stifo, just looked at staging. Apologies if you haven't deployed yet.

(1) ConceptScheme URIs still inconsistent, e.g. examples selected randomly: https://vocabularies-staging.cessda.eu/vocabulary/AggregationMethod?lang=en has export with [there should not be a version number here] whereas https://vocabularies-staging.cessda.eu/vocabulary/TypeOfConceptGroup?lang=en has export of https://vocabularies-staging.cessda.eu/vocabulary/TypeOfConceptGroup?lang=en [no version number which is correct, but still has no trailing slash]

(2) ConceptScheme does not have a skos:prefLabel. I'm still not seeing a skos:prefLabel for the ConceptScheme in multiple examples.

(3) skos:notation should, strictly speaking, have an xml:lang attribute. I'm still not seeing an xml:lang attribute for skos:notation in multiple examples.

Stifo commented 1 year ago

@darrenbell2 the deployment is stuck for both dev and staging. Dev does not even work and gives a 503 error response.

@matthew-morris-cessda would you please take a look at the dev/staging deployment?

matthew-morris-cessda commented 1 year ago

Fixed. We're in the middle of migrating Docker repositories and an invalid configuration snuck in.

Stifo commented 1 year ago

thanks @matthew-morris-cessda! @darrenbell2, it's ready in dev/staging, please, take a look, thanks.

darrenbell2 commented 12 months ago

Checked a few random SKOS exports in different languages on Staging

(1) ConceptScheme URIs appear to be showing trailing slashes now - consider resolved

(2) ConceptScheme now showing skos:prefLabel - consider resolved

(3) skos:notation has xml:lang attribute - consider resolved

So, looking good so far. Many thanks, Darren @OliverHopt tagged
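For anyone who wants to repeat the three checks from this thread on an RDF/XML export, a minimal sketch using Python's standard library might look like the following. The function name `check_export` and the sample document (with example.org URIs) are made up for illustration; this is not a full SKOS validator and does not reflect the exact shape of a CVS export.

```python
import xml.etree.ElementTree as ET
from typing import List

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
SKOS = "http://www.w3.org/2004/02/skos/core#"
XML = "http://www.w3.org/XML/1998/namespace"


def check_export(rdf_xml: str) -> List[str]:
    """Report which of the three points raised in this thread are violated."""
    root = ET.fromstring(rdf_xml)
    problems = []
    for scheme in root.iter(f"{{{SKOS}}}ConceptScheme"):
        uri = scheme.get(f"{{{RDF}}}about", "")
        if not uri.endswith("/"):
            problems.append(f"(1) ConceptScheme URI has no trailing slash: {uri}")
        if scheme.find(f"{{{SKOS}}}prefLabel") is None:
            problems.append(f"(2) ConceptScheme has no skos:prefLabel: {uri}")
    for notation in root.iter(f"{{{SKOS}}}notation"):
        if f"{{{XML}}}lang" not in notation.attrib:
            problems.append(f"(3) skos:notation has no xml:lang: {notation.text}")
    return problems


# Hypothetical sample export, invented for this example.
SAMPLE = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:skos="{SKOS}">
  <skos:ConceptScheme rdf:about="http://example.org/cv/AggregationMethod/">
    <skos:prefLabel xml:lang="en">Aggregation Method</skos:prefLabel>
  </skos:ConceptScheme>
  <skos:Concept rdf:about="http://example.org/cv/AggregationMethod/1.1/d35e61">
    <skos:notation xml:lang="en">Sum</skos:notation>
  </skos:Concept>
</rdf:RDF>"""

print(check_export(SAMPLE))
```

An empty list means none of the three issues was found in the document passed in.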

MajaDolinar commented 11 months ago

Issue resolved. Ready for release.