SSHOC / vocabularies


Ingest TaDiRAH Vocabulary #2

Closed dpancic closed 1 year ago

dpancic commented 4 years ago

In GitLab by @vronk on Dec 6, 2019, 17:20

One of the most widely used vocabularies for classifying resources in DH is TaDiRAH. http://tadirah.dariah.eu/vocab/index.php

It is not readily available as a SKOS dump (which we need for ingest). The SKOS link in the UI only delivers individual chunks (individual concepts, or the top level of the whole concept scheme).

The SPARQL endpoint http://tadirah.dariah.eu/vocab/sparql.php does not work. There is a different SPARQL endpoint: https://vocabularyserver.com/tadirah/en/sparql.php
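For illustration, a full SKOS dump could in principle be requested from a working endpoint with a single CONSTRUCT query. The sketch below only builds the request URL and sends nothing. It assumes the endpoint accepts the standard `query` GET parameter and supports CONSTRUCT, neither of which is confirmed in this thread; `build_dump_url` and `DUMP_QUERY` are hypothetical names.

```python
from urllib.parse import urlencode

# Endpoint taken from the comment above. That it supports CONSTRUCT queries
# via the standard `query` GET parameter is an assumption.
ENDPOINT = "https://vocabularyserver.com/tadirah/en/sparql.php"

# Ask for every triple whose subject is a SKOS concept or concept scheme.
DUMP_QUERY = """
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
CONSTRUCT { ?s ?p ?o }
WHERE {
  { ?s a skos:Concept } UNION { ?s a skos:ConceptScheme }
  ?s ?p ?o .
}
"""


def build_dump_url(endpoint: str = ENDPOINT, query: str = DUMP_QUERY) -> str:
    """Return a GET URL carrying the URL-encoded query in the `query` parameter."""
    return endpoint + "?" + urlencode({"query": query.strip()})


if __name__ == "__main__":
    # Print the URL so it can be inspected or fetched with any HTTP client.
    print(build_dump_url())
```

If the endpoint answers such a request, the response (requested as Turtle or RDF/XML) would be the dump needed for ingest.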

@vronk was working on getting the SKOS.

(notify: @tparkola, @KlausIllmayer, @clara.petitfils, @laureD19, @vronk)

dpancic commented 4 years ago

In GitLab by @vronk on Dec 6, 2019, 17:46

There is an initial skos/ttl by Sotiris (soon to be) committed: https://gitlab.gwdg.de/sshoc/vocabularies/blob/master/vocabularies/tadirah/tadirah.ttl

This seems to be based on the vocabularyserver SPARQL endpoint, so the concepts have non-canonical URLs.

In TaDiRAH it is: http://tadirah.dariah.eu/vocab/?tema=7 In Vocabularyserver: http://vocabularyserver.com/tadirah/en/?tema=6

Neither of which is really a "nice" cool URI.

Moreover, with TaDiRAH this is especially tricky, as there is new activity around it in the context of CLARIAH-D, planning to refurbish TaDiRAH (resource and service). We also proposed the option to host it under https://vocabs.dariah.eu which would be the most consistent, but this is pending. So I guess for now we can work with any of the URL-schemes available and will have to adjust once there is some new stable solution.

I only wonder, @vronk: how do you plan to match the data (e.g. from Tapor) to the concepts? Just on the prefLabels?

dpancic commented 4 years ago

In GitLab by @vronk on Dec 9, 2019, 10:41

The TaDiRAH activities were extracted from the endpoint https://vocabularyserver.com/tadirah/en/sparql.php. I had to make some fixes and rebase the URIs. The vocabulary is available through the PP Thesaurus Manager at https://sshoc.poolparty.biz/Vocabularies/tadirah-activities.html and, as usual, it can be managed through PP. It can also be queried through https://sshoc.poolparty.biz/PoolParty/sparql/Vocabularies. This endpoint is used in the ingestion pipeline to annotate tools with the activity property. For Tapor, values from the field tadirah-goals map directly to the skos:notation of the top concepts of the TaDiRAH activities vocabulary.
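The Tapor mapping described above could look roughly like the following sketch. The notation table is hard-coded and hypothetical (in the pipeline it would come from the SPARQL endpoint), the `Capture` URI is only assumed to follow the same pattern as the `Analysis` one, and `annotate_tool` and the record layout are illustrative names, not the actual ingestion code.

```python
# Hypothetical table: skos:notation of a top concept -> concept URI.
# In the real pipeline these values would be fetched from the SPARQL endpoint.
TOP_CONCEPTS_BY_NOTATION = {
    "Analysis": "https://sshoc.poolparty.biz/Vocabularies/tadirah-activities/Analysis",
    "Capture": "https://sshoc.poolparty.biz/Vocabularies/tadirah-activities/Capture",
}


def annotate_tool(record: dict) -> dict:
    """Copy the record and add an `activity` list of matched concept URIs.

    Values of `tadirah-goals` are compared to the notations verbatim
    (after trimming whitespace); goals without a match are kept separately
    so they can be reviewed instead of being silently dropped.
    """
    matched, unmatched = [], []
    for goal in record.get("tadirah-goals", []):
        uri = TOP_CONCEPTS_BY_NOTATION.get(goal.strip())
        if uri is not None:
            matched.append(uri)
        else:
            unmatched.append(goal)
    return {**record, "activity": matched, "unmatched_goals": unmatched}


if __name__ == "__main__":
    tool = {"label": "Example tool", "tadirah-goals": ["Analysis", "Unknown goal"]}
    print(annotate_tool(tool))
```

Keeping the unmatched values visible makes it easier to spot source data that uses labels the vocabulary does not know.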

dpancic commented 4 years ago

In GitLab by @vronk on Dec 13, 2019, 13:35

There is now also another version of the ActivityType in SKOS/TTL generated by the tadirah people directly (Vicky Dritsou): https://github.com/acdh-oeaw/nemo/blob/master/SO_ActivityTypes_v.1.3(SKOS_version).ttl

It differs from what Sotiris generated: https://github.com/acdh-oeaw/nemo/blob/master/ActivityType_SSHOC_NeMO.diff

Given that they are the authoritative source, we should probably switch to that one.

dpancic commented 4 years ago

In GitLab by @vronk on Dec 13, 2019, 14:51

Agree, but this is the NeMO vocab, correct? The differences I see are that they also include mappings to TaDiRAH that were previously unavailable.

dpancic commented 4 years ago

In GitLab by @vronk on Dec 13, 2019, 19:15

Indeed, you're right. I mixed up those two. :( So the comment belongs to NeMO. #1

dpancic commented 4 years ago

In GitLab by @vronk on Apr 8, 2020, 16:46

closed

dpancic commented 4 years ago

In GitLab by @vronk on May 28, 2020, 11:58

We seem to have some issues with TaDiRAH with respect to identifying/referencing the individual concepts. Currently TaDiRAH is ingested in PoolParty and gets a poolparty.biz identifier, e.g. Research Activities: https://sshoc.poolparty.biz/Vocabularies/tadirah-activities/98241f0e751154ab (using a hash for the concept; as far as I understood, this was introduced recently by Sotiris to circumvent the problem with disallowed characters in the concept label, #4).

In the MP, however, a version is ingested with identifiers in the (original) form: https://sshoc.poolparty.biz/Vocabularies/tadirah-activities/Analysis (i.e. the concept label is part of the URL).

Also, the internal identifier in MP is basically its label.

See also example result from MP-API: https://sshoc-marketplace-api.acdh-dev.oeaw.ac.at/api/concept-search?q=Analysis
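The two identifier styles described above can be contrasted in a small sketch. It is not the PoolParty algorithm: how PoolParty derives its hash is not stated in this thread, so the truncated SHA-1 used here is purely an assumption and will not reproduce identifiers such as 98241f0e751154ab. The label "Data / Metadata" is a made-up example of a label with a problematic character.

```python
import hashlib
from urllib.parse import quote

BASE = "https://sshoc.poolparty.biz/Vocabularies/tadirah-activities/"


def label_uri(label: str) -> str:
    """Label-based identifier: the percent-encoded label is the last path segment."""
    return BASE + quote(label, safe="")


def hashed_uri(label: str) -> str:
    """Hash-based identifier: 16 hex characters derived from the label.

    ASSUMPTION: truncated SHA-1 stands in for whatever PoolParty really uses.
    """
    digest = hashlib.sha1(label.encode("utf-8")).hexdigest()[:16]
    return BASE + digest


if __name__ == "__main__":
    for label in ("Analysis", "Data / Metadata"):  # second label is hypothetical
        print(label_uri(label))
        print(hashed_uri(label))
```

The label-based form stays readable but needs escaping as soon as a label contains characters like `/` or spaces, and it changes when the label is renamed. The hashed form is opaque but always a valid path segment.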

However, the concepts also have their "original" identifiers in the tadirah.dariah.eu namespace, http://tadirah.dariah.eu/vocab/index.php?tema=6, and in the vocabularyserver.com namespace, http://vocabularyserver.com/tadirah/en/?tema=6

And moreover there is work on a new version of TaDiRAH, which will be available under https://vocabs.dariah.eu ...

Furthermore the concepts are being referenced in the source data mostly via labels.

Where does this lead us...? I guess for the lookup from sources we cannot but rely on the labels (as we have nothing better).
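A label-based lookup of this kind might be sketched as follows. The normalisation (Unicode NFKC, case folding, whitespace collapsing) is one possible choice and not something agreed in this thread; the function names and the sample concept list are hypothetical.

```python
import unicodedata


def normalize(label: str) -> str:
    """Fold case, unify Unicode forms and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFKC", label).casefold().split())


def build_label_index(concepts):
    """Map each normalised label to the set of concept URIs carrying it.

    `concepts` is an iterable of (uri, [labels]) pairs, e.g. prefLabel
    plus altLabels per concept.
    """
    index = {}
    for uri, labels in concepts:
        for label in labels:
            index.setdefault(normalize(label), set()).add(uri)
    return index


def lookup(index, label):
    """Return the URI for a label, or None if it is unknown or ambiguous."""
    hits = index.get(normalize(label), set())
    return next(iter(hits)) if len(hits) == 1 else None


if __name__ == "__main__":
    # Hypothetical sample data; real labels would come from the vocabulary.
    concepts = [
        ("https://sshoc.poolparty.biz/Vocabularies/tadirah-activities/Analysis",
         ["Analysis"]),
    ]
    index = build_label_index(concepts)
    print(lookup(index, "  analysis "))
    print(lookup(index, "No such label"))
```

Returning None for ambiguous labels, rather than picking one URI, keeps wrong matches out of the ingested data at the cost of some manual review.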

The question is what the main identifier for the concepts should be, combined with the decision of what we consider the authoritative location/version of the vocabulary.

A specific issue, I believe, is the discussion between @vronk and @tparkola regarding the hashing of the concept URIs in PoolParty.

Any suggestions or comments are more than welcome.