CLARIAH / clariah-plus

This is the project planning repository for the CLARIAH-PLUS project. It groups all technical documents and discussions pertaining to CLARIAH-PLUS in a central place and should facilitate findability, transparency and project planning, for the project as a whole.

Connect harvester to NDE Dataset Register #97

Open ddeboer opened 2 years ago

ddeboer commented 2 years ago

The NDE Register will be used for (at the very least) B&G (#96) and KB.

Please find an example query here. Replace the

BIND (<http://data.bibliotheken.nl/id/thes/p075301482> as ?publisher)

with the publisher you want to retrieve datasets for. For a list of publishers, see this query (its link ends in: WHERE { ?s dct:publisher ?o . ?o foaf:name ?name } GROUP BY ?o ?name ORDER BY DESC(?count)).
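As an illustration of the replacement step described above, here is a minimal Python sketch that swaps the IRI inside the BIND clause. The function name `with_publisher`, the shortened `EXAMPLE_QUERY` and the example.org IRI are hypothetical stand-ins, not part of the NDE Dataset Register or the CLARIAH Harvester.

```python
import re

# Matches the clause from the example query: BIND (<IRI> as ?publisher)
BIND_PUBLISHER = re.compile(
    r"BIND\s*\(\s*<[^>]*>\s+as\s+\?publisher\s*\)", re.IGNORECASE
)

# Characters that may not appear inside a SPARQL IRI reference (<...>).
FORBIDDEN = set('<>"{}|^`\\')


def with_publisher(query: str, publisher_iri: str) -> str:
    """Return `query` with the publisher IRI in its BIND clause replaced."""
    if any(ch in FORBIDDEN or ch <= " " for ch in publisher_iri):
        raise ValueError(f"not usable as an IRI: {publisher_iri!r}")
    # A function replacement avoids backslash handling in the IRI.
    new_query, count = BIND_PUBLISHER.subn(
        lambda _match: f"BIND (<{publisher_iri}> as ?publisher)", query
    )
    if count != 1:
        raise ValueError(f"expected exactly one publisher BIND, found {count}")
    return new_query


# Shortened, hypothetical stand-in for the linked example query.
EXAMPLE_QUERY = (
    "SELECT * WHERE { "
    "BIND (<http://data.bibliotheken.nl/id/thes/p075301482> as ?publisher) "
    "?dataset dct:publisher ?publisher }"
)

updated = with_publisher(EXAMPLE_QUERY, "https://example.org/publisher")
```

Rejecting IRIs that contain angle brackets or whitespace keeps a malformed publisher value from changing the structure of the query.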

Semantics of the query arguments are described at https://github.com/netwerk-digitaal-erfgoed/dataset-register#dcatdataset and based on the Requirements for Datasets.

You can also have a look at the NDE Dataset Register website for examples.

ddeboer commented 2 years ago

During today’s tech day, we discussed the idea of having a preparatory SPARQL query that returns a list of provider URIs to use in the regular query. On the side of the NDE Dataset Register we can add a predicate to datasets that should be included in the CLARIAH Registry. To keep things standardised, NDE then provides the CLARIAH Harvester with a SPARQL query that selects on that predicate.

This is similar to the <registry url=""> that the Harvester already supports for a URL that provides a list of OAI-PMH endpoints. Perhaps something like <registry query="SELECT ?uri WHERE { ?uri a dcat:Dataset ; <custom:predicate> includeInClariah . }">.
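The two-step idea could look roughly like the Python sketch below: one preparatory query yields URIs, and the regular query then runs once per URI. Everything here is an assumption for illustration: `harvest`, `run_query` and `regular_query_for` are hypothetical names, the predicate in `REGISTRY_QUERY` is the placeholder from the comment above, and the endpoint is faked so the example runs offline.

```python
from typing import Callable, Dict, Iterable, List

# Preparatory query taken from the <registry query="..."> suggestion;
# the predicate and its value are placeholders, not an agreed vocabulary.
REGISTRY_QUERY = (
    "SELECT ?uri WHERE { ?uri a dcat:Dataset ; "
    "<custom:predicate> includeInClariah . }"
)


def harvest(
    run_query: Callable[[str], Iterable[Dict[str, str]]],
    regular_query_for: Callable[[str], str],
) -> Dict[str, List[Dict[str, str]]]:
    """Run the preparatory query, then the regular query once per URI."""
    results: Dict[str, List[Dict[str, str]]] = {}
    for row in run_query(REGISTRY_QUERY):
        uri = row["uri"]
        results[uri] = list(run_query(regular_query_for(uri)))
    return results


def fake_run_query(query: str) -> List[Dict[str, str]]:
    """Stand-in for a real SPARQL endpoint call (hypothetical data)."""
    if query == REGISTRY_QUERY:
        return [{"uri": "https://example.org/a"}, {"uri": "https://example.org/b"}]
    return [{"query": query}]  # echo the query so the flow is visible


harvested = harvest(fake_run_query, lambda uri: f"DESCRIBE <{uri}>")
```

Passing the endpoint call in as a function keeps the selection logic independent of how the Harvester actually talks to the register.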

ddeboer commented 2 years ago

@menzowindhouwer As discussed, I’ve now changed the example query to a CONSTRUCT, allowing you to get its results as a single RDF graph per dataset rather than (duplicated) SELECT result bindings.
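To illustrate the duplication mentioned above: with SELECT, a dataset that has several values for one variable comes back as several rows, so a consumer has to merge them again. The sketch below shows that merging step in Python; `merge_bindings` and the sample rows are hypothetical and only mimic the shape of SELECT results.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def merge_bindings(rows: Iterable[Dict[str, str]]) -> Dict[str, Dict[str, List[str]]]:
    """Collapse repeated SELECT rows into one record per dataset."""
    merged = defaultdict(lambda: defaultdict(set))
    for row in rows:
        dataset = row["dataset"]
        for variable, value in row.items():
            if variable != "dataset":
                merged[dataset][variable].add(value)
    # Sort the collected values so the output is deterministic.
    return {
        dataset: {variable: sorted(values) for variable, values in fields.items()}
        for dataset, fields in merged.items()
    }


# Two keywords on one dataset produce two rows that repeat the title.
rows = [
    {"dataset": "https://example.org/d1", "title": "Example", "keyword": "film"},
    {"dataset": "https://example.org/d1", "title": "Example", "keyword": "theatre"},
]

records = merge_bindings(rows)
```

A CONSTRUCT query sidesteps this step because each triple appears once in the returned graph.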

ddeboer commented 7 months ago

@menzowindhouwer @vicding-mi Can you elaborate on how you select datasets from the NDE Dataset Register for inclusion in the CLARIAH one? If I remember correctly, you do so on the level of the dataset’s publisher. If so, we want to add more publishers to that list, including https://uba.uva.nl, as requested by @LvanWissen.

LvanWissen commented 7 months ago

On the other hand, I also see that not all datasets published by https://uba.uva.nl/ are relevant for CLARIAH. Some of the datasets in there are created in research projects, such as ECARTICO, OnStage and Cinema Context, and are relevant. The main collection can, for instance, stay only in the NDE register.

A more advanced filter would look not only at the publisher, but possibly also at the creator/contributor (and their ORCID or ROR identifiers).
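A filter along those lines might be sketched as follows. This is only an assumption about how such a rule could be expressed: the `Dataset` class, the `is_relevant` function and all identifiers (including the all-zero ORCID) are hypothetical placeholders.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class Dataset:
    uri: str
    publisher: str
    # ORCID or ROR identifiers of creators and contributors.
    creators: FrozenSet[str] = frozenset()


def is_relevant(
    dataset: Dataset,
    publishers: FrozenSet[str],
    creators: FrozenSet[str],
) -> bool:
    """Include a dataset if its publisher is listed as a whole,
    or if at least one creator/contributor is listed."""
    return dataset.publisher in publishers or bool(dataset.creators & creators)


# Placeholder allowlists; no real selection is implied.
PUBLISHERS = frozenset({"https://example.org/publisher"})
CREATORS = frozenset({"https://orcid.org/0000-0000-0000-0000"})

project_dataset = Dataset(
    uri="https://example.org/project-dataset",
    publisher="https://uba.uva.nl/",
    creators=frozenset({"https://orcid.org/0000-0000-0000-0000"}),
)
main_collection = Dataset(
    uri="https://example.org/main-collection",
    publisher="https://uba.uva.nl/",
)

selected = [
    d.uri
    for d in (project_dataset, main_collection)
    if is_relevant(d, PUBLISHERS, CREATORS)
]
```

In this sketch the research-project dataset is picked up through its creator even though its publisher is not listed as a whole, while the main collection stays out.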

ddeboer commented 7 months ago

@LvanWissen In that case, please see if https://github.com/netwerk-digitaal-erfgoed/dataset-register/issues/483 would solve your use case.

LvanWissen commented 7 months ago

Yes, but that's an 'opt-in' on my side, as it requires an extra attribute in the dataset description. I'd rather see an 'opt-out'. In my opinion, filtering should be done on the harvesting party's side.

vicding-mi commented 7 months ago

The harvest is based on the following SPARQL query; please correct me if I am wrong, @Menzo ;)

PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX owl: <http://www.w3.org/2002/07/owl#>

SELECT * WHERE
{{
  BIND ***@***.***}> as ?publisher)

  ?dataset a dcat:Dataset ;
  dct:title ?title ;
  dct:license ?license ;
  dct:publisher ?publisher .

  OPTIONAL {{ ?dataset dct:description ?description }}
  OPTIONAL {{ ?dataset dcat:keyword ?keyword }}
  OPTIONAL {{ ?dataset dcat:landingPage ?landingPage }}
  OPTIONAL {{ ?dataset dct:source ?source }}
  OPTIONAL {{ ?dataset dct:created ?created }}
  OPTIONAL {{ ?dataset dct:modified ?modified }}
  OPTIONAL {{ ?dataset dct:issued ?published }}
  OPTIONAL {{ ?dataset owl:versionInfo ?version }}

  OPTIONAL {{ ?dataset dcat:distribution ?distribution .
              ?distribution dcat:accessURL ?distribution_url .
           }}
  OPTIONAL {{ ?distribution dcat:mediaType ?distribution_mediaType }}
  OPTIONAL {{ ?distribution dct:format ?distribution_format }}
  OPTIONAL {{ ?distribution dct:issued ?distribution_published }}
  OPTIONAL {{ ?distribution dct:modified ?distribution_modified }}
  OPTIONAL {{ ?distribution dct:description ?distribution_description }}
  OPTIONAL {{ ?distribution dct:license ?distribution_license }}
  OPTIONAL {{ ?distribution dct:title ?distribution_title }}
  OPTIONAL {{ ?distribution dcat:byteSize ?distribution_size }}
}}
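The doubled braces in the query above look like the escapes of a Python format template, in which `{{` and `}}` render as literal braces; that reading is an assumption, not something confirmed in this thread. Under that assumption, a shortened version of the template could be rendered as in the sketch below. The placeholder name `publisher` and the example.org IRI are hypothetical, since the original placeholder is not legible above.

```python
# Shortened, hypothetical version of the harvest query template.
# {{ and }} become literal braces; {publisher} is filled in by str.format.
TEMPLATE = """
PREFIX dcat: <http://www.w3.org/ns/dcat#>
PREFIX dct: <http://purl.org/dc/terms/>

SELECT * WHERE
{{
  BIND (<{publisher}> as ?publisher)

  ?dataset a dcat:Dataset ;
    dct:title ?title ;
    dct:publisher ?publisher .

  OPTIONAL {{ ?dataset dct:description ?description }}
}}
"""

query = TEMPLATE.format(publisher="https://example.org/publisher")
```

After rendering, the query contains single braces only, so it can be sent to a SPARQL endpoint as-is.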

