CLARIAH / clariah-plus

This is the project planning repository for the CLARIAH-PLUS project. It groups all technical documents and discussions pertaining to CLARIAH-PLUS in a central place and should facilitate findability, transparency and project planning for the project as a whole.

Define API(s) that expose dataset descriptions #67

Open ddeboer opened 2 years ago

ddeboer commented 2 years ago

The API(s) should support searching and retrieving dataset descriptions.

If possible, Ineo becomes a client of this API. If not, a CLARIAH Dataset Registry adapter will have to push into Ineo’s database. But keep in mind that Ineo is not the sole client of the Dataset Registry (for example when scaling up to European dataset indexes, as @femmynine remarked).
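
As a rough illustration (not a design decision), the search/retrieve part of such an API could look something like the sketch below; the endpoint names and record fields are made up:

```python
# Minimal sketch of a possible Dataset Registry API (hypothetical endpoint
# names and fields; only to make the search/retrieve requirement concrete).
from fastapi import FastAPI, HTTPException

app = FastAPI(title="CLARIAH Dataset Registry (sketch)")

# In-memory stand-in for the real registry backend.
DATASETS = {
    "example-dataset": {
        "id": "example-dataset",
        "name": "Example dataset",
        "description": "Placeholder dataset description.",
    }
}

@app.get("/datasets")
def search_datasets(q: str = ""):
    """Search dataset descriptions by a free-text query."""
    hits = [d for d in DATASETS.values()
            if q.lower() in (d["name"] + " " + d["description"]).lower()]
    return {"count": len(hits), "results": hits}

@app.get("/datasets/{dataset_id}")
def get_dataset(dataset_id: str):
    """Retrieve a single dataset description by its identifier."""
    if dataset_id not in DATASETS:
        raise HTTPException(status_code=404, detail="Dataset not found")
    return DATASETS[dataset_id]
```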

jblom commented 2 years ago

In addition to search, I would also suggest providing a simple(r) export endpoint (with minimal filter options) where clients can download (a part of) the dataset registry as a file (JSON, various LD formats), so it becomes really easy for Ineo (and other clients) to load the registry into their local database/filesystem.

I would definitely not go for implementing an adapter that pushes to Ineo (or any other client, of course): it is highly unsustainable. For Ineo it would also mean having to expose an API of its own, which is more difficult/risky than writing a simple import script (that runs occasionally).
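
For illustration, such an occasional import script could be as simple as the sketch below; the export URL and record fields are made up:

```python
# Sketch of a periodic import script (e.g. run from cron) that pulls the
# registry export and stores it locally. The export URL and record fields
# are hypothetical.
import json
import sqlite3
import requests

EXPORT_URL = "https://registry.example.org/export"  # hypothetical endpoint

def run_import(db_path: str = "datasets.db") -> None:
    resp = requests.get(EXPORT_URL, params={"format": "json"}, timeout=60)
    resp.raise_for_status()
    records = resp.json()

    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS datasets (id TEXT PRIMARY KEY, body TEXT)")
    for rec in records:
        con.execute(
            "INSERT OR REPLACE INTO datasets (id, body) VALUES (?, ?)",
            (rec["id"], json.dumps(rec)),
        )
    con.commit()
    con.close()

if __name__ == "__main__":
    run_import()
```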

menzowindhouwer commented 2 years ago

If we go for an existing repository/registry, e.g. CKAN, Dataverse or Fedora, there will generally already be an API ...
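
For example, CKAN's Action API already covers search and retrieval out of the box; something along these lines (the instance URL is just a placeholder):

```python
# Querying an existing CKAN instance via its Action API
# (https://<ckan-instance>/api/3/action/...). The instance URL is a placeholder.
import requests

CKAN = "https://demo.ckan.org"  # placeholder instance

# Free-text search over dataset (package) descriptions.
resp = requests.get(f"{CKAN}/api/3/action/package_search",
                    params={"q": "language corpus", "rows": 5}, timeout=30)
resp.raise_for_status()
for pkg in resp.json()["result"]["results"]:
    print(pkg["name"], "-", pkg.get("title"))

# Retrieve a single dataset description by id/name.
resp = requests.get(f"{CKAN}/api/3/action/package_show",
                    params={"id": "some-dataset-name"}, timeout=30)
```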

jblom commented 2 years ago

@menzowindhouwer @ddeboer yes... very good point: it seems we first need to investigate which of these existing registries is already suitable for our needs. Should I create an issue for that? (can't seem to find one)

Amongst other things (such as good support), the system should at least have a good API so clients can easily search/retrieve datasets. It would also be great if the system can be extended with a way of publishing to different dataset models (e.g. based on schema.org, DCAT, or the NDE dataset model for cultural heritage collections).
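
To make the "different dataset models" idea concrete, here is a sketch of mapping an internal record to a schema.org Dataset in JSON-LD; the internal field names are made up, and DCAT or NDE would simply be additional mappings of the same record:

```python
# Sketch: serialise an internal dataset record as a schema.org Dataset
# in JSON-LD. The internal field names are hypothetical; a DCAT or NDE
# serialisation would be a second mapping of the same record.
import json

def to_schema_org(record: dict) -> str:
    doc = {
        "@context": "https://schema.org",
        "@type": "Dataset",
        "@id": record["id"],
        "name": record["name"],
        "description": record["description"],
        "license": record.get("license"),
        "distribution": [
            {
                "@type": "DataDownload",
                "contentUrl": dist["url"],
                "encodingFormat": dist.get("media_type"),
            }
            for dist in record.get("distributions", [])
        ],
    }
    return json.dumps(doc, indent=2)
```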

It would be nice if we could avoid building a registry ourselves, since we do not have that much time.

In the B&G/Media Suite search API (still a private repo; we hope to open it up soon) we also have a collection_registry endpoint that can harvest collections from CKAN (or any other registry, if you implement a plugin). It should not be too hard to extend that endpoint with a way to generate different publishing models as well (and to implement content negotiation by profile, which @wmelder also implemented for our B&G open datasets endpoint).
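
For reference, on the client side content negotiation by profile could look roughly like this, assuming the Accept-Profile header approach from the W3C Content Negotiation by Profile work; the URL and profile URIs are only illustrative:

```python
# Sketch of a client asking one endpoint for the same dataset description
# in different profiles (DCAT vs. schema.org), assuming the server honours
# the Accept / Accept-Profile headers. URL and profile URIs are illustrative.
import requests

URL = "https://registry.example.org/datasets/example-dataset"  # hypothetical

dcat = requests.get(URL, headers={
    "Accept": "text/turtle",
    "Accept-Profile": "<http://www.w3.org/ns/dcat#>",
}, timeout=30)

schema_org = requests.get(URL, headers={
    "Accept": "application/ld+json",
    "Accept-Profile": "<https://schema.org/>",
}, timeout=30)
```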

menzowindhouwer commented 2 years ago

@jblom created #90 to discuss existing registries or using a faceted index.

wmelder commented 2 years ago

We use a very old version of CKAN for internal use in the Media Suite, and use the simple API it provides, but that API does not seem adequate for CLARIAH purposes. Newer CKAN versions, including the DCAT extension, seem promising: they offer dataset endpoints and harvesting facilities. None of this has been installed or tested on the B&G side yet, nor do we participate in CKAN development. Currently, we provide open dataset descriptions in schema.org on our beng-lod-server; it is used only for open datasets published by B&G.
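
If I read the ckanext-dcat documentation correctly, fetching DCAT serialisations would look roughly like this; the instance URL is a placeholder and the exact endpoint paths are an assumption that may differ per CKAN/extension version:

```python
# Sketch of fetching DCAT serialisations from a CKAN instance with the
# DCAT extension (ckanext-dcat) enabled. The instance URL is a placeholder
# and the endpoint paths are an assumption based on the extension's docs.
import requests

CKAN = "https://demo.ckan.org"  # placeholder instance

# Whole catalogue as DCAT (Turtle).
catalog = requests.get(f"{CKAN}/catalog.ttl", timeout=30)

# A single dataset description as DCAT (JSON-LD).
dataset = requests.get(f"{CKAN}/dataset/some-dataset-name.jsonld", timeout=30)
```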