CLARIAH / clariah-plus

This is the project planning repository for the CLARIAH-PLUS project. It groups all technical documents and discussions pertaining to CLARIAH-PLUS in a central place and should facilitate findability, transparency and project planning for the project as a whole.

Implement Ineo export/import for tool discovery #35

Closed: proycon closed this issue 7 months ago

proycon commented 2 years ago

The component is defined in https://github.com/CLARIAH/clariah-plus/blob/main/technical-committee/shared-development-roadmap/epics/shared/fair-tool-discovery.md as follows:

Client using the tool store API (or a direct extension thereof) converting output to a format understood by Ineo, for interoperability with it. Needed if (and only if) we can't make Ineo connect directly to our tool store backend.

Relies on a clear specification of Ineo's YAML import (also requested in #32).

proycon commented 2 years ago

I consider this to be more of an Ineo issue than a Tool Discovery issue. I'd like to delegate the implementation of this to the Ineo folks, as they are best acquainted with their own back-end (Sanity) and the way they want to structure the data for representation.

Ineo should, either directly or via some kind of periodic import function that transforms the data, call our tool store API. The tool store offers rich linked data, with a SPARQL endpoint and JSON-LD or Turtle serialisations per resource. This linked data is the 'end product' from our perspective; of course we will deliver documentation and full support so the data can be clearly understood.
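
To illustrate what such a periodic import could look like from the consumer's side, here is a minimal Python sketch; the `/sparql` endpoint path and the example query are illustrative assumptions rather than the documented interface:

```python
# Minimal sketch of consuming the tool store as linked data. Assumptions: the
# /sparql path, the example query, and the use of the development instance below.
import requests

BASE = "https://tools.dev.clariah.nl"

def fetch_jsonld(resource_uri: str) -> dict:
    """Fetch the JSON-LD serialisation of a single resource via content negotiation."""
    response = requests.get(resource_uri, headers={"Accept": "application/ld+json"})
    response.raise_for_status()
    return response.json()

def sparql_select(query: str, endpoint: str = BASE + "/sparql") -> dict:
    """Run a SPARQL SELECT query using the standard SPARQL protocol over HTTP."""
    response = requests.get(endpoint, params={"query": query},
                            headers={"Accept": "application/sparql-results+json"})
    response.raise_for_status()
    return response.json()

# Example: list tools and their names using schema.org vocabulary.
results = sparql_select("""
    PREFIX schema: <https://schema.org/>
    SELECT ?tool ?name WHERE { ?tool a schema:SoftwareSourceCode ; schema:name ?name . }
    LIMIT 10
""")
```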

(relevant for @Seb-CLARIAH )

Seb-CLARIAH commented 2 years ago

I agree, although they will need information and collaboration from the CLARIAH developers. I will ask Erik van Arendonk from Eight to arrange an online meeting.

proycon commented 2 years ago

We (with @Seb-CLARIAH, @menzowindhouwer, @roelandordelman, @tvermaut) had a meeting this morning about the integration of the harvesting pipelines (Tool Discovery and Data) with Ineo, which is a crucial feature with some urgency. It seems it wasn't entirely clear yet what both parties expected from each other; this needs to be worked out explicitly, especially because an external party is involved that needs to be given a clear mission with clear requirements and expectations.

A further meeting will be arranged to discuss the details; this is in line with what Sebastiaan already announced on April 4th (and which I was essentially waiting for).

As a basis for such a meeting, I'll attempt to formulate here some clear requirements for the Ineo developers about connecting to our tool discovery pipeline, which is the scope of this issue (@menzowindhouwer's data harvesting pipeline will have different requirements and should be described separately in another issue):

We start from these initial observations:

  1. The output of the Tool Discovery pipeline is strictly linked open data
  2. Ineo uses its own (Sanity) backend
  3. Both already make certain assumptions about properties and vocabularies, but both still have a degree of flexibility and certain aspects have not been formally decided yet.

The mission, to be put forward to the Ineo developers, is:

make use of the tool discovery output in an automated fashion at regular intervals to populate Ineo (or at least to populate the so-called right-hand side of Ineo that concerns metadata)

Important principles, from my perspective, are:

This comes down to implementing some kind of conversion software, in whatever way they see fit, that pulls from the Tool Discovery Store and translates the output to whatever format Ineo and its Sanity backend require.
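
A rough sketch, in Python, of what that conversion step could look like; the target field names used here are hypothetical placeholders, not Ineo's actual schema:

```python
# Sketch of a pull-and-convert step that could run at regular intervals (e.g. via cron).
# The target field names ("title", "intro", "link") are hypothetical placeholders;
# the real mapping depends on whatever Ineo and its Sanity backend require.
from typing import List

def convert_record(tool: dict) -> dict:
    """Map one harvested schema.org/codemeta record to a hypothetical Ineo document."""
    return {
        "title": tool.get("name"),
        "intro": tool.get("description"),
        "link": tool.get("url") or tool.get("codeRepository"),
    }

def convert_all(tools: List[dict]) -> List[dict]:
    """Convert a batch of harvested records; loading them into Sanity is up to Ineo."""
    return [convert_record(tool) for tool in tools]
```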

The Tool Discovery pipeline has a development instance running at https://tools.dev.clariah.nl/. Ineo developers can use this as the source until the production instance is up and running (#129).

We build on top of codemeta and schema.org; a specification of the data format as we use it and communicate it to developers is documented in the Software Metadata Requirements (in development). This, in combination with the upstream definitions from the projects we build on, should provide sufficient information to understand the data and map it to whatever format Ineo needs.

The tool discovery pipeline aims for completeness; all CLARIAH tools should be in there (this is in the requirements). However, this is not necessarily so for Ineo, where other criteria may be used to warrant inclusion or exclusion. The tool discovery output encodes software at a fairly fine-grained level that aligns with the technical reality of the software and the way it is managed in source code repositories. We explicitly distinguish the software source code from the so-called 'target products' (which may be 'software services' or other build artifacts) that emerge from the source code. The notion that software offers various interfaces, and that different interfaces are appropriate for different audiences, is something we especially try to accommodate in our harvesting pipeline.
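
As an illustration of that distinction, a harvested record could roughly take the following shape (shown as a Python dict in JSON-LD style; property names follow schema.org/codemeta conventions, but the exact shape of our output is defined in the documentation referenced above):

```python
# Illustrative only: a source code record with its 'target products' (the interfaces
# built from that source). The tool, repository and URLs are hypothetical; the
# concrete records produced by the pipeline may differ in detail.
example_record = {
    "@context": "https://schema.org",
    "@type": "SoftwareSourceCode",
    "name": "ExampleTool",  # hypothetical tool
    "codeRepository": "https://github.com/example/exampletool",
    "license": "https://spdx.org/licenses/GPL-3.0-only",
    "author": [{"@type": "Person", "name": "Jane Developer"}],
    "targetProduct": [
        {"@type": "WebApplication", "name": "ExampleTool web service",
         "url": "https://exampletool.example.org"},
        {"@type": "SoftwareApplication", "name": "exampletool command line tool"},
    ],
}
```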

Ineo may decide to aggregate tools in a way that better fits the end user's needs. But I do think it should ensure that there is always a path to obtain the software and its source code itself (and to credit the original authors, license, etc.), not just provide access to running services.

The core metadata vocabulary is already decided from our side, but various vocabulary discussions are still ongoing (#32) and we need to come to a common agreement there soon; this will determine the more 'prescriptive' side of what we actively require developers to provide aside from the basics.

@Seb-CLARIAH: I hope this provides enough information to get started on this and serves as input for the meeting you propose?

proycon commented 1 year ago

This issue hasn't been updated in a while and progress is slow; just for the record, the current (revised) strategy is as follows:

Regarding the route to Ineo, @tvermaut, @menzowindhouwer and I decided at our last meeting in April that Menzo will gather both the data and the tools, transform them on our side in one go to Ineo's desired input format, and push them there, including the rich content (https://github.com/CLARIAH/ineo-content/). The main argument for this was to keep full control on our own side and to keep things as simple as possible for the front-end developers. This does deviate from what I proposed earlier, where Ineo would communicate directly with the tool discovery backend and the transformation would lie with the external party. The downside of the new approach is that it means more work for us, since we (in this case Menzo & team, not so much myself) effectively take on the bulk of the Ineo data provisioning, which takes extra time. The big advantage is that we keep much more control and are less dependent on external parties.
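
In outline, the revised flow on our side then looks something like the sketch below; the Ineo ingestion endpoint, token and payload shape are placeholders, as the actual mechanism is whatever gets agreed with the Ineo developers:

```python
# Skeleton of the revised flow: harvest on our side, merge with the rich content
# from CLARIAH/ineo-content, transform, and push everything to Ineo in one go.
# The endpoint and token below are placeholders, not a real Ineo API.
import requests

INEO_IMPORT_URL = "https://example.org/ineo/import"  # placeholder
API_TOKEN = "changeme"                               # placeholder

def push_to_ineo(documents: list) -> None:
    """Push a batch of already-transformed documents to Ineo's ingestion endpoint."""
    response = requests.post(
        INEO_IMPORT_URL,
        json=documents,
        headers={"Authorization": f"Bearer {API_TOKEN}"},
    )
    response.raise_for_status()
```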

proycon commented 1 year ago

@menzowindhouwer What is the current state of this (the transformation of the codemeta files to Ineo)? I see there is work being done in https://github.com/CLARIAH/ineo-collaboration and am curious to hear how it's progressing.

I'm also wondering to what extent there is already code to transform the rich content from https://github.com/CLARIAH/ineo-content/ into whatever data structures Ineo wants.

menzowindhouwer commented 1 year ago

We worked on the pipeline that assesses whether a record is a candidate for INEO and/or has been updated and needs to be refreshed in INEO. We also worked on reading the rich user content into a Python dictionary. Currently we're busy hashing out the template that will steer the transformation and the merge of the rich user content and the codemeta record.
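
For the record, the idea behind the update check and the merge can be sketched roughly as follows; this is a simplified illustration, and the actual implementation in CLARIAH/ineo-collaboration may differ:

```python
# Simplified illustration of the update check and the merge step; the actual
# pipeline in CLARIAH/ineo-collaboration may implement these differently.
import hashlib
import json

def record_fingerprint(record: dict) -> str:
    """Stable hash of a codemeta record, used to detect changes between runs."""
    canonical = json.dumps(record, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def needs_refresh(record: dict, previous_fingerprints: dict) -> bool:
    """True if the record is new or has changed and should be (re)sent to INEO."""
    identifier = record.get("identifier") or record.get("@id", "")
    return previous_fingerprints.get(identifier) != record_fingerprint(record)

def merge(codemeta: dict, rich_content: dict) -> dict:
    """Overlay the rich user content on the harvested codemeta record (rich content wins)."""
    merged = dict(codemeta)
    merged.update(rich_content)
    return merged
```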

proycon commented 7 months ago

This has been implemented by @inge1211 in https://github.com/CLARIAH/ineo-collaboration