CLARIAH / clariah-plus

This is the project planning repository for the CLARIAH-PLUS project. It groups all technical documents and discussions pertaining to CLARIAH-PLUS in a central place and should facilitate findability, transparency and project planning, for the project as a whole.

Define components, standards, requirements for tool discovery #31

Closed proycon closed 2 years ago

proycon commented 2 years ago

We need to clearly define the software components, service components and data components for tool discovery, along with the standards we adopt and requirements we want to set for all CLARIAH participants.

All these will be formulated here as part of the Shared Development Roadmap v2: https://github.com/CLARIAH/clariah-plus/blob/main/shared-development-roadmap/epics/fair-tool-discovery.md

It contains an initial proposal, which has already been discussed and positively received by the technical committee, but further details remain to be filled in. A workflow schema also still needs to be added.

Further discussion can take place in this thread.

proycon commented 2 years ago

Relevant short blog post from the Software Sustainability Institute: https://software.ac.uk/blog/2021-05-20-what-are-formats-tools-and-techniques-harvesting-metadata-software-repositories

proycon commented 2 years ago

I also recommend this paper on FAIR research software: https://content.iospress.com/articles/data-science/ds190026

proycon commented 2 years ago

The main premises I envision for software metadata harvesting are:

The role of the harvester is to collect software metadata in one common vocabulary (codemeta, plus whatever extended vocabulary we need) from all CLARIAH software. The procedure is as follows:

  1. The input for the harvester is a list of source code repositories (and service endpoints, but more about this later). I call this the tool source registry; it can simply be a git repository holding the necessary configuration files.
  2. The harvester queries these source repositories (it simply git clones them and then looks for certain files).
  3. Ideally, there is a codemeta.json at the root of the source repo; in that case we collect it and are done.
  4. If not, we detect what other supported metadata is present, and invoke the necessary tool(s) to convert it to codemeta.
  5. The harvester stores its results in the tool store (a minimal sketch of this loop follows below).
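
A minimal sketch of that loop, assuming a registry that is just a list of repository URLs and a hypothetical `convert-to-codemeta` helper (neither is the actual implementation of #33):

```sh
#!/bin/sh
# Rough sketch of the harvest loop; registry format and helper name are hypothetical.
REGISTRY=registry.txt   # one source repository URL per line
STORE=toolstore         # directory standing in for the tool store

mkdir -p "$STORE" work
while read -r repo; do
    name=$(basename "$repo" .git)
    git clone --depth 1 "$repo" "work/$name"
    if [ -f "work/$name/codemeta.json" ]; then
        # Ideal case: the repository already ships its own codemeta.json
        cp "work/$name/codemeta.json" "$STORE/$name.codemeta.json"
    else
        # Otherwise delegate to a dedicated converter (setup.py, package.json, README, ...)
        convert-to-codemeta "work/$name" > "$STORE/$name.codemeta.json"
    fi
done < "$REGISTRY"
```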

The role of the tool store is to hold the aggregated collection of codemeta records (one per tool) and make it available for querying. The tool store would then allow SPARQL and/or other queries on the data collection. Export functions to other metadata formats (CMDI, OAI, Ineo's YAML) could either be built in server-side or live in dedicated clients.

As for implementations, I'd like to aim for simplicity. I started a concept for a codemeta harvester (#33) in the form of a POSIX shell script that probably won't exceed 250 LoC. The real work of converting other metadata to codemeta is delegated to other dedicated tools (and that's where the work will be).

The tool store can probably also be kept quite simple. Just loading all triples into memory (it'll be of very limited scope after all; we don't intend to scale to thousands of tools) and allowing some kind of SPARQL query on it will already get us a long way. Alternatively, existing triple stores like Virtuoso could be considered (but might be overkill).
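
To illustrate the kind of querying I have in mind, assuming the tool store exposes a standard SPARQL 1.1 endpoint (the endpoint URL below is purely hypothetical):

```sh
# Hypothetical endpoint; any store speaking the SPARQL 1.1 protocol would accept a query like this.
curl -G "https://tools.example.clariah.nl/sparql" \
     -H "Accept: application/sparql-results+json" \
     --data-urlencode 'query=
PREFIX schema: <http://schema.org/>
SELECT ?name ?repo WHERE {
  ?tool a schema:SoftwareSourceCode ;
        schema:name ?name ;
        schema:codeRepository ?repo .
} LIMIT 10'
```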

Now there's one major aspect which I skipped over. I want to make a clear distinction between Software and Software as a Service, and this proposal thus far has neglected the service aspect. However, in CLARIAH we're quite service-oriented and most software will be made available as a service, i.e. a web application hosted at a particular institute and made available over the web with proper federated authentication etc. We want to have these 'service entrypoints' in our metadata as well, but they don't fit the paradigm of being specified in the source code repository, because the source code repository doesn't/shouldn't know where/when it is deployed.

Codemeta is more focused on describing the source code, so already in 2018 I proposed an extension to codemeta that also allows describing entrypoints and specifying their interface type. This is limited and not intended to be a full interface specification like what OpenAPI or CLAM offers; the URL to such a full service specification is simply a field in this extension.

To accommodate software as a service, I imagine that we also list service endpoints as part of the tool source registry (and not just source code repositories). The harvester can then query these endpoints, convert the metadata found there to codemeta (e.g. using my extension), and use it to augment the metadata obtained from the source repository. These endpoints could offer OpenAPI, OAI-PMH, or simply Dublin Core metadata in HTML, as long as we have some kind of tooling available to do a proper mapping (this is again where the actual work is). We'll probably have to cope with some amount of diversity, but we should limit this to a manageable degree by formulating clear software/service requirements for CLARIAH.
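
As a very rough illustration, a registry entry for a single tool could look something like this (format and field names are made up for the sake of the example; the actual configuration format is still to be decided):

```sh
# Hypothetical registry entry; only the source repository URL is real, the rest is illustrative.
mkdir -p tool-source-registry
cat > tool-source-registry/frog.yml <<'EOF'
source_repository: https://github.com/LanguageMachines/frog
service_endpoints:
  - url: https://webservices.example-institute.org/frog
    metadata: CLAM    # the kind of metadata the harvester can expect at this endpoint
EOF
```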

proycon commented 2 years ago

Had a quick call with @menzowindhouwer about this, to discuss possible alignment with the FAIR Datasets track. He wants to expand the already established OAI Harvest Manager with further options to deal with non-OAI-PMH and non-XML-based metadata (of which codemeta would be one). Such functionality will be needed for FAIR Datasets anyway. I expressed some concerns regarding complexity when extending that harvest manager to do too much, although it looks well designed and fairly extensible. We decided to continue on both tracks: I'll implement the simple harvester because it will be easy and fast (and we need results quickly here), and Menzo will continue with the harvest manager because it will be needed in other scopes (FAIR Datasets) anyway. The harvester script I propose may also serve as an inspiration/example/proof-of-concept for further development of the OAI Harvest Manager. In the end we can always decide to replace the simpler solution with the more complex one if the latter proves more fruitful.

We'll eventually need further convergence regarding the tool store aspect as well, possibly using the same solution for both tools and data.

proycon commented 2 years ago

Software metadata is often encoded in READMEs. If there is no more formal schema available, we can extract metadata from a README and convert it to codemeta. An existing tool is already available that does precisely this: https://github.com/KnowledgeCaptureAndDiscovery/somef
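
For reference, an invocation roughly along these lines extracts metadata from a repository's README (I'm citing the flags from memory, so check somef's own documentation for the exact options):

```sh
# Assumed somef usage; verify the exact flags with `somef describe --help`.
somef describe -r https://github.com/KnowledgeCaptureAndDiscovery/somef -o metadata.json -t 0.8
```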

proycon commented 2 years ago

As mentioned earlier, the current codemeta standard does not offer everything we need for a more service-oriented approach, as it focuses on describing the software source (schema:SoftwareSourceCode). We also want to be able to describe webservice and web application endpoints (in some generic terms) and make the distinction between software and a software instance/deployment explicit in the metadata. I proposed an extension in https://github.com/codemeta/codemeta/issues/183 but more work/thought may be required here. There is also existing ongoing work at schema.org and the W3C that may serve us here; it is described in schemaorg/schemaorg#2635 and schemaorg/schemaorg#1423 .
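
To make the software vs. deployed-instance distinction concrete, here is a rough JSON-LD sketch that only uses existing schema.org terms (schema:targetProduct pointing to a schema:WebApplication); the names and URLs are invented for illustration, and the eventual extension vocabulary may well look different:

```sh
# Illustrative only: link a SoftwareSourceCode record to one deployed web application instance.
cat > codemeta.json <<'EOF'
{
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
    "name": "example-tool",
    "codeRepository": "https://github.com/example/example-tool",
    "targetProduct": {
        "@type": "WebApplication",
        "name": "example-tool webservice",
        "url": "https://example-tool.example-institute.org",
        "provider": { "@type": "Organization", "name": "Example hosting institute" }
    }
}
EOF
```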

ddeboer commented 2 years ago

There’s a slight contradiction between:

> All software metadata is stored and maintained at the source as much as possible.

and

> We automatically map to codemeta from various existing schemas (…)
>
>   1. If not, we detect what other supported metadata is present, and invoke the necessary tool(s) to convert it to codemeta.

A way to solve this is to make the codemeta.json a hard requirement and offer tooling and documentation on how owners can generate a codemeta.json based on their current metadata (e.g. GitHub repo metadata, language-specific package metadata, etc.). I see two advantages to this approach:

  1. Software owners keep full ownership of their metadata; there’s no ‘magic’ extraction that they have no control over. Instead, they themselves generate the codemeta.json, giving them the chance to make manual corrections to it.
  2. It keeps the Harvester simpler, because it then only has to look for the codemeta.json file.

The question, of course, is whether we can ask this of software developers. We can mitigate this by offering a separate web service, a Software Metadata Extractor or CodeMeta Generator, where developers enter the URL of a repository, magic happens, and a codemeta.json is returned, which the developers copy, possibly modify and add to their repository. This way you make it easy for developers but still give them full ownership of the metadata.
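
A sketch of that flow from the developer's point of view (the `generate-codemeta` command is purely a hypothetical stand-in for whatever extractor/web service we end up offering):

```sh
# Hypothetical developer-side flow; `generate-codemeta` stands in for the future extractor.
generate-codemeta --from package.json > codemeta.json
# The developer reviews and corrects the generated file, then commits it:
git add codemeta.json
git commit -m "Add codemeta.json software metadata"
```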

A final problem is that of synchronising metadata: for example if developers change the LICENSE file or the license property in the language-specific package definition (e.g. package.json for NPM), how is that change propagated to the codemeta.json? Your approach has this same problem once owners have added a codemeta.json (because the Harvester would then ignore any other metadata).

proycon commented 2 years ago

Those are very good points, yes. I was aware there was a bit of a contradiction and that the requirements might need some tweaking as the tool discovery task progresses. I was also a bit on the fence about how hard the requirement should be. The ownership argument you put forward is a good one, and for CLARIAH software it would be a fair demand to make. If we want to add some CLARIAH-specific vocabulary it might even be inevitable. But for possible external software, and for some flexibility, it helps if the harvester can do the conversion in cases where it wasn't already provided; it also helps prevent the sync issue you describe later.

> We can mitigate this by offering a separate web service, a Software Metadata Extractor or CodeMeta Generator

Yes, the current harvester+conversion implementation I'm working on actually provides that function as well (without the webservice part though). The whole thing should remain simple enough.

> A final problem is that of synchronising metadata: for example if developers change the LICENSE file or the license property in the language-specific package definition (e.g. package.json for NPM), how is that change propagated to the codemeta.json? Your approach has this same problem once owners have added a codemeta.json (because the Harvester would then ignore any other metadata).

I think a part of the job of the harvester+conversion is to do some basic validation so blatant out-of-sync errors are reported.

But the syncing issue indeed remains if users provide an explicit codemeta.json, later update their package-specific metadata, and neglect to update the codemeta.json. This is part of why I was on the fence about requiring a codemeta.json versus auto-converting it every time. Generation of the codemeta.json can also be invoked automatically, from things like setup.py, in a git commit hook, or through a continuous deployment environment (but that might be overkill and complicate things).
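
For example, a git pre-commit hook along these lines could keep the file in sync (again a rough sketch; `generate-codemeta` is a hypothetical stand-in for whichever converter we provide):

```sh
#!/bin/sh
# .git/hooks/pre-commit -- sketch only; `generate-codemeta` is a hypothetical converter.
# Regenerate codemeta.json whenever package metadata is part of the commit.
if git diff --cached --name-only | grep -q -E '^(setup\.py|pyproject\.toml|package\.json)$'; then
    generate-codemeta --from . > codemeta.json
    git add codemeta.json
fi
```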