Define extra vocabulary for tool discovery

proycon commented 2 years ago

We decided on adopting codemeta (+schema.org), and linked open data in general, as a basis for software metadata. Codemeta derives a lot of terms from schema.org and actively collaborates with them. Additionally, they propose some terms on top of schema.org that are not assimilated yet.

However, on top of this, we may still need certain CLARIAH specific terms for describing more domain-specific aspects of software/service. In this issue, which can be considered a continuation of #23, we want to track that effort and define that vocabulary. We may be able to reuse vocabulary compiled by @JanOdijk in a prior CLARIN/CLARIAH project (which was in the form of a CMDI profile).

As Ineo is the prospective front-end for tool discovery, we also need an exact specification of the vocabulary they currently adopted (for which, I believe, they have an import facility via YAML). They have undoubtedly given this subject a fair amount of thought already, and a mapping between their vocabulary and codemeta + whatever extensions we add is essential.

(minor update 2022-02-23: The YAML import facility Ineo offers is not relevant for us)

proycon commented 2 years ago

Ineo's frontend is being delivered in January (according to Sebastiaan); that may already give some insight into the tool metadata vocabulary they use. More elaborate specifications from Ineo's side (their YAML syntax) are planned for February.

ddeboer commented 2 years ago

Use https://schema.org/WebAPI (still a proposal, but already usable) for service endpoints (see also https://github.com/CLARIAH/clariah-plus/issues/31#issuecomment-1011376578) in addition to the standard CodeMeta vocab.
Create a SHACL shape to validate CLARIAH-specific codemeta.json files.

proycon commented 2 years ago

Use https://schema.org/WebAPI (still a proposal, but already usable) for service endpoints (see also Define components, standards, requirements for tool discovery #31 (comment)) in addition to the standard CodeMeta vocab.

I proposed the following upstream to codemeta with regard to handling service endpoints (with support for the proposed WebAPI indeed): codemeta/codemeta#271. You could indeed also interpret that as defining extra vocabulary we need, though my aim with this current issue was more to identify possible CLARIAH-specific vocabulary which cannot readily be upstreamed to codemeta or schema.org.

Create a SHACL shape to validate CLARIAH-specific codemeta.json files.

That sounds like a good solution yes, and alongside this we'll also need to do some more validation against the context as we discussed earlier in #31 (e.g. does the metadata not contradict the LICENSE file, the git tag vs the version in the codemeta, etc).
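A minimal sketch of what such a check against the repository could look like, assuming the git tag and the SPDX identifier of the LICENSE file have already been obtained by some other means (for instance via the GitHub API). The function name `find_discrepancies` and the tag convention (an optional leading `v`) are assumptions made for illustration, not part of any existing CLARIAH tooling.

```python
import re


def find_discrepancies(codemeta, git_tag, license_spdx_id):
    """Compare fields of a parsed codemeta.json with facts from the repository.

    codemeta        -- dict loaded from codemeta.json
    git_tag         -- latest release tag, e.g. "v1.2.0" (assumed convention)
    license_spdx_id -- SPDX identifier detected for the LICENSE file
    """
    problems = []

    # Version check: drop an optional leading "v" from the tag before comparing.
    tag_version = re.sub(r"^v", "", git_tag)
    if codemeta.get("version") != tag_version:
        problems.append(
            f"version {codemeta.get('version')!r} does not match git tag {git_tag!r}"
        )

    # License check: the license is often given as an SPDX URL, so compare
    # only the last path segment with the detected identifier.
    declared = str(codemeta.get("license", "")).rstrip("/").rsplit("/", 1)[-1]
    if declared.lower() != license_spdx_id.lower():
        problems.append(
            f"license {declared!r} does not match LICENSE file ({license_spdx_id!r})"
        )

    return problems


example = {"version": "1.2.0", "license": "https://spdx.org/licenses/GPL-3.0-only"}
for problem in find_discrepancies(example, "v1.3.0", "GPL-3.0-only"):
    print(problem)
```

The returned list of messages could feed whatever notification mechanism is eventually chosen.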

ddeboer commented 2 years ago

validation against the context

Do you mean by this the context of the repository holding the codemeta.json file, so any other files and any metadata (e.g. from the GitHub API)? The term ‘context’ is slightly confusing in the context of RDF vocabularies and JSON-LD contexts. 😁 Should we call this the ‘repository metadata’ instead?

If our requirements stipulate a maintainer (team) e-mail address, we could use that to send notifications if there’s a discrepancy between the codemeta.json and the repository (or possibly even auto-submit a PR with an updated codemeta.json).

Anyway, this discussion is quite separate from the one at hand: an extension of the CodeMeta vocabulary. I’ll read your proposals and see if I can come up with a PR here.

proycon commented 2 years ago

Do you mean by this the context of the repository holding the codemeta.json file, so any other files and any metadata (e.g. from the GitHub API)? The term ‘context’ is slightly confusing in the context of RDF vocabularies and JSON-LD contexts. 😁 Should we call this the ‘repository metadata’ instead?

Yes, that's what I meant: things like the git tag, the LICENSE file, possibly mediated by the GitHub API.

If our requirements stipulate a maintainer (team) e-mail address, we could use that to send notifications if there’s a discrepancy between the codemeta.json and the repository (or possibly even auto-submit a PR with an updated codemeta.json).

Agreed, some kind of automated notification should be quite doable.

proycon commented 2 years ago

When looking at the earlier work done for (WP3) software metadata in CMDI by @JanOdijk et al, I notice the following fields:

toolCategory - Contains values like 'written language tool', 'mono-lingual tool'
toolTask - Contains values like 'lemmatisation', 'tokenisation', 'dependency parsing'
researchDomain - Contains values like 'linguistics', 'neurolinguistics'
researchPhase - Contains values like 'enriching data'
linguisticSubject - Contains values like 'general linguistics', 'syntax'

These are some very specific fields, each taking a particular closed vocabulary. Some of those are in the CLARIN Concept Registry but most are not. The thing they have in common, I think, is that they are all about categorizing the software along certain dimensions. These are eventually disclosed in a faceted search.

In schema.org we only have applicationCategory and applicationSubCategory. (I'm not even sure why they found it necessary to have two properties. Even with one applicationCategory property it's already possible to map to whatever categorization we want if the values are proper URIs, the property can simply be used multiple times if links to multiple taxonomies are desired.)

To encode this type of information, we have two options:

Just use applicationCategory multiple times pointing to different vocabularies (fully qualified URIs).
Explicitly define the properties in an independent CLARIAH-specific schema (adding to the @context).

In both cases we need to explicitly define the closed vocabulary in a proper linked open data manner; something like skos:Concept would work and integrate best with existing efforts. The question is also whether we really want to go into this level of detail at all, and if so, whether there are even more fields/values needed because the existing schema had a linguistics (WP3) focus.

I'm leaning more towards option 1 because categorization will always be something that different people want to do differently and multiple paradigms should be able to co-exist.
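To illustrate option 1, here is a rough sketch that adds several category URIs from different vocabularies to one `applicationCategory` property. All URIs under `https://example.org/` are placeholders, and the helper name `with_categories` is hypothetical. Whether the values should be plain strings or `{"@id": ...}` nodes depends on the JSON-LD context in use, which this sketch does not settle.

```python
import json


def with_categories(codemeta, category_uris):
    """Return a copy of the metadata with the given URIs added to applicationCategory."""
    result = dict(codemeta)
    existing = result.get("applicationCategory", [])
    if isinstance(existing, str):
        # A single value is allowed; normalise it to a list first.
        existing = [existing]
    merged = list(existing)
    for uri in category_uris:
        if uri not in merged:
            merged.append(uri)
    result["applicationCategory"] = merged
    return result


tool = {"@type": "SoftwareSourceCode", "name": "ExampleTool"}
tool = with_categories(
    tool,
    [
        # Placeholder URIs standing in for concepts from two different taxonomies.
        "https://example.org/clariah-vocab/toolTask/lemmatisation",
        "https://example.org/other-taxonomy/enriching-data",
    ],
)
print(json.dumps(tool, indent=2))
```

Because the property is simply repeated, a second categorization scheme can be added later without changing the schema.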

The second important information from the CMDI profiles is denoting the languages a particular tool is intended for, which is especially relevant for our WP3 tools. I had a brief discussion with the codemeta community on representing this back in 2018: codemeta/codemeta#188 .

Any thoughts?

jblom commented 2 years ago

@proycon I would also say option 1 would be nicer/easier/leaner. It shifts the responsibility of very domain specific stuff to the tool maker. Also it prevents defining a very detailed tool ontology too early in the game.

One downside of using applicationCategory is that, in order to use the information, harvesters must resolve the URI and be able to at least grab a label before it can be nicely used (e.g. for faceted search in Ineo).
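A sketch of the label lookup a harvester would need, assuming the vocabulary has already been fetched and is available as a local table of concepts. The table layout, the example URI and the function name `label_for` are assumptions for illustration; no network access is shown.

```python
import re

# Hypothetical, already-harvested vocabulary: concept URI -> labels per language.
CONCEPTS = {
    "https://example.org/clariah-vocab/toolTask/lemmatisation": {
        "prefLabel": {"en": "Lemmatisation", "nl": "Lemmatisering"},
    },
}


def label_for(uri, concepts, lang="en"):
    """Return a display label for a category URI, e.g. for a search facet."""
    labels = concepts.get(uri, {}).get("prefLabel", {})
    if lang in labels:
        return labels[lang]
    # Unknown concept or language: fall back to the last URI segment.
    return re.split(r"[/#]", uri.rstrip("/#"))[-1]


print(label_for("https://example.org/clariah-vocab/toolTask/lemmatisation", CONCEPTS))
print(label_for("https://example.org/other-taxonomy#tokenisation", CONCEPTS))
```

The fallback keeps a facet usable even when a tool maker points at a vocabulary the harvester does not know.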

I'd say first get a first round of codemeta.json done (for all current tools per WP) and then see if there is really something lacking (when hooking up the tool store to Ineo).

proycon commented 2 years ago

Last Monday we had a first exploratory meeting about Ineo with Sebastiaan Fluitsma and developers from Eight Media. The issue of aligning vocabularies was raised. We need to know what information Ineo requires and how this might align with information that is already available in codemeta/schema.org. This may result in a set of additional vocabulary (i.e. keys and values) we impose as software metadata requirements on the various CLARIAH developers. Sebastiaan will prepare an initial list. Ideally I'd like to keep this list of required extra keys as small as possible and make an explicit separation between requirements, recommendations and suggestions. This might help software metadata providers prioritize and keep the maintenance burden as low as possible.

For categorization, Sebastiaan earlier said Tadirah is proposed as an option, but he stressed that nothing was final at this point, so this should not be relied on yet.

In the initial stages where we can harvest software metadata but have no additional vocabulary yet, we simply deal with what we have and certain fields may be empty (in line with what Jaap suggested above).

Let's keep each other updated on this in this issue and use this as the leading discussion thread on this topic. There is a bit of a lack of clarity (at least in my view) about who in the end establishes the vocabulary. I would say that it's a joint task between us (as Tool Discovery working group) and Ineo (and eventually for the board to approve or not). Anybody is welcome to join the Tool Discovery working group or to simply comment on anything we do here. The only thing I'd like to prevent as we go forward, if you agree, is there being separate decisions on software metadata vocabulary which bypass this group (as that defeats the purpose of us being a group).

To get collaboration with Ineo started, there was initial agreement that one or more Ineo developers would join here so we can establish open, quick and efficient communication lines (again; here on the issue tracker).

rlzijdeman commented 2 years ago

I would say that in Dutch cultural heritage / social science history the default workflow is to first resort to schema.org. More specific vocabularies are used when none are available in schema, or when the data model implied by use of schema.org does not fit the data properly. Mind you, specific issues are tackled by schema.org if you mention them (e.g. historical occupations can now be represented generically in schema).

Given what is written below, and how it aligns with describing domain data, I would very much favour a schema.org-first approach, with something more specific next, for tools and data as well.

proycon commented 2 years ago

I agree with the schema.org-first approach; that's the idea behind our adoption of codemeta too, as they actively collaborate with schema.org and upstream their vocabulary. Nevertheless, there will undoubtedly be gaps in schema.org and codemeta (one of which we are trying to fill already in https://github.com/codemeta/codemeta/issues/271 ), and though the aim is to upstream new vocabulary, that may be a slow process for which we sometimes need a shortcut (temporary externally defined vocabulary).

ddeboer commented 2 years ago

Just to be clear: if possible, we should define the extra vocabulary as an extension of existing Schema.org/Codemeta classes and properties (so subclasses and -properties that improve precision). This way, clients that understand Schema.org only will still be able to understand the new vocabulary parts.

proycon commented 2 years ago

Agreed, that's the approach we're taking in https://github.com/SoftwareUnderstanding/software_types as well (it defines mostly subclasses of schema:SoftwareApplication).

proycon commented 2 years ago

I met with the IG Findability today (@rsiebes et al) and also discussed metadata a bit with @JanOdijk during the regular WP3 meeting.

For the extra vocabulary (categories, research domains etc.) we're going to add a Turtle/JSON-LD file to https://github.com/CLARIAH/tool-discovery and port some of the vocabulary that was already set up in earlier projects, probably using SKOS. All interested parties can contribute to getting the vocabulary there amended. I also hope to see representation from the Ineo side there.

An initial extraction of data from the existing CMDI profile will be handled by @menzowindhouwer. @menzowindhouwer: Focus only on the highly specific fields for CLARIAH and not on any of the general stuff that is already in codemeta/schema.org.

For categorization, Sebastiaan earlier said Tadirah is proposed as an option, but he stressed that nothing was final at this point, so this should not be relied on yet.

Though this looked fine to me, I've heard some criticism of Tadirah from people in the IG Findability and from @JanOdijk . I don't think we'll quickly find any one vocabulary that fits all perfectly though.

proycon commented 2 years ago

At this point I also want to stress that the extra vocabulary for categorization is a secondary concern and not a show-stopper for continued development of the conversion tools and harvester pipeline. I want to get a fully functional harvester pipeline up and running as soon as possible, even if it only contains the 'coarser' metadata.

JanOdijk commented 2 years ago

Here's the file (Excel) I sent 5 years ago to Menzo to provide definitions for the labels used inside the CLARIN profile ClarinSoftwareDescription clarin.eu:cr1:p_1342181139640, and for inclusion in the CLARIN Concept Registry. It never got integrated in the CLARIN Concept Registry, in part because the committee monitoring this was not functioning.

I am not sure this file contains all labels; it is possible and actually very likely that new ones have been added since. I will try to provide the actual lists of labels as well, but that is not so easy in the CMDI Component Registry. Link to the file: Concepts to be added for MD4T 2017-09-08.xlsx .

JanOdijk commented 2 years ago

Actually, the file I attached is richer than what I sent to Menzo, because the additional information (which is very useful) did not fit into the template for the CCR.

The file also contains currency codes, but these (if really needed) are better taken from independent sources.

proycon commented 2 years ago

Thanks! That is very useful information! We can use this list to extract the five categories I mentioned before:

toolCategory - Contains values like 'written language tool', 'mono-lingual tool'
toolTask - Contains values like 'lemmatisation', 'tokenisation', 'dependency parsing'
researchDomain - Contains values like 'linguistics', 'neurolinguistics'
researchPhase - Contains values like 'enriching data'
linguisticSubject - Contains values like 'general linguistics', 'syntax'

That's where the valuable additional information is, I think. We'll convert it to a more formalized json-ld schema and put it in https://github.com/CLARIAH/tool-discovery . (@menzowindhouwer I think you can skip that XSLT script now we have this)

proycon commented 2 years ago

We also need to describe the natural languages that tools operate on. This is something most software metadata descriptions (aside from the CMDI profile) haven't concerned themselves with yet, but it is particularly relevant for NLP tools. I propose we reuse https://schema.org/availableLanguage . Formally, its domain would need to be expanded to include schema:SoftwareSourceCode and schema:SoftwareApplication.

proycon commented 2 years ago

Small update on the above, I had already proposed something to solve this to the codemeta community four years ago and have somewhat refined that proposal now, see https://github.com/codemeta/codemeta/issues/188 . It doesn't use availableLanguage like I suggested above.

JanOdijk commented 2 years ago

I extracted (manually, you do not want to know how, because the interface of the CLARIN Component Registry is a disaster for this purpose) the actual values in the CMDI profile ClarinSoftwareDescription for many elements. They may contain values that are not yet included in the Excel sheet I provided earlier. I put them in a text document with the following formal but somewhat ad-hoc syntax: elementname (followed by :) values, each on a new line **********

The asterisks mark the end of the values. (The backticks in the source are here only to have the asterisks not interpreted by markdown.) Each value line contains three fields, comma-separated: label, description, PID. Descriptions and PIDs are often absent. The file: CSD vocabularies.txt

All the vocabularies are of course also in the profile definition (an XML file), often in multiple places. Also, the profile uses the ISO 639-3 language codes (some 6000 values...) very often (and the list of 6000 values is therefore repeated over and over). The choice to include all these values inside the XML profile seems not so wise to me given such cases.

The schema file for this profile is here. It should be possible to download it as an XML file as well, but I can no longer find how (Menzo?).

menzowindhouwer commented 2 years ago

The CMDI profile can be downloaded here: https://catalog.clarin.eu/ds/ComponentRegistry/rest/registry/1.x/profiles/clarin.eu:cr1:p_1342181139640/xml (replace /xsd by /xml). The spec for this XML format can be found here: https://www.clarin.eu/content/cmdi-specification-version-12 . Section 3, and especially section 3.5, describes vocabularies as they are currently supported by CMDI 1.2.

In this case the profile used the vocabularies in a closed way, i.e., only these values are allowed. These values are thus part of the profile and will be included in the derived XSD, so during record validation the instances can be checked against the closed vocabulary.

CMDI 1.2 also supports open vocabularies. In this case the vocabulary lives in CLAVAS so editors & friends can use its API to suggest values.

CMDI indeed doesn't have support for closed vocabularies as a 1st class citizen, they are always tightly bound to a single element or attribute ... maybe a nice feature for CMDI 1.3 or 2.0 ;-)

menzowindhouwer commented 2 years ago

@proycon I will be happy to create the code to transform the vocabularies into their SKOS equivalent ...

proycon commented 2 years ago

@JanOdijk Thanks for the updated list!

I now actually took some time to look into the actual vocabulary for the five properties that seem of interest to me, and want to add some comments from my perspective:

researchPhase - This field and the vocabulary for this field make a lot of sense. Minor comment: I just wonder if it can be formulated a bit more consistently perhaps? We have Obtaining Data, Enriching Data and then Browsing and Searching, which could perhaps be Exploring Data?
researchDomain - Makes clear sense, I just wonder if the vocabulary is extensive enough to cover everything we do in CLARIAH, but of course it's easy to amend when needed.
linguisticSubject - This has a clear WP3 focus, which is understandable considering the source, but I'd like to turn it into a more generic researchSubject. This does beg the question how researchDomain and researchSubject relate; I'd say they serve a very similar goal but are only at different levels of granularity. Do we really want two properties or can we deal with one? If we opt for one, it could still point to a rich taxonomy where we can point to any desired level of granularity, and the taxonomy itself would know which concepts are broader/narrower. This would simplify the metadata but puts a burden on the search application. If we do keep two properties, it simplifies search applications, which then need no inherent knowledge of the taxonomy.
toolTask - Again a clear focus on linguistics (WP3), but the vocabulary here is very sensible and useful.
toolCategory - I don't see much added value in a lot of these categories because I don't really see a common theme: some group by modality, some group by research phase, some by language. As with researchDomain and linguisticSubject, I also think that the toolTask and toolCategory properties are very related and differ only in granularity. I would propose we simply use the existing schema:applicationCategory property and point it to our own SKOS taxonomy that combines toolTask and toolCategory.
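The burden that a single-property design puts on the search application can be sketched as follows: a query for a broad concept has to be expanded to all narrower concepts before matching. The small hierarchy below is invented for illustration and is not the actual CLARIAH taxonomy; the function names are hypothetical as well.

```python
# Hypothetical broader -> narrower relations (as skos:narrower would express them).
NARROWER = {
    "linguistics": ["syntax", "general linguistics"],
    "syntax": ["dependency syntax"],
}


def expand(concept, narrower):
    """Return the concept together with all concepts that are transitively narrower."""
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue  # guards against cycles in the taxonomy
        seen.add(current)
        stack.extend(narrower.get(current, []))
    return seen


def matches(tool_subjects, query, narrower):
    """True if the tool is tagged with the queried concept or anything narrower."""
    return bool(expand(query, narrower) & set(tool_subjects))


print(matches(["dependency syntax"], "linguistics", NARROWER))
print(matches(["linguistics"], "syntax", NARROWER))
```

With two separate properties, the search application could match on exact values instead and skip this expansion step.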

Some comments on some of the other categories you had:

license - already catered for by schema.org and we'll use the SPDX standard as vocabulary for that
AnnotationType - Seems more for data, out of scope for software metadata I think, overlaps a lot with tool tasks.

I will be happy to create the code to transform the vocabularies into their SKOS equivalent ...

@menzowindhouwer Sure, that'd be very welcome. Just put them in one or more JSON-LD files in https://github.com/CLARIAH/tool-discovery/ and we can work from there. I'd like to offer our data in JSON-LD because that makes it easier to work with, also for people/parsers who are not interested in the entire underlying RDF model. When converting to SKOS, there's no need to focus on anything aside from the five aforementioned categories; the others are already covered (or redundant or needlessly detailed).
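As a rough idea of what such a conversion could produce, the sketch below turns a list of labels into a SKOS concept scheme serialised as JSON-LD. The base URI `https://example.org/vocab/`, the slug rule and the function names are placeholders chosen for illustration; the real files may well be structured differently.

```python
import json
import re

SKOS_CONTEXT = {"skos": "http://www.w3.org/2004/02/skos/core#"}


def slug(label):
    """Turn a label into a URI-friendly identifier (illustrative rule only)."""
    return re.sub(r"[^a-z0-9]+", "-", label.lower()).strip("-")


def to_skos(scheme_name, labels, base_uri):
    """Build a JSON-LD document with one skos:ConceptScheme and its concepts."""
    scheme_id = base_uri + scheme_name
    graph = [
        {
            "@id": scheme_id,
            "@type": "skos:ConceptScheme",
            "skos:prefLabel": {"@value": scheme_name, "@language": "en"},
        }
    ]
    for label in labels:
        graph.append(
            {
                "@id": f"{scheme_id}/{slug(label)}",
                "@type": "skos:Concept",
                "skos:prefLabel": {"@value": label, "@language": "en"},
                "skos:inScheme": {"@id": scheme_id},
            }
        )
    return {"@context": SKOS_CONTEXT, "@graph": graph}


doc = to_skos(
    "toolTask",
    ["lemmatisation", "tokenisation", "dependency parsing"],
    "https://example.org/vocab/",  # placeholder base URI
)
print(json.dumps(doc, indent=2))
```

Broader/narrower relations between concepts are left out here; they could be added as `skos:broader` links once the hierarchy is agreed on.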

JanOdijk commented 2 years ago

Maarten, I agree with most of your comments. toolCategory was there to be backwards compatible with an earlier categorisation (made by others), but I also found that its values were oddly selected.

AnnotationType: well, we were very ambitious about the goals of metadata: we tried to make a profile that could be used not only as a basis for discovery of the tools (via VLO and other channels) but could also serve to formalise documentation wherever this was possible. In addition, we wanted the profile to support an approach in which each tool does not only apply to data, but also to metadata of these data: in the latter case it would find the actual data via the metadata, process the data, and enrich its output data with metadata of these output data. The AnnotationType was one of the properties to describe properties of input and output of the tool. So it indeed relates to data, but is appropriate here because it can be part of the description of the input and output of the tool. The detailed description of the input and output properties gives a more refined description of the functionality of the tool, but it is too detailed to replace the toolTask property, so toolTask cannot be dropped (though ideally it should be derived automatically, as a more general category, from the input and output descriptions of the tool).

We also tried to derive the requirements that the CLARIN Switchboard imposes automatically from such descriptions. However, not many tools have been described in this way, we started with Frog and that might very well be the only tool described in this way so far.

proycon commented 2 years ago

I was in touch with an Ineo developer this week and have obtained the vocabulary lists they are currently using/proposing thus far. This should prove useful for our further discussion. I committed the full dumps of the vocabulary to https://github.com/CLARIAH/tool-discovery/tree/master/schemas/ineo , for easy reference:

resource.researchActivity - Uses Tadirah
resource.researchDomain - No formal URIs, not sure what vocabulary this stems from. Possibly also Tadirah? The categories look sensible to me.
resource.informationType - NeDiMAH ontology (less relevant for software, more for datasets)
resource.mediaType - MIME types

(Giving @menzowindhouwer an extra poke because he was also waiting for this level of information)

Seb-CLARIAH commented 2 years ago

resource.researchDomain uses the NWO Discipline codes: https://www.nwo.nl/disciplinecodes

proycon commented 2 years ago

I'm hoping we can use our session on the CLARIAH Tech day tomorrow to discuss a bit further what vocabularies we want to adopt for tool categories, research domains and research activities/phases.

JanOdijk commented 2 years ago

Hi Maarten, I cannot attend, but whatever you decide, make sure that the distinctions made in the ClarinSoftwareDescription profile can be mapped to these, preferably without loss of information. The values used there have been used for very many tools and services from NL, but also for the whole range of Weblicht Webservices (from two/three years ago, some 270 if I remember well). See http://portal.clarin.nl/clariah-tools-fs-global where you also find descriptions for e.g. part of speech taggers for German and other languages.

Jan

proycon commented 2 years ago

I hadn't reacted to this reply by @JanOdijk yet. It turned out we didn't get to this subject anyway on the last tech day, and we probably won't have time for it at the next one either:

whatever you decide, make sure that the distinctions made in the ClarinSoftwareDescription profile can be mapped to these, preferably without loss of information. The values used there have been used for very many tools and services from NL, but also for the whole range of Weblicht Webservices (from two/three years ago, some 270 if I remember well). See http://portal.clarin.nl/clariah-tools-fs-global where you also find descriptions for e.g. part of speech taggers for German and other languages.

I'll try my best to accommodate the prior work on the ClarinSoftwareDescription CMDI profile to a degree, but this will never be a complete one-to-one mapping. The ClarinSoftwareDescription profile serves first and foremost as an inspiration, example and discussion point for establishing the exact vocabularies we want. Other factors I'm taking into account here are 1) what's available already in schema.org/codemeta and existing software metadata ecosystems and 2) what choices Ineo (@Seb-CLARIAH) is committing to already, from the user perspective.

A precise mapping between our results and the ClarinSoftwareDescription CMDI profile will be relevant only for conversion of the output of our tool discovery pipeline, to be worked out in https://github.com/CLARIAH/clariah-plus/issues/37 (in Utrecht). To be clear, the reverse, a mapping FROM the ClarinSoftwareDescription CMDI profile to our metadata schema is explicitly not within our scope and conflicts with the tool discovery pipeline which considers other sources and schemas as primary. I consider the existing CMDI metadata files legacy data that will eventually be replaced (and automatically and continuously so) once #37 is completed, their role is then to enable interoperability with CLARIN tooling like the VLO that expects CMDI.

See http://portal.clarin.nl/clariah-tools-fs-global where you also find descriptions for e.g. part of speech taggers for German and other languages.

(That link is down btw)

proycon commented 1 year ago

As explained in #138, I'm considering the vocabulary discussions closed now.

proycon commented 1 year ago

It was decided in the Tech Committee today that we will finalize the software metadata requirements (which codify the outcome of this vocabulary discussion) at the upcoming tech day (2022-10-27).
