CLARIAH / clariah-plus

This is the project planning repository for the CLARIAH-PLUS project. It groups all technical documents and discussions pertaining to CLARIAH-PLUS in a central place and should facilitate findability, transparency and project planning for the project as a whole.

How to handle, automatically harvest and curate software metadata? #23

Closed proycon closed 2 years ago

proycon commented 3 years ago

I spoke with Sebastiaan Fluitsma about Ineo and the role of software metadata today. On past occasions I also often spoke with @janodijk about this. I believe software metadata curation fits the theme of this interest group, although it also heavily involves Linked Open Data (poking @rlzijdeman), so I'll post my thoughts on software metadata and automatic harvesting here:

Ineo is a portal for researchers that aims to present various CLARIAH resources (tools/services and data). One of the concerns we acknowledged was the need to keep the tool metadata in Ineo up-to-date; tool metadata should be accurate, version numbers correct, links valid. This may seem obvious but it is something that often goes wrong, so I'm advocating for clear update and automatic harvesting procedures for metadata.

I have been using codemeta (https://codemeta.github.io/codemeta) as a solution for all my software metadata needs. Codemeta is a linked open data scheme for describing software metadata and it is especially focused on providing so-called 'crosswalks' with various other existing software metadata standards. The crosswalks link metadata description fields from, for example, the Python Package Index, CRAN, Maven and Debian to fields that are included in schema.org. Tools like codemetapy (https://github.com/proycon/codemetapy) and codemetar (https://github.com/ropensci/codemetar) do such conversions.
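To make the crosswalk idea a bit more concrete, here is a toy sketch (made-up field values and a deliberately tiny mapping, not codemetapy's actual code) that maps a few Python packaging fields onto their schema.org/codemeta counterparts:

```python
# Toy illustration of a crosswalk: map a handful of Python packaging fields
# onto schema.org/codemeta properties. Field values are made up.
pypi_metadata = {
    "name": "frog",
    "version": "0.1.0",
    "summary": "An NLP suite for Dutch",
    "home-page": "https://languagemachines.github.io/frog",
    "license": "GPL-3.0",
}

# One crosswalk "row" per source field: source field -> schema.org term
CROSSWALK = {
    "name": "name",
    "version": "version",
    "summary": "description",
    "home-page": "url",
    "license": "license",
}

codemeta = {
    "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
    "@type": "SoftwareSourceCode",
}
for source_field, target_property in CROSSWALK.items():
    if source_field in pypi_metadata:
        codemeta[target_property] = pypi_metadata[source_field]

print(codemeta)
```

The real crosswalk tables cover far more fields and sources than this, but the principle is the same: a field-to-field mapping into the schema.org vocabulary.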

I'm a big supporter of storing the metadata as close to the software as possible, and of automatically harvesting and converting it where possible, thereby avoiding unnecessary data duplication. A certain amount of fundamental metadata can be harvested from the software repositories where the software is deposited (Python Package Index, CRAN, Maven, Debian, etc.). Alternatively, codemeta metadata can be explicitly provided in the software's source code repository, by including a codemeta.json file as is done, for example, in Frog (https://github.com/LanguageMachines/frog/blob/master/codemeta.json).
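A minimal sketch of that precedence (the `load_codemeta` helper and its call signature are hypothetical, just to illustrate preferring an explicit codemeta.json over harvested metadata):

```python
import json
from pathlib import Path

def load_codemeta(repo_dir: str, harvested: dict) -> dict:
    """Prefer an explicit codemeta.json shipped in the source code repository;
    otherwise fall back to whatever was harvested from a package index.
    (Hypothetical helper, not part of any existing CLARIAH tool.)"""
    explicit = Path(repo_dir) / "codemeta.json"
    if explicit.exists():
        with explicit.open(encoding="utf-8") as f:
            return json.load(f)
    return harvested
```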

For instance, for all software installed in a LaMachine (https://proycon.github.com/LaMachine) installation, a codemeta registry is automatically compiled that describes all software it contains. This is in turn used to present a simple portal page like the one that can be seen on https://webservices.cls.ru.nl (a LaMachine installation on a production server in Nijmegen). Of course Ineo is going to be more elaborate than this, but I would still be in favour of letting it pull metadata from a registry that is, as much as possible, compiled by automatic harvesting from other metadata sources. I would want to prevent having all kinds of different versions of duplicated metadata in existence, especially if those are independently and manually curated.
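What such a registry could boil down to, in a very reduced form (the directory layout and the `*.codemeta.json` naming are assumptions for the sake of the example, not how LaMachine actually organises it):

```python
import json
from pathlib import Path

def build_registry(metadata_dir: str) -> dict:
    """Combine individual codemeta documents into one JSON-LD graph that a
    portal could pull from (a sketch, not LaMachine's actual implementation)."""
    entries = []
    for path in sorted(Path(metadata_dir).glob("*.codemeta.json")):
        with path.open(encoding="utf-8") as f:
            entries.append(json.load(f))
    return {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@graph": entries,
    }

print(json.dumps(build_registry("metadata"), indent=2))  # "metadata" dir is hypothetical
```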

The codemeta initiative is limited to the more fundamental metadata that describes all software, which is not enough on its own. There has been an effort by @janodijk to compile official CMDI metadata for various CLARIN/CLARIAH WP3 tools, which takes into account more elaborate domain-specific metadata. This has been a manual curation effort. This is great, but a main concern I have here is that there seems to be no proper update & maintenance mechanism; currently the raw CMDI files are put on SURFdrive (https://surfdrive.surf.nl/files/index.php/s/VEJOEkfbFtWR6Y6). I'd much rather see them maintained in a git repository here in the CLARIAH group so we have 1) a clear update procedure, 2) proper version control and 3) transparency & community interaction.

I think metadata collection/curation could be a layered approach where we combine data from multiple sources when needed. We first grab the basic metadata from as close to the source as possible (converting it from whatever repository it is stored in to codemeta), usually metadata directly provided by the developers. Then, on top of that, we can have a manual curation effort that adds extra CLARIAH domain-specific fields. The final result could be expressed as linked open data in some form, like the JSON-LD that I use for codemeta, which I think is more flexible and preferable, but even as CMDI if that is still preferred. (I believe there are existing initiatives within CLARIAH that treat CMDI as Linked Open Data, like cmd2rdf?) Tools like Ineo can in turn pull from some kind of central CLARIAH metadata registry to always present accurate metadata.
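In its simplest form the layering is just an overlay merge, something like the sketch below (illustrative field values; a real pipeline would need per-field merge rules rather than a blunt overwrite):

```python
import json

def layered_metadata(harvested: dict, curated: dict) -> dict:
    """Overlay manually curated, domain-specific fields on top of automatically
    harvested codemeta; curated values win on conflicts. (An illustration of
    the layered approach, not an existing pipeline.)"""
    merged = dict(harvested)
    merged.update(curated)
    return merged

# Illustrative inputs; the curated fields stand in for the kind of
# CLARIAH-specific information a curator might add on top.
harvested = {"name": "frog", "version": "0.26", "license": "GPL-3.0"}
curated = {
    "applicationCategory": "Natural Language Processing",
    "developmentStatus": "active",
}
print(json.dumps(layered_metadata(harvested, curated), indent=2))
```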

This is just my view on things of course, which I just want to throw out here for debate because I think we have some gains to make here. The LOD-crowd probably has more to say on this too.

JanOdijk commented 3 years ago

Dear Maarten,

I just want to make a few points:

Jan

proycon commented 3 years ago

Hi @JanOdijk, thanks for your input!

  • The CMDI files on SURFdrive are there just for stable reference and accessibility by everyone (I refer to it in a paper on the topic in the CLARIN Selected Papers).

Shall I set up a software-metadata repo in the CLARIAH group and commit everything you have on SURFdrive there? (I'll then transfer maintainership of the repository to you.)

  • But the CMDI files must actually be hosted and maintained (and made available for harvesting by the CLARIN VLO) by the centres where these tools or services run.

I suppose all the participating institutes could simply share and edit the same git repository and use that as the primary source? But I understand you want to have the metadata delivered alongside the actual service, at the hosting centre, ensuring it's all in sync. That makes sense and is what I currently do with codemeta too, but I do wonder to what extent this is implemented in practice for CMDI? (Our portal in Nijmegen, for example, offers no CMDI whatsoever.)

The harvesting is an important point that's in line with my appeal to automate as much as possible. I don't know exactly how the VLO does this (I guess this relates to OAI-PMH?), but if there's a proper mechanism in place and it is actually used in CLARIAH, then I'd suggest Ineo exploit that too, of course.
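For what it's worth, a bare-bones OAI-PMH ListRecords request looks roughly like the sketch below; the endpoint URL and the metadataPrefix are placeholders, since each centre exposes its own endpoint and I don't know how the VLO harvester is actually configured:

```python
from urllib.parse import urlencode
from urllib.request import urlopen
import xml.etree.ElementTree as ET

# Placeholder endpoint; each CLARIN centre exposes its own OAI-PMH endpoint.
ENDPOINT = "https://example.org/oai"
params = {"verb": "ListRecords", "metadataPrefix": "cmdi"}  # prefix is an assumption

with urlopen(f"{ENDPOINT}?{urlencode(params)}") as response:
    tree = ET.parse(response)

ns = {"oai": "http://www.openarchives.org/OAI/2.0/"}
for record in tree.iter("{http://www.openarchives.org/OAI/2.0/}record"):
    identifier = record.find("oai:header/oai:identifier", ns)
    print(identifier.text if identifier is not None else "(no identifier)")
```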

  • I fully agree that we should find an optimal balance between which information is stored in codemeta and which in a cmdi file, and how these two are kept in sync.

Agreed

  • If these topics are actively dealt with in the IG-Curation, then I am happy to join this IG. Please inform me when there are meetings

There was a meeting this morning; I'm not really a member of the IG-Curation group, but I was invited to say a bit about codemeta and software metadata. CMDI was discussed too, but you can probably say much more about it. In the discussion there seemed to be a consensus in favour of linked open data (using the schema.org vocabulary, which is what codemeta bases itself on too). @sebastiaanderks and Sebastiaan Fluitsma can probably update you more on this.

proycon commented 3 years ago

@JanOdijk I migrated all CMDI files from the SURFdrive to https://github.com/CLARIAH/software-metadata and sent you an invite to access/administer it. We should use that git repo as the authoritative source so we don't have any non-version-controlled copies around anymore.