citation-file-format / citation-file-format

The Citation File Format lets you provide citation metadata for software or datasets in plaintext files that are easy to read by both humans and machines.
http://citation-file-format.github.io
Creative Commons Attribution 4.0 International

How to avoid several standards? publiccode.yml versus Citation.cff #398

Open broeder-j opened 2 years ago

broeder-j commented 2 years ago

First, let me thank you for your great work and efforts, and please do not take this as rude (I am far away from all these efforts). I was just wondering if and how one can avoid multiple standards in this area:

As a data steward, I currently see two (metadata) standards emerging in this area (you are probably aware of this?): Citation.cff and publiccode.yml. While Citation.cff is meant for software and datasets, publiccode.yml is meant for software only. Did you get in contact with the publiccode.yml team to avoid ending up with two standards in the future? However this is done, maybe at this 'early stage' one could still merge the two? Actually, for certain software projects there are already other standards (which one cannot change). For example, for Python software, the metadata in pyproject.toml is finally being improved. So for every Python project, the people I work with already double, triple, ... their metadata (which I think is not nice).

At least in my area, Citation.cff is already widely used, promoted, and supported by Invenio (look at Zenodo), GitHub, and others (nicely done!). We definitely need rich metadata for software; Citation.cff is currently very minimal here, but it is the broader standard. In principle, Citation.cff could just adopt all allowed metadata of publiccode.yml if the type is software.

Or do you think there is a good reason to have two, or that these two will develop in totally different directions, i.e., that Citation.cff stays minimal (and is meant to), just for citation purposes, and that for software one keeps rich metadata in a publiccode.yml from which one could extract a Citation.cff? Then at some point there will be a publicdata.yml, and why would one need a Citation.cff after that, other than as an exchange format?

kevinmatthes commented 2 years ago

Does the second standard you mentioned provide the possibility to create a list of references for a repository?

My usual use case for a CITATION.cff is to document where some of the ideas I incorporated into a project come from. It also works nicely as a list of further reading when you recommend the project to your colleagues. Basically, I think of CITATION.cff as the analogue of the list of references in papers and theses. In an academic context, it is common to document the consulted literature in one or more BibTeX databases. CITATION.cff enables people to do the same with GitHub projects. Many research projects on GitHub would like their repositories to be cited mainly via the associated paper(s) the software relates to. This freedom to configure a preferred-citation that is something other than software is, in my opinion, unique to CITATION.cff.

As I already mentioned, this is just the academic point of view. There might be other use cases apart from this, but I think that CITATION.cff was originally designed especially and explicitly for academic use. Thus, I consider CITATION.cff a solution for different needs than just describing software and, hence, worth remaining a standard of its own.
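To make the preferred-citation and references use case described above more concrete, here is a minimal Python sketch. It assumes the CITATION.cff has already been parsed into a dict (for example with a YAML library). The keys `preferred-citation` and `references` are CFF keys; the helper names `citation_target` and `further_reading` and all sample values are made up for illustration.

```python
# Hypothetical, already-parsed CITATION.cff content (all values are made up).
cff = {
    "cff-version": "1.2.0",
    "message": "If you use this software, please cite the paper below.",
    "title": "Example Tool",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    # The work the maintainers would like to see cited instead of the software.
    "preferred-citation": {
        "type": "article",
        "title": "A Paper About Example Tool",
        "authors": [{"family-names": "Doe", "given-names": "Jane"}],
    },
    # Works the project itself builds on (its "list of references").
    "references": [
        {
            "type": "book",
            "title": "A Book the Project Builds On",
            "authors": [{"family-names": "Roe", "given-names": "Richard"}],
        },
    ],
}


def citation_target(metadata: dict) -> dict:
    """Return the work to cite: the preferred citation if one is configured,
    otherwise the software described by the file itself."""
    preferred = metadata.get("preferred-citation")
    if preferred:
        return preferred
    # Fall back to the software's own top-level metadata.
    return {
        "type": "software",
        "title": metadata["title"],
        "authors": metadata["authors"],
    }


def further_reading(metadata: dict) -> list:
    """Return the titles of the works the project references."""
    return [ref["title"] for ref in metadata.get("references", [])]


print(citation_target(cff)["title"])
print(further_reading(cff))
```

With the sample data, the paper is selected as the citation target, while the book only appears in the further-reading list. Removing the `preferred-citation` key makes the function fall back to the software itself.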

sdruskat commented 1 year ago

Thanks for opening this issue, @broeder-j, and thanks for your comments, @kevinmatthes!

> First, let me thank you for your great work and efforts, and please do not take this as rude (I am far away from all these efforts). I was just wondering if and how one can avoid multiple standards in this area:

Thanks 🙏! Don't worry, no offense taken, these are important questions to ask!

> As a data steward, I currently see two (metadata) standards emerging in this area (you are probably aware of this?): Citation.cff and publiccode.yml. While Citation.cff is meant for software and datasets, publiccode.yml is meant for software only.

To clear this up, the primary focus of CFF is and will remain software! We have introduced type: dataset support only as a permanently experimental feature for those who keep datasets on GitHub (mainly) and want citation information rendered.

> Did you get in contact with the publiccode.yml team to avoid ending up with two standards in the future? However this is done, maybe at this 'early stage' one could still merge the two? Actually, for certain software projects there are already other standards (which one cannot change). For example, for Python software, the metadata in pyproject.toml is finally being improved. So for every Python project, the people I work with already double, triple, ... their metadata (which I think is not nice).

This is actually the first time I've heard of publiccode.yml, thanks for bringing it to our attention! Perhaps it would be useful to mirror this issue to https://github.com/publiccodeyml/publiccode.yml as well? Alternatively, perhaps @ruphy and/or @sebbalex could chime in here.

In terms of metadata duplication, I think this is currently unavoidable, but should ideally always be an outcome of automation, so that developers have only one source to maintain for any subset of the metadata. I'm also sceptical that there will be the One Metadata Format to rule them all anytime soon.
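As a rough illustration of the "one source, generated duplicates" idea, here is a minimal Python sketch that renders a CITATION.cff from a single metadata dict. The `project` dict stands in for whatever source a project maintains (for example the parsed `[project]` table of a pyproject.toml); the function names, the sample values, and the naive name splitting are assumptions of this sketch, not part of any existing tool.

```python
import json

# Hypothetical single source of truth; all values are made up.
project = {
    "name": "example-tool",
    "version": "1.0.0",
    "authors": [{"name": "Jane Doe"}],
}


def split_name(full_name: str) -> tuple:
    """Naively split 'Given Family' on the last space.

    This is an assumption that fails for many real names; a real workflow
    would store given and family names separately.
    """
    given, _, family = full_name.rpartition(" ")
    return given, family


def to_cff(meta: dict) -> str:
    """Render a minimal CITATION.cff (CFF 1.2.0) from the shared metadata."""
    # json.dumps produces double-quoted strings, which YAML also accepts.
    q = json.dumps
    lines = [
        'cff-version: "1.2.0"',
        'message: "If you use this software, please cite it as below."',
        f"title: {q(meta['name'])}",
        f"version: {q(meta['version'])}",
        "authors:",
    ]
    for author in meta["authors"]:
        given, family = split_name(author["name"])
        lines.append(f"  - family-names: {q(family)}")
        if given:
            lines.append(f"    given-names: {q(given)}")
    return "\n".join(lines) + "\n"


print(to_cff(project))
```

Run as part of a release script or CI job, a generator along these lines would keep the CITATION.cff in sync without anyone editing it by hand.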

> At least in my area, Citation.cff is already widely used, promoted, and supported by Invenio (look at Zenodo), GitHub, and others (nicely done!). We definitely need rich metadata for software; Citation.cff is currently very minimal here, but it is the broader standard. In principle, Citation.cff could just adopt all allowed metadata of publiccode.yml if the type is software.

We actually had this discussion with regard to, and with, CodeMeta as well. The outcome has been that CITATION.cff should be used for software citation purposes, while CodeMeta should be used for generic software description and whenever automated services are involved, e.g., for reusing metadata from a publication repository (such as Zenodo). Both formats are generally compatible, and conversion tools are available for both directions.
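To give a feel for what such a conversion involves, here is a small Python sketch that maps a few CFF fields onto CodeMeta terms. The mapping table is a partial, illustrative selection and should not be read as the official crosswalk; the function name `cff_to_codemeta` and the sample values are made up, and existing conversion tools handle many more fields and edge cases.

```python
import json

# Partial, illustrative CFF -> CodeMeta key mapping (an assumption of this
# sketch, not the official crosswalk).
CFF_TO_CODEMETA = {
    "title": "name",
    "version": "version",
    "abstract": "description",
    "repository-code": "codeRepository",
    "date-released": "datePublished",
}


def cff_to_codemeta(cff: dict) -> dict:
    """Convert an already-parsed CITATION.cff dict to a CodeMeta-style dict."""
    codemeta = {
        "@context": "https://doi.org/10.5063/schema/codemeta-2.0",
        "@type": "SoftwareSourceCode",
    }
    # Copy over simple scalar fields that are present in the input.
    for cff_key, codemeta_key in CFF_TO_CODEMETA.items():
        if cff_key in cff:
            codemeta[codemeta_key] = cff[cff_key]
    # Map person authors; entity authors are ignored in this sketch.
    people = []
    for author in cff.get("authors", []):
        person = {"@type": "Person"}
        if "given-names" in author:
            person["givenName"] = author["given-names"]
        if "family-names" in author:
            person["familyName"] = author["family-names"]
        people.append(person)
    if people:
        codemeta["author"] = people
    return codemeta


# Hypothetical sample input; all values are made up.
sample = {
    "cff-version": "1.2.0",
    "title": "Example Tool",
    "version": "1.0.0",
    "authors": [{"family-names": "Doe", "given-names": "Jane"}],
}
print(json.dumps(cff_to_codemeta(sample), indent=2))
```

Citation-specific parts of CFF, such as `preferred-citation`, have no place in this simple table, which hints at why the two formats serve different purposes.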

So basically, it is out of scope for CFF to be a generic metadata format. It focuses on software metadata for citation purposes, including "advanced" cases such as "software citing software" (as @kevinmatthes mentions above).

I think one interesting question would be whether the publiccode.yml maintainers are aware of CodeMeta, and if so, what their reasons were for creating their own format rather than reusing CodeMeta?

> Or do you think there is a good reason to have two, or that these two will develop in totally different directions, i.e., that Citation.cff stays minimal (and is meant to), just for citation purposes, and that for software one keeps rich metadata in a publiccode.yml from which one could extract a Citation.cff? Then at some point there will be a publicdata.yml, and why would one need a Citation.cff after that, other than as an exchange format?

Yes, CFF will stay minimal, as you say, and focused on citation purposes. FWIW, I think the general tendency in those parts of the scholarly community I have some insight into is to favour CodeMeta as a general-purpose software metadata format, rather than publiccode.yml.

I'm actually looking into some of the (research) software workflows around publication, and how to deal with metadata (from different sources), in a project called HERMES. Perhaps this is of interest to you.