force11 / force11-sciwg

FORCE11 Software Citation Implementation Working Group
https://www.force11.org/group/software-citation-implementation-working-group
BSD 3-Clause "New" or "Revised" License

Citation File Format #24

Open sdruskat opened 6 years ago

sdruskat commented 6 years ago

In the wake of the WSSSPE5.1 discussion and speed blogging group on a standard format for CITATION files, development has started on a human- and machine-readable Citation File Format (CFF) (http://github.com/citation-file-format/citation-file-format).

I'd like to find out/discuss

sdruskat commented 6 years ago

Perhaps fixing sdruskat/citation-file-format#21 first would be a good idea (adding CFF to the CodeMeta crosswalk).

danielskatz commented 6 years ago

While I like this general idea, I also have some thoughts and suggestions:

1 - in https://danielskatzblog.wordpress.com/2017/09/25/software-heritage-and-repository-metadata-a-software-citation-solution/, I discuss software creation metadata vs software usage metadata:

One way to think about this is that there is some metadata that describe properties of the software itself as source code, such as: authors, language, license, version number, location, etc. Let’s call this software creation metadata. And there are also metadata that describe how the code is being used, possibly including how it is built, such as: compiler version, operating system, parallel computing platform, command-line options, etc. Let’s call this software usage metadata.

If the goal of the CFF is to cover the metadata needed for citation, it seems that there is far too much information being overloaded here. Items such as programming language, person roles, and references are not needed for citation, though they may be useful in other contexts.

2 - One such context is transitive credit, following the example in http://doi.org/10.5334/jors.by. While we provided our sample data in JSON-LD, it might be useful to compare your YAML data with ours and see what differences there are.

3 - If this information were stored in the DOI metadata, the existing converter at https://citation.crosscite.org/docs.html could be used to generate the citation in many formats.

For example:

curl -LH "Accept: application/x-bibtex" https://doi.org/10.1016/j.parco.2011.05.005

returns:

@Article{Wilde_2011,
  doi = {10.1016/j.parco.2011.05.005},
  url = {https://doi.org/10.1016%2Fj.parco.2011.05.005},
  year = 2011,
  month = {sep},
  publisher = {Elsevier {BV}},
  volume = {37},
  number = {9},
  pages = {633--652},
  author = {Michael Wilde and Mihael Hategan and Justin M. Wozniak and Ben Clifford and Daniel S. Katz and Ian Foster},
  title = {Swift: A language for distributed parallel scripting},
  journal = {Parallel Computing}
}

while

curl -LH "Accept: text/x-bibliography; style=apa" https://doi.org/10.1016/j.parco.2011.05.005

returns

Wilde, M., Hategan, M., Wozniak, J. M., Clifford, B., Katz, D. S., & Foster, I. (2011). Swift: A language for distributed parallel scripting. Parallel Computing, 37(9), 633–652. doi:10.1016/j.parco.2011.05.005
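The two curl calls above rely on DOI content negotiation: the same DOI URL is requested with a different Accept header each time. A minimal Python sketch of the same idea, using only the standard library, might look like this. The function names `build_citation_request` and `fetch_citation` are made up for illustration; the DOI and the media types are the ones from the curl examples.

```python
import urllib.request

DOI_RESOLVER = "https://doi.org/"


def build_citation_request(doi, accept):
    """Build (but do not send) a content-negotiation request for a DOI.

    `accept` is the media type placed in the Accept header, e.g.
    "application/x-bibtex" or "text/x-bibliography; style=apa".
    """
    return urllib.request.Request(DOI_RESOLVER + doi, headers={"Accept": accept})


def fetch_citation(doi, accept):
    """Send the request and return the response body as text.

    urllib follows HTTP redirects by default, which plays the role of
    curl's -L flag here. This function needs network access.
    """
    with urllib.request.urlopen(build_citation_request(doi, accept)) as response:
        return response.read().decode("utf-8")
```

Calling `fetch_citation("10.1016/j.parco.2011.05.005", "application/x-bibtex")` should correspond to the first curl command, and passing `"text/x-bibliography; style=apa"` to the second, assuming the resolver and converter behave as shown above.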

4 - going back to my blog, I also suggested:

the authors of the software who want to be cited can fill this gap, and could do so relatively easily. They just would need to create a single metadata file in the root of their repository, with an agreed upon name.

The first time I heard this, it was suggested by Martin Fenner, based on work done in the CodeMeta project, which has the goal of creating a minimal metadata schema for science software and code, in JSON and XML. Martin provided an example of how this could be done: the codemeta.json file in the repository https://github.com/datacite/maremma. According to Martin, the process by which DataCite today could generate a DOI and a citation from this is semi-manual and involves using https://github.com/datacite/bolognese for DataCite XML generation.

If code developers created a codemeta.json file in their repository when they started working on their project, they would then just need to keep it up to date, much like they do their README (description of their project) or CONTRIBUTORS (who has contributed to the project) files, and they might not need to create a CITATION (how the project should be cited) file. Or, the CONTRIBUTORS and CITATION files could be generated from the codemeta.json as part of continuous integration, or as part of releasing or packaging.
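To make the last idea concrete, here is a rough sketch of a script that a CI or release step could run to generate a plain-text CITATION file from codemeta.json. It assumes the file uses the CodeMeta/schema.org terms `name`, `version`, `codeRepository` and `author` (with `givenName` and `familyName`). The citation layout and the function names are arbitrary choices for illustration, not an agreed format.

```python
import json


def format_citation(codemeta):
    """Render a one-line plain-text citation from a parsed codemeta.json."""
    authors = codemeta.get("author", [])
    if isinstance(authors, dict):
        # JSON-LD allows a single object where a list is expected.
        authors = [authors]
    names = [
        f"{a.get('familyName', '')}, {a.get('givenName', '')}".strip(", ")
        for a in authors
    ]
    parts = ["; ".join(names), codemeta.get("name", "")]
    if "version" in codemeta:
        parts.append(f"Version {codemeta['version']}")
    if "codeRepository" in codemeta:
        parts.append(codemeta["codeRepository"])
    # Skip empty parts so missing metadata does not leave stray separators.
    return ". ".join(p for p in parts if p) + "."


def write_citation_file(codemeta_path="codemeta.json", out_path="CITATION"):
    """Read codemeta.json and write the generated citation to a CITATION file."""
    with open(codemeta_path, encoding="utf-8") as f:
        codemeta = json.load(f)
    with open(out_path, "w", encoding="utf-8") as f:
        f.write(format_citation(codemeta) + "\n")
```

With a hypothetical package `examplelib` at version 1.2.0 by an author "Ada Example", `format_citation` would produce `Example, Ada. examplelib. Version 1.2.0. https://example.org/examplelib.`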

sdruskat commented 6 years ago

Thanks, @danielskatz, for your comments!

Re 1

Agreed. In that respect the current specs are neither fish nor fowl; I'll say more about this under Re 2.

My gut feeling now is that it would probably be good to back-pedal to a minimal format that covers citation cases, and to develop anything on top of it in an extension (CFF Transitive, or whatever I/we decide it should cover). It should be easy enough to support both from the same infrastructural end points, such as parsers, converters, and validators.

As for the keys you mention, references I think is something that the format should support. To reiterate quickly, CFF files represent YAML maps with three keys, cff-version, message, references, where the value for references is a list of citation metadata sets. I have decided to implement it like this because I wanted to re-model the usage of plain-text CITATION files, which may specify more than one set of citation metadata for a software, e.g., a version and a paper. I still think that this is a valid use case, but I guess this should be more visible in the specs.
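For illustration, the three-key structure described above can be sketched as the Python data a YAML parser would hand back, together with a small structural check. Only the top-level keys `cff-version`, `message` and `references` come from the description above; the version string and the fields inside each reference (`type`, `title`, `version`) are placeholders, not taken from the specs.

```python
REQUIRED_KEYS = {"cff-version", "message", "references"}

# What a parsed CFF file with two citation metadata sets might look like:
# one for a software version and one for an accompanying paper.
example_cff = {
    "cff-version": "x.y.z",  # placeholder, not a real spec version
    "message": "If you use this software, please cite it as below.",
    "references": [
        {"type": "software", "title": "examplelib", "version": "1.2.0"},
        {"type": "article", "title": "examplelib: a paper about the software"},
    ],
}


def check_cff_structure(cff):
    """Return a list of structural problems; an empty list means none found."""
    problems = [f"missing key: {key}" for key in sorted(REQUIRED_KEYS - set(cff))]
    references = cff.get("references")
    if references is not None:
        if not isinstance(references, list):
            problems.append("references must be a list")
        else:
            for i, entry in enumerate(references):
                if not isinstance(entry, dict):
                    problems.append(f"references[{i}] must be a map")
    return problems
```

The check only looks at the shape (a map with three keys, where `references` is a list of maps); it says nothing about which fields a citation metadata set should contain.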

Re 2

This will be in the SSI blog post from WSSSPE5.1, but should probably be mentioned explicitly in the specs as well: CFF is definitely a compromise between plain text CITATION files and a more ideal state (a transitive credit system, such as described in your JORS paper). The relative human-readability offered by YAML is very compelling I think, especially if you have human actors in the usage chain of such data (e.g., for CITATION files being delivered with distributions of a software).

I'd be very interested in investigating if what your JSON-LD can do would be possible in YAML. If feasible, an implementation should be done in an extension of CFF (cf. Re 1) which could be merged into the minimal format later.

Re 3

I'm not quite sure I understand what this would mean for CFF (perhaps specify what information you mean by "this information"?). If you suggest for CFF to become a supported format for the converter, I think that this would be optimal.

Re 4

I think what sets CFF files apart from codemeta.json files as of now is a) greater human-readability and b) integration of more than one set of citation metadata in one file. Both make up CFF's quality as a low-threshold compromise. It is a compromise mainly because it doesn't enforce the software citation principles: it allows software to be cited only via software papers, etc. Does this make it "incompatible" with the WG/principles? If it does, might it still be worth pursuing as a first-step implementation of what the "What we need" section in the README specifies as "file with citation metadata in bibtex or json format (e.g. codemeta) in code repository root"?

For cases where there is only one reference, namely a software version, the file could easily be created on the fly from a codemeta.json file.
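A rough sketch of that on-the-fly conversion, for the single-reference case, could look as follows. It assumes the CodeMeta terms `name`, `version` and `author` (with `givenName` and `familyName`) on the input side. The keys inside the CFF reference (`type`, `title`, `version`, `authors`, `family-names`, `given-names`) are illustrative guesses, and a real converter would follow the crosswalk once CFF is part of it.

```python
def codemeta_to_cff(codemeta, cff_version, message):
    """Build a CFF-shaped map with one software reference from codemeta data."""
    authors = codemeta.get("author", [])
    if isinstance(authors, dict):
        # JSON-LD allows a single object where a list is expected.
        authors = [authors]
    reference = {
        "type": "software",
        "title": codemeta.get("name"),
        "version": codemeta.get("version"),
        "authors": [
            {"family-names": a.get("familyName"), "given-names": a.get("givenName")}
            for a in authors
        ],
    }
    # Drop fields for which codemeta.json had no value.
    reference = {k: v for k, v in reference.items() if v not in (None, [])}
    return {
        "cff-version": cff_version,
        "message": message,
        "references": [reference],
    }
```

The result is a plain map with the three top-level keys, so it could be handed to any YAML serializer to write the actual file.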


In summary, I think it makes sense to back-pedal to specs for a minimal format covering only citation metadata. If feasible, an extension could be specified that would implement transitive credit. If possible, CFF should be supported by the CrossCite converter. CFF should also be integrated with CodeMeta via the crosswalk. I need to investigate what the multiple-references property of CFF means for the latter.

I'm thankful for any further comments.

danielskatz commented 6 years ago

Thanks @sdruskat - what you say makes sense, given your explanation, and particularly, the idea of compromising between an ideal and what is practical. I wonder if it would make sense to add more of this motivation/landscape in Citation File Format (CFF)?

I don't have any particular stake in JSON-LD vs YAML, more just wondering about what we used in the JSON-LD example for transitive credit vs what you were using in YAML for citation & transitive credit.

Re 3, I think all I meant was that if instead of YAML you used schema.org, you would be able to take advantage of these converters. I don't really know who wrote them or maintains them (though @mfenner might), and perhaps you could add YAML to them via a pull request somewhere...

Finally, please don't take anything I said as the way things will be - this is new to all of us, and we are building up a common understanding from a bunch of different work and ideas. Your contributions help us along that path.

sdruskat commented 6 years ago

Thanks, Dan!

1- I'll definitely motivate the format better in the specs, and also describe relations to other projects.

2- I'll also have a closer look at the JSON-LD/YAML complex once I have a relatively stable version of core CFF and can start thinking about the extension implementing more of a transitive credit model.

3- I'll investigate integrating with CrossCite converters (https://github.com/citation-file-format/citation-file-format/issues/26). This would be a great feature to have, for adoption and creating tooling.

4- Thanks, I very much appreciate the note in your last paragraph. If I can make a contribution I'm happy, and your comments have already helped in clarifying some issues related to purpose and potential of CFF.

npch commented 6 years ago

You might also like to look at Citation Style Language, and see if you can express CFF in that form.

sdruskat commented 6 years ago

I've created some (and may create some more, depending on progress) hackable issues in the CFF repository which I'd like to invite people to work on during the hackathon. I think it makes sense to work on them in the following proposed order, but please feel free to start work on any of them if you think differently. Also, please feel free to open new issues (hackable and other) against the repo https://github.com/citation-file-format/citation-file-format.

  1. https://github.com/citation-file-format/citation-file-format/issues/29
  2. https://github.com/citation-file-format/citation-file-format/issues/30
  3. https://github.com/citation-file-format/citation-file-format/issues/21
  4. https://github.com/citation-file-format/citation-file-format/issues/14
  5. https://github.com/citation-file-format/citation-file-format/issues/31

sdruskat commented 6 years ago

codemeta/codemeta#170 is a PR that adds CFF to the CodeMeta crosswalk table.

sdruskat commented 6 years ago

Next steps (tracked in https://github.com/citation-file-format/citation-file-format/issues):