citation-file-format / citation-file-format

The Citation File Format lets you provide citation metadata for software or datasets in plaintext files that are easy to read by both humans and machines.
http://citation-file-format.github.io
Creative Commons Attribution 4.0 International

Add identifier term that isn't a doi #85

Closed moranegg closed 4 years ago

moranegg commented 5 years ago

Hello, I wanted to recommend the usage of CITATION.cff in academic workflows, but I was surprised that there isn't any term for software identification that isn't a DOI. Even if the usage of a DOI is common and is suggested as a best practice by the SCIWG, there are other ways to identify software, for example:

  • Wikidata entity identifiers (Q828742 for Scilab at https://www.wikidata.org/wiki/Q828742)
  • ascl identifiers (ascl:1908 at http://ascl.net/1905.018)
  • swMath identifier (swMath-id:834 for Scilab https://swmath.org/software/834)
  • handle
  • sha1
  • the swh identifier (hashed and resolved by Software Heritage)

Here is the last version captured on SWH: swh:1:rev:91fbfb81b72b3d09175fec6a06a5567dac820bd5;origin=https://github.com/scilab/scilab
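To make the structure of the swh identifier above concrete, here is a minimal Python sketch that splits it into its parts. The function name parse_swhid and the regular expression are illustrative assumptions based on the general shape of such identifiers (a swh:1: prefix, an object type, a 40-character hex hash, optional ;key=value qualifiers); this is not an official validator.

```python
import re

# Assumed shape: swh:1:<object type>:<40 hex digits>, optionally followed
# by ';key=value' qualifiers such as origin.
SWHID_RE = re.compile(
    r"^swh:1:(snp|rel|rev|dir|cnt):([0-9a-f]{40})((?:;[^;=]+=[^;]*)*)$"
)


def parse_swhid(text):
    """Split a Software Heritage identifier into its parts, or return None."""
    match = SWHID_RE.match(text.strip())
    if match is None:
        return None
    object_type, object_hash, raw_qualifiers = match.groups()
    qualifiers = {}
    for chunk in raw_qualifiers.split(";"):
        if chunk:
            key, _, value = chunk.partition("=")
            qualifiers[key] = value
    return {"object_type": object_type, "hash": object_hash, "qualifiers": qualifiers}


swhid = "swh:1:rev:91fbfb81b72b3d09175fec6a06a5567dac820bd5;origin=https://github.com/scilab/scilab"
print(parse_swhid(swhid))
```

Strings that do not follow this shape, such as an ascl or Wikidata identifier, simply yield None here, which illustrates why a single untyped field makes the different identifier kinds hard to tell apart.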

The panorama of identifiers is broad and will be discussed in the joint FORCE11 & RDA Software Source Code Identification Working Group (SCID WG). I would like to propose a term that can contain a list of identifiers; this would enable capturing more metadata and linking it accordingly on different platforms.

What do you think?

jspaaks commented 5 years ago

Hi Morane, thanks for your comment. I agree that the schema should allow for other persistent identifiers than only DOIs. I'm not sure though if a list of persistent identifiers is the way to go. The reason being that while the list items are surely intended to point to various copies of the same thing, I anticipate there will be inconsistencies between the copies, which defeats the purpose of persistent identifiers. Also, from the perspective of whoever is uploading the digital object, one copy is probably preferable.

Be aware that I anticipate we will only be able to deal with this issue after we are done with https://github.com/citation-file-format/citation-file-format/issues/71 and the corresponding PR https://github.com/citation-file-format/citation-file-format/pull/77, which have us near the point of choking... @sdruskat and I are planning to set aside some time in July to work on those.

TBC!

moranegg commented 5 years ago

Hi @jspaaks ! Thank you so much for your quick answer. First it's great we agree that the schema should allow different persistent identifiers. The term used on CodeMeta is identifier and it might be easier to use the same term. About the list, you wrote:

"I'm not sure though if a list of persistent identifiers is the way to go. The reason being that while the list items are surely intended to point to various copies of the same thing, I anticipate there will be inconsistencies between the copies, which defeats the purpose of persistent identifiers. Also, from the perspective of whoever is uploading the digital object, one copy is probably preferable."

I would love to continue this discussion!

The identifiers I gave as examples are, for the most part, project identifiers that do not identify a specific version. How we should identify software is a big debate: do we produce a high-level identifier like a DOI for each artifact and each version of a piece of software, even if the metadata stays unchanged? The problem with identifying a version is that changing the content of the CITATION.cff file changes the version as well, which is a problematic race condition. I understand this might not be a good place for the debate and we should wait for the SCID WG recommendations. So if you can add the identifier term for a unique PID that isn't a DOI, that would be satisfying. This is not urgent, but I can't recommend CFF before that, because we do not work with DOI identifiers.

Cheers,

jspaaks commented 5 years ago

I agree CFF should also use the identifier keyword if we can.

BTW, the race condition you speak of has been noted; we discussed it in https://github.com/research-software-directory/research-software-directory/issues/77. That thread offers some partial solutions, but note that the "admin interface" I mention there is something specific to the Research Software Directory, so those options are not viable for a general solution.

It is also the reason why I included the --ignore-suspect-keys flag in cffconvert:

Usage: cffconvert [OPTIONS]

Options:
  -if, --infile TEXT          Path to the CITATION.cff input file.
  -of, --outfile TEXT         Path to the output file.
  -f, --outputformat TEXT     Output format:
                              bibtex|cff|codemeta|endnote|ris|zenodo
  -u, --url TEXT              URL of the repo containing the CITATION.cff
                              (currently only github.com is supported; may
                              include branch name, commit sha, tag name). For
                              example: 'https://github.com/citation-file-
                              format/cff-converter-python' or
                              'https://github.com/citation-file-format/cff-
                              converter-python/tree/master'
  --validate                  Validate the CITATION.cff found at the URL or
                              supplied through '--infile'
  -ig, --ignore-suspect-keys  If True, ignore any keys from CITATION.cff that
                              are likely out of date, such as 'commit', 'date-
                              released', 'doi', and 'version'.
  --verbose                   Provide feedback on what was entered.
  --version                   Print version and exit.
  --help                      Show this message and exit.

For the moment, we chose to let the Research Software Directory use cffconvert with the --ignore-suspect-keys flag, and then we add the missing doi, version, and date-released from Zenodo's metadata.
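Roughly, that workflow looks like the sketch below. It is written against plain dicts and is not the actual cffconvert or Research Software Directory code; the function names and sample values are made up for illustration, and only the list of suspect keys comes from the help text above.

```python
# Keys that the cffconvert help text lists as likely out of date.
SUSPECT_KEYS = ("commit", "date-released", "doi", "version")


def drop_suspect_keys(cff):
    """Return a copy of the parsed CITATION.cff without the suspect keys."""
    return {key: value for key, value in cff.items() if key not in SUSPECT_KEYS}


def merge_release_metadata(cff, release):
    """Fill doi, version and date-released from separately fetched metadata."""
    merged = drop_suspect_keys(cff)
    for key in ("doi", "version", "date-released"):
        if key in release:
            merged[key] = release[key]
    return merged


# Hypothetical values, for illustration only.
cff = {"title": "Example", "version": "0.9.0", "doi": "10.1234/old", "commit": "abc123"}
release = {"doi": "10.1234/new", "version": "1.0.0", "date-released": "2019-07-01"}
print(merge_release_metadata(cff, release))
```

The point of the design is that the archive's metadata is treated as the source of truth for anything that can only be known after a release has been made.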

moranegg commented 5 years ago

Thanks for the explanation about the tool and how you deal with the race condition.

I'll be watching for updates on the identifier term.

Anyway, I think that CITATION.cff is a more friendly way of keeping metadata and it can be of great use in the scientific community. Especially in publication and archival workflows. Here is an example of an archived piece of scientific software in the French national archive (HAL): https://hal.inria.fr/hal-01882337 The metadata is inserted by hand and transmitted to Software Heritage in CodeMeta vocabulary. It can be a great asset to have the metadata included in the content in a friendly format.

sdruskat commented 5 years ago

Thanks @moranegg! I've spoken to people from other communities as well, and identifier will definitely come in the next version, as it's badly needed.

moranegg commented 5 years ago

Thank you @sdruskat, I will be waiting for the next version.

sdruskat commented 4 years ago

Further to this, @mdolling-gfz has suggested typing the identifier in a separate field. His suggestion was to do

identifier:
  - type: DOI
  - path: foo.bar

I shy away from introducing that extra level. While having an identifier type would certainly help make the identifier (more) machine-actionable, I think an extra field would be enough:

identifier: 10.1234/1m4D01
identifier-type: doi

Ideally, the identifier-type value would be an enum. I guess we could start by building a list of common identifiers, then accept others via issue/PR.
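A rough Python sketch of how a consumer might read the two flat keys against such an enum. The enum members and the function name are assumptions for illustration; the thread does not fix the list of supported types.

```python
from enum import Enum


class IdentifierType(Enum):
    # Hypothetical starting list; further members would be added via issue/PR.
    DOI = "doi"
    SWH = "swh"
    ASCL = "ascl"


def read_flat_identifier(cff):
    """Return (IdentifierType, value) from the two flat keys, or None if absent."""
    if "identifier" not in cff:
        return None
    # Looking an Enum up by value raises ValueError for anything outside the list.
    id_type = IdentifierType(cff["identifier-type"])
    return id_type, cff["identifier"]


print(read_flat_identifier({"identifier": "10.1234/1m4D01", "identifier-type": "doi"}))
```

Note that the flat layout only has room for one identifier per file, since both keys can appear only once.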

What do you think, @jspaaks & @moranegg?

moranegg commented 4 years ago

I really appreciate taking the time to ask here.

I agree with @mdolling-gfz and I will back that with other uses in different vocabularies.

  1. From schema.org Software Source Code class (https://schema.org/SoftwareSourceCode):
identifier (PropertyValue or Text or URL): The identifier property represents any kind of identifier for any kind of Thing, such as ISBNs, GTIN codes, UUIDs etc. Schema.org provides dedicated properties for representing many of these, either as textual strings or as URL (URI) links. See background notes for more details.

When using PropertyValue you can distinguish the identifier type:

"identifier": { "@type": "PropertyValue", "propertyID": "OCoLC", "value": "889647468" },

  2. On Wikidata a software item can have many identifiers, including the entity identifier Qxxx. Here is an example: https://www.wikidata.org/wiki/Q59652265

  3. From CodeMeta (using schema.org identifier):

"identifier": { "@id": "schema:identifier", "@type": "@id" },

During the Hackathon we discussed this problematic situation of identifying at different levels of granularity and with different types of identifiers; see the identifiers crosswalk outcome.

In biblatex, the property name is the identifier-type:

There might be different identifiers to the same thing, and it is difficult to specify which identifier-type is appropriate to which identifier-value even with an enum. It can cause a lot of automatic extraction problems.

To stay compatible with your last version, I propose keeping doi as a separate property and adding identifier as a complex item with type and value. Neither should be mandatory.
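As an illustration of why the type/value shape is convenient, here is a small Python sketch that maps such an item onto the schema.org PropertyValue form quoted earlier in this comment. The function name and the sample type string are assumptions; no such converter is described in this thread.

```python
import json


def to_property_value(identifier):
    """Map a {type, value} item to a schema.org PropertyValue dict."""
    return {
        "@type": "PropertyValue",
        "propertyID": identifier["type"],
        "value": identifier["value"],
    }


# The type and value strings below are placeholders for illustration.
item = {"type": "swMath-id", "value": "834"}
print(json.dumps({"identifier": to_property_value(item)}, indent=2))
```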

Thanks again for continuing this discussion.

sdruskat commented 4 years ago

Thanks @moranegg! Makes sense.

What do you and @jspaaks think about introducing identifiers (plural) rather than identifier, as a piece of software can easily have more than one identifier, e.g., a DOI and a Software Heritage ID (e.g., for a release)?

I.e., have:

identifiers:
  - type: doi
    value: 10.1234/imadoi.9876
  - type: swh-release
    value: swh://imacomplicatedandverylongsoftwareheritagereleaseidentifier

Morane, as far as I understand you argue against enums, so that would make both type and value fields that accept simple strings?

jspaaks commented 4 years ago

I think the identifiers key from https://github.com/citation-file-format/citation-file-format/issues/85#issuecomment-551016103 would be an improvement, but I stand by my comment about specificity. Supposing the items in the list of identifiers are pointers to copies of the same thing, an advantage would be redundancy: if one resource disappears, there are still other copies. The downsides are that there will be differences between copies, that there is more of a burden on people supplying content, and that there are more choices to confuse potential uploaders. Note that such confusion may well lead to people not uploading anything.

I guess we could leave it to the judgment of whoever is entering data what they feel is appropriate--with identifiers a list, at least they have the opportunity to enter multiple pointers, but they aren't required to do so.

For type, an enum would help the development of any code that uses CFF files, because you can just implement a simple switch in your code based on the enum value. This also helps in communicating what is and what isn't supported in certain software. In order to retain flexibility, we could introduce other as an element of the enum to catch any previously undefined identifiers. Software developers wanting to support other would then write code with a switch case for other, after which they can start trying to infer the appropriate way to handle the unknown identifier (this kind of guessing usually yields code that is difficult to read and maintain).
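A minimal sketch of that kind of switch in Python, assuming a hypothetical enum with the members doi, swh and other (the member list and the function name are illustrative assumptions, not something agreed in this thread):

```python
def describe_identifier(identifier):
    """Render one identifier for display, switching on its type."""
    id_type = identifier["type"]
    value = identifier["value"]
    if id_type == "doi":
        return "DOI: " + value
    elif id_type == "swh":
        return "Software Heritage: " + value
    elif id_type == "other":
        # No guessing here: pass the raw value through unchanged.
        return value
    # Anything else is outside the (hypothetical) enum.
    raise ValueError("unsupported identifier type: " + repr(id_type))


print(describe_identifier({"type": "doi", "value": "10.1234/imadoi.9876"}))
```

The other branch is where the guessing would have to go; this sketch deliberately does none and just returns the value as entered.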

For value, I would like to see if we can get by with just strings (we can narrow it down later if need be; that's one reason why CFF files have cff-version, after all).

sdruskat commented 4 years ago

So after the discussion above, I assume that the following will fix this issue:

  1. Have a non-required field/term identifiers, which is a list of identifier objects.
    • An identifier object consists of two required fields/terms:
      1. type: An enum from a list of supported identifier types (e.g., software-heritage identifiers) + other
      2. value: A string representing the value of the identifier, e.g., swh:1:rel:99f6850374dc6597af01bd0ee1d3fc0699301b9f, which points to the 0.5.0 release of duecredit.
  2. This is implemented in the schema
  3. This is documented in the README = the specs
  4. The process of adding a new identifier type to be supported (i.e., via an issue (and optionally a PR) against this repo) can be documented.
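The rules in item 1 can be sketched as a small check over an already parsed CITATION.cff (a plain dict). This is only an illustration in Python, not the schema implementation mentioned in item 2; the tuple of supported types and the function name are assumptions.

```python
# Hypothetical enum members; the list above only fixes "other" as the catch-all.
SUPPORTED_TYPES = ("doi", "swh", "other")


def validate_identifiers(cff):
    """Return a list of error messages; an empty list means the field is valid."""
    if "identifiers" not in cff:
        return []  # the field is not required
    items = cff["identifiers"]
    if not isinstance(items, list):
        return ["identifiers must be a list"]
    errors = []
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            errors.append(f"identifiers[{index}] must be an object")
            continue
        # Both fields are required on every identifier object.
        for field in ("type", "value"):
            if field not in item:
                errors.append(f"identifiers[{index}] is missing '{field}'")
        if "type" in item and item["type"] not in SUPPORTED_TYPES:
            errors.append(f"identifiers[{index}] has unsupported type {item['type']!r}")
        if "value" in item and not isinstance(item["value"], str):
            errors.append(f"identifiers[{index}] value must be a string")
    return errors


example = {
    "identifiers": [
        {"type": "swh", "value": "swh:1:rel:99f6850374dc6597af01bd0ee1d3fc0699301b9f"}
    ]
}
print(validate_identifiers(example))
```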

I'll write a PR and ask @moranegg and @jspaaks to review.

This doesn't really solve the chicken-egg issue a.k.a. race condition, but this may be something that CFF cannot tackle.

jspaaks commented 4 years ago

Hi @sdruskat, sounds good to me. A suggestion on what the type/value combination should look like: I think we should try to make it such that type points to an entity for which value is meaningful. As an example,

--- 
identifiers: 
  - 
    type: "https://doi.org"
    value: 10.1038/533452a

If we go this way, maybe type should be named differently, maybe resolver or authority or something.

Additionally, we could have shorthand notation for type, for example doi could replace the url in the snippet above.

Not sure how this system works for things that are not very specific, like a GitHub URL,

--- 
identifiers: 
  - 
    type: "https://github.com"
    value: "citation-file-format/citation-file-format/issues/85#issuecomment-552450319"

seems a bit contrived.
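To illustrate the resolver idea and the shorthand, here is a rough Python sketch. The shorthand table and the function name are assumptions; only the doi shorthand is mentioned above.

```python
# Shorthand names mapped to the resolver they stand for (assumed table).
SHORTHANDS = {"doi": "https://doi.org"}


def to_url(identifier):
    """Join the resolver given in type with value to form a URL."""
    base = SHORTHANDS.get(identifier["type"], identifier["type"])
    return base.rstrip("/") + "/" + identifier["value"].lstrip("/")


print(to_url({"type": "https://doi.org", "value": "10.1038/533452a"}))
print(to_url({"type": "doi", "value": "10.1038/533452a"}))
```

Simple concatenation like this works whenever the resolver accepts the value as a path, which covers the GitHub snippet as well, although there the split between type and value is arbitrary.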

moranegg commented 4 years ago

This looks good, I agree that software can have more than one identifier, so yes for identifiers as a property with type-value tuples. This is the same pattern with authors in CFF compared to multiple entries of author in CodeMeta.

To extend @jspaaks example:

identifiers:
  - type: https://archive.softwareheritage.org/
    value: swh:1:snp:db942cd85528df109fea1e483c71a08e53523554;origin=https://github.com/sagemath/sage/
  - type: https://github.com/
    value: sagemath/sage/
  - type: https://swMath.org/software/
    value: "825"