conda / conda-build

Commands and tools for building conda packages
https://docs.conda.io/projects/conda-build/

`about: cite-as`: citation information in package meta data #2630

Closed epruesse closed 2 years ago

epruesse commented 6 years ago

Currently, the about: section in meta.yaml describes the package (summary, description, home), its license (license, license_file, license_family, license_url), its documentation (doc_url, readme), and developer information (dev_url).

There are various additional fields that could be put there. Structured data formats for describing a software package have been explored before, e.g. https://wiki.debian.org/UpstreamMetadata or http://oss-watch.ac.uk/resources/doap. Since the about section is often only sparsely populated, the cost of adding a few fields is low, and the decision about which fields are mandatory can be left to the channels.

For scientific applications, publication information would be extremely useful to have. Since there may be many publications associated with a software package, I would suggest calling the field cite-as and listing only the one publication the authors of the software currently ask to be cited.

Since citation references are themselves structured, it makes sense to split the information into subfields, e.g.

about:
    cite-as:
        authors: A. Kabelkanal, B. Maintenance
        title: On ducts and cables, an overview
        publisher: Oxford Industries
        journal: Bioproducts
        year: 2020
        month: May
        volume: 29304
        issue: 4002
        page: 10--104
        DOI: 10.99999/Bioproducts/BPX402
        URL: https://localhost:8888/pub.pdf

In most cases, it would be sufficient to fill in the DOI, as everything else can easily be looked up from it. doi.org could be queried during package build (see https://citation.crosscite.org/docs.html) and the resulting meta data stored with the package:

curl -LH 'Accept: application/json' https://doi.org/10.1093/bioinformatics/bts252
curl -LH 'Accept: application/x-bibtex' https://doi.org/10.1093/bioinformatics/bts252
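
For use inside a build tool, the same content negotiation could be done from Python. The following is a minimal sketch using only the standard library; the function names `build_doi_request` and `fetch_citation` are hypothetical and not part of conda-build. The Accept values are the ones from the curl commands above.

```python
import json
import urllib.parse
import urllib.request

DOI_RESOLVER = "https://doi.org/"


def build_doi_request(doi, accept="application/json"):
    """Build (but do not send) a content-negotiation request for a DOI."""
    url = DOI_RESOLVER + urllib.parse.quote(doi, safe="/")
    return urllib.request.Request(url, headers={"Accept": accept})


def fetch_citation(doi, accept="application/json", timeout=10):
    """Resolve a DOI and return its metadata.

    urlopen follows redirects by default, like `curl -L`.
    JSON responses are parsed into a dict; anything else
    (e.g. BibTeX) is returned as text.
    """
    request = build_doi_request(doi, accept)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = response.read().decode("utf-8")
    return json.loads(body) if "json" in accept else body


# Usage (needs network access):
#   fetch_citation("10.1093/bioinformatics/bts252", "application/x-bibtex")
```

Keeping request construction separate from the network call makes the lookup easy to skip or mock when a build runs offline.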

Technically, the DOI itself would suffice to implement things like conda env bibliography -n mouse_analysis --format apa. Storing all meta data and displaying it on anaconda should increase search engine visibility of the package, though. It also allows adding references for which no DOI is available and avoids relying on the availability of an external resource.
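
To illustrate what a bibliography command could do with stored fields, here is a rough sketch that renders one cite-as mapping as a single reference line. The function `format_reference` is hypothetical, the field names are those of the example above, and the layout is only loosely APA-like, not a conforming APA formatter.

```python
def format_reference(cite):
    """Render a cite-as mapping as one plain-text reference line."""
    parts = []

    # Authors and year form the head of the reference.
    head = cite.get("authors", "")
    if "year" in cite:
        head = f"{head} ({cite['year']})".strip()
    if head:
        parts.append(head)

    if "title" in cite:
        parts.append(cite["title"])

    # Journal, volume(issue) and pages form the venue part.
    venue = cite.get("journal", "")
    if venue and "volume" in cite:
        venue += f", {cite['volume']}"
        if "issue" in cite:
            venue += f"({cite['issue']})"
    if venue and "page" in cite:
        venue += f", {cite['page']}"
    if venue:
        parts.append(venue)

    text = ". ".join(parts)
    if text:
        text += "."
    if "DOI" in cite:
        text = f"{text} https://doi.org/{cite['DOI']}".strip()
    return text


example = {
    "authors": "A. Kabelkanal, B. Maintenance",
    "title": "On ducts and cables, an overview",
    "journal": "Bioproducts",
    "year": 2020,
    "volume": 29304,
    "issue": 4002,
    "page": "10--104",
    "DOI": "10.99999/Bioproducts/BPX402",
}
print(format_reference(example))
```

A record that carries only a DOI still yields a usable line (the doi.org link), which matches the point that the DOI alone is technically sufficient.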

On the implementation side, it would probably suffice as a first step to add a cite-as field (alternative names, if you guys prefer, could be citation or reference) to metadata.py. I'm not sure whether it converts the entire tree to json though. A second step might be querying doi.org if only DOI is present and auto-filling the other fields. Having a list of permissible fields and types for the cite-as field would probably be good as well.
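
A list of permissible fields and types could start out as a simple mapping. The sketch below is an assumption for discussion, not existing conda-build code: the names `CITE_AS_FIELDS` and `validate_cite_as` are hypothetical, and the types are guessed from how YAML would load the example values above (numbers for year, volume and issue, strings otherwise).

```python
# Hypothetical schema: field name -> expected Python type after YAML loading.
CITE_AS_FIELDS = {
    "authors": str,
    "title": str,
    "publisher": str,
    "journal": str,
    "year": int,
    "month": str,
    "volume": int,
    "issue": int,
    "page": str,
    "DOI": str,
    "URL": str,
}


def validate_cite_as(cite_as):
    """Return a list of problems found in a cite-as mapping (empty if none)."""
    errors = []
    for key, value in cite_as.items():
        expected = CITE_AS_FIELDS.get(key)
        if expected is None:
            errors.append(f"unknown field: {key}")
        elif not isinstance(value, expected):
            errors.append(
                f"{key}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return errors


print(validate_cite_as({"year": "2020", "doi": "10.99999/Bioproducts/BPX402"}))
```

Returning a list of messages rather than raising on the first problem lets a linter report every issue in a recipe at once. Whether field names should be case-sensitive (DOI vs. doi) is one of the structural questions that would need to be settled first.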

msarahan commented 6 years ago

Seems like a good idea. Does R do anything major about this? They are more academically-skewed, and may be a good model here.

If you want to see this, you'll need to submit the PRs for implementing it. I think it's a good idea, but I can't put much effort towards doing it myself.

epruesse commented 6 years ago

Well, R has citation(packagename) which returns a citation object, a subclass of bibentry, so it uses essentially the BibTeX model. The result for ggplot2 looks like this:

> str(citation("ggplot2"))
List of 1
 $ :Classes 'citation', 'bibentry'  hidden list of 1
  ..$ :List of 6
  .. ..$ author   :Class 'person'  hidden list of 1
  .. .. ..$ :List of 5
  .. .. .. ..$ given  : chr "Hadley"
  .. .. .. ..$ family : chr "Wickham"
  .. .. .. ..$ role   : NULL
  .. .. .. ..$ email  : NULL
  .. .. .. ..$ comment: NULL
  .. ..$ title    : chr "ggplot2: Elegant Graphics for Data Analysis"
  .. ..$ publisher: chr "Springer-Verlag New York"
  .. ..$ year     : chr "2009"
  .. ..$ isbn     : chr "978-0-387-98140-6"
  .. ..$ url      : chr "http://ggplot2.org"
  .. ..- attr(*, "bibtype")= chr "Book"
  .. ..- attr(*, "textVersion")= chr "H. Wickham. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2009."
 - attr(*, "class")= chr [1:2] "citation" "bibentry"
 - attr(*, "mheader")= chr "To cite ggplot2 in publications, please use:"

epruesse commented 6 years ago

Debian specifies these fields for reference:

Reference:
  Author: <please use full names and separate multiple author by the keyword "and">
  Title:
  Journal:
  Year:
  Volume:
  Number:
  Pages:
  DOI:
  PMID:
  URL:
  eprint:
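
Debian keeps all authors in one string joined by the keyword "and", whereas R stores a list of person objects. If conda adopted the single-string form, tools would need to split it again. A small sketch of that step, with the hypothetical helper name `split_authors`:

```python
import re


def split_authors(author_field):
    """Split a Debian-style author string on the keyword "and"."""
    # Require whitespace on both sides so names such as
    # "Anderson" or "Sandra" are not split.
    names = re.split(r"\s+and\s+", author_field.strip())
    return [name.strip() for name in names if name.strip()]


print(split_authors("Hadley Wickham"))
print(split_authors("A. Kabelkanal and B. Maintenance"))
```

Storing authors as a YAML list in the first place would avoid this parsing step altogether, at the cost of a slightly more verbose recipe.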

In the case of ggplot2, the upstream meta data in Debian is defined via this file:

Contact: Hadley Wickham <h.wickham@gmail.com>
Name: ggplot2
Reference:
  Author: Hadley Wickham
  BookTitle: "ggplot2: elegant graphics for data analysis"
  Publisher: Springer New York
  Year: 2009
  ISBN: 978-0-387-98140-6
  URL: http://had.co.nz/ggplot2/book

The data returned is a string though, so not easily parsed.

epruesse commented 6 years ago

> If you want to see this, you'll need to submit the PRs for implementing it. I think it's a good idea, but I can't put much effort towards doing it myself.

@msarahan I can see what I can fit in. I'd like a little more brainstorming on what structure the data should have, as this type of thing is painful to change later. I'll crosspost an issue at Bioconda and see what the core team there thinks.

github-actions[bot] commented 2 years ago

Hi there, thank you for your contribution!

This issue has been automatically marked as stale because it has not had recent activity. It will be closed automatically if no further activity occurs.

If you would like this issue to remain open please:

  1. Verify that you can still reproduce the issue at hand
  2. Comment that the issue is still reproducible and include:
    • What OS and version you reproduced the issue on
    • What steps you followed to reproduce the issue

NOTE: If this issue was closed prematurely, please leave a comment.

Thanks!