add field to reference data usage citations - Githubissues

NCEAS / eml

Ecological Metadata Language (EML)

https://eml.ecoinformatics.org/

GNU General Public License v2.0

add field to reference data usage citations #259

Closed mbjones closed 6 years ago

mbjones commented 7 years ago

Author Name: Matt Jones (Matt Jones)
Original Redmine Issue: 6283, https://projects.ecoinformatics.org/ecoinfo/issues/6283
Original Date: 2013-12-06
Original Assignee: Matt Jones

Consider adding an optional top level field to eml-dataset to provide this, possibly something like:

/eml/dataset/dataUsageCitation, which would be of type CitationType

See discussion on eml-dev regarding this issue: http://lists.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/2013-December/002004.html

mbjones commented 7 years ago

Original Redmine Comment
Author Name: Matt Jones (Matt Jones)
Original Date: 2013-12-11T18:41:41Z

Here's a proposed element definition for the /eml/dataset/citation field. I have preliminarily checked this into EML trunk (r2344) for incorporation in the next release. As it is optional and at the end of the dataset fields, it should be fully backward compatible with prior versions of EML.

<xs:element name="citation" type="cit:CitationType" minOccurs="0" maxOccurs="unbounded">
  <xs:annotation>
    <xs:appinfo>
       <doc:tooltip>Data Citation</doc:tooltip>
       <doc:summary>A citation to articles or products in which the
       dataset is used or referenced.</doc:summary>
       <doc:description>A citation to articles or products in which the
       dataset is used or referenced. The citation element contains 
       general information about a literature resource that has used or
       references this dataset resource.
       </doc:description>
    </xs:appinfo>
  </xs:annotation>
</xs:element>
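
As a rough illustration of how a consumer might read this element, the following Python sketch parses a heavily simplified dataset fragment and lists the titles of its citation children. The fragment is hypothetical: it omits the EML namespaces and most dataset fields, and the function name `usage_citation_titles` is made up for this example. It is not a valid EML document and is not taken from the EML code base.

```python
import xml.etree.ElementTree as ET
from typing import List

# Hypothetical, simplified fragment: no namespaces, most dataset fields omitted.
FRAGMENT = """
<dataset>
  <title>Example dataset</title>
  <citation>
    <title>First paper that used the data</title>
  </citation>
  <citation>
    <title>Second paper that used the data</title>
  </citation>
</dataset>
"""


def usage_citation_titles(dataset_xml: str) -> List[str]:
    """Return the title of each citation child of the dataset element."""
    dataset = ET.fromstring(dataset_xml)
    titles = []
    # minOccurs="0" and maxOccurs="unbounded": zero or many citations are allowed,
    # so an empty result is a normal outcome rather than an error.
    for citation in dataset.findall("citation"):
        titles.append(citation.findtext("title", default="").strip())
    return titles


print(usage_citation_titles(FRAGMENT))
```

Because the element is optional, a document written before the change simply yields an empty list here, which is the backward compatibility the comment describes.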

mbjones commented 7 years ago

Checking this, it has already been merged into both master and the BRANCH_EML_2_2 branch with SHA 1ad87d39. So this can be closed and reviewed for release.

csjx commented 7 years ago

After re-reading the thread in the eml-dev email list, I think we may need to re-open this issue in order to iron out the definition of this element. Carl (@cboettig) raised the issue that the definition should be a reference to the "canonical" paper associated with the dataset. Margaret (@mobb) and Wade brought up the issue that citing all articles that use this dataset is not realistic in that the information will go stale quickly. Before we write this in stone, let's be sure it's clearly defined. Carl, can you suggest improvements to the definition?

mbjones commented 7 years ago

Here's the link to that thread in eml-dev for reference: http://lists.nceas.ucsb.edu/ecoinformatics/pipermail/eml-dev/2013-December/thread.html#2004

cboettig commented 7 years ago

Thanks for revisiting this.

I still think this is an important issue, but difficult to implement satisfactorily because the whole idea is more of a practical hack than a technically precise idea. We struggled with this recently in codeMeta, and settled on a user's suggestion of referencePublication, https://github.com/codemeta/codemeta/issues/144.

To be clear, I think there's a difference in the mind of many researchers between a publication that is closely connected with the creation of a dataset, and other subsequent papers that also "use" the same data. This is the notion of "canonical" that I'm trying to get at which I think is not reflected in https://projects.ecoinformatics.org/ecoinfo/issues/6283.

I do not think a "canonical" citation link gets stale in the same way as a list of 'publications that use the data' does.

In Dryad, the concept is well-defined and clearly advertised on every data page: "Please cite the following publication as well as the dataset", since all Dryad datasets must be associated with a unique publication. It's unclear how to define this connection in EML.

In general, it seems this concept is a hack, best summarized as "please cite the following paper because the powers-that-be care a lot more about how many citations my papers get than my data, even though semantically / logically citing the data is more meaningful". The real problem of course is that the notion of "citation" is both semantically vague and fundamentally overloaded as a tool for communicating a provenance relationship and a metric for quality. I see the Dryad policy as essentially trying to split these roles: cite here (a paper) for metrics, cite here (data) for provenance, but it clearly leaves something to be desired.

Such a 'canonical' paper isn't purely a citation bucket of course: it probably also describes the collection, quality control, etc. of the data, i.e. it is essentially part of the metadata record, but I think EML already provides a mechanism to indicate that.

p.s. isn't there an unrelated issue here about how to associate two top-level EML objects? (e.g. dataset, software, literature, protocols). Seems like it might be reasonable to want to have both in the same EML document, or at least have a good vocab for expressing how they relate (or maybe ORE / PROV is already the solution there).

mbjones commented 7 years ago

This citation field defined here is not meant to be a canonical citation field, but rather a data usage citation, defined as "A citation to articles or products in which the dataset is used or referenced."

I'm really not too enthusiastic about a canonical citation that is independent of our existing citation fields. Currently, an EML document contains all of the info needed to cite a dataset. A canonical citation that was separate from these core bibliographic fields would be redundant, and therefore would introduce confusion if the fields differed. Typically, the dataset citation would be in the following format (or equivalent using the same fields):

Creator(s). pubDate. Title. Publisher. packageId.

For a specific example from an EML document in the Arctic Data Center:

Dr. Matt Nolan, Austin S. Post, William Hauer, Alexander Zinck, and Shad O'Neel. 2017. Photogrammetric scans of aerial photographs of North American glaciers, 1975. Roll 2 jpegs. Arctic Data Center. doi:10.18739/A2H21W.
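
The format above can be sketched in a few lines of Python. This is only an illustration of assembling the five fields in order; the function names `join_names` and `format_citation` are hypothetical, and the values are passed in as plain strings rather than extracted from an EML document.

```python
from typing import List


def join_names(names: List[str]) -> str:
    """Join creator names as 'A', 'A and B', or 'A, B, and C'."""
    if len(names) <= 1:
        return "".join(names)
    if len(names) == 2:
        return " and ".join(names)
    return ", ".join(names[:-1]) + ", and " + names[-1]


def format_citation(creators: List[str], pub_year: str, title: str,
                    publisher: str, package_id: str) -> str:
    """Assemble 'Creator(s). pubDate. Title. Publisher. packageId.'"""
    parts = [join_names(creators), pub_year, title, publisher, package_id]
    # Skip empty fields, and drop any trailing period from a field before
    # adding the separator so that periods are not doubled.
    return " ".join(part.rstrip(".") + "." for part in parts if part)


print(format_citation(
    ["Dr. Matt Nolan", "Austin S. Post", "William Hauer",
     "Alexander Zinck", "Shad O'Neel"],
    "2017",
    "Photogrammetric scans of aerial photographs of North American "
    "glaciers, 1975. Roll 2 jpegs",
    "Arctic Data Center",
    "doi:10.18739/A2H21W",
))
```

Since every field comes from the existing bibliographic elements, a separately stored canonical citation string could drift out of sync with them, which is the redundancy concern raised here.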

Maybe we should call the element usageCitation rather than just citation to make the intent of the field clearer.

cboettig commented 7 years ago

I agree with all of the previous comments that it seems somewhat backwards for the dataset metadata to provide a record of what papers have cited it. It's hard to imagine this being up-to-date as a metric of who has cited the data, or particularly useful to someone else using the data.

I also agree that the notice Dryad pastes on every dataset:

When using this data, please cite the original publication:

Morales MA, Zink AG (2017) Mechanisms of aggregation in an ant-tended treehopper: Attraction to mutualists is balanced by conspecific competition. PLOS ONE 12(7): e0181429. http://dx.doi.org/10.1371/journal.pone.0181429

Additionally, please cite the Dryad data package:

Morales MA, Zink AG (2017) Data from: Mechanisms of aggregation in an ant-tended treehopper: attraction to mutualists is balanced by conspecific competition. Dryad Digital Repository. http://dx.doi.org/10.5061/dryad.6pt0m

is redundant (particularly given the Dryad mechanism for titles for data!), and if someone is using the data it would seem logical to just cite the data. (Or arguably, just cite the paper, which has the citation to the data inside it).

Yet despite embracing a very minimal metadata model elsewhere, Dryad clearly thought this concept was practically important to the community it wanted to serve, redundant or not. To the extent that this approach creates confusion, that confusion is already here whether or not we can express it in metadata. It would seem for some fraction of the community, "please cite X paper" is a meaningful bit of metadata, just as citing for purposes of attribution rather than provenance (e.g. citing 'the original' paper everyone else has also cited, even if you're only familiar with its contents from other papers you've read) is a recognized norm.

okay </rant>, sorry

mbjones commented 7 years ago

@cboettig I agree that there is no way for the data usage citation list to remain complete, but the ones that are listed would be definitive examples of usage. The request for the data usage citation list is directly from ESA, who want to require such a list as part of a submission to their revamped journal for data papers, as described in #269. Their idea is that a data set should be demonstrably usable and used before it is published as a data paper, and this field provides the evidence of that. So, I think it's important to include it.

I understand where you are coming from on the "please cite X paper" redundancy. It's not in general where I think the community of practice is heading, despite the Dryad example. But let's see if we can get other people to weigh in on whether an additional canonicalCitation field should be added (this discussion should really be in another ticket so as to not confuse the two requests for citations).

mobb commented 7 years ago

I think I would like to know generally how publishers/repositories/libraries record the relationships (between papers that use data and the data itself) before I comment. I think including the element at a high level (eml/dataset) specifically for ESA's use would be a mistake. A group of us at ESIP this week may have a chance to talk about it.

mobb commented 6 years ago

Adding a single element for relating a published paper has one use case (see example 1 below). It works well with datasets where there is a direct correspondence to a research paper. If that is what is intended, best to say so up front.

Further, the documentation should state what type of relationship is expected to be mapped from this node, e.g., I believe the simplest one is isCitedBy: https://schema.datacite.org/meta/kernel-4.1/doc/DataCite-MetadataKernel_v4.1.pdf (p. 26). The EML documentation cannot be the least bit obtuse; that will invite misuse.

Even with that use case, requiring the EML constructor to build an entire citation tree may be hard to defend, when other biblio formats are easier to create, and the simplest representation (the paper's DOI) is not required by the EML citation schema.

For dataset management strategies that are not tightly coupled with research (i.e., independent pathways for data and research papers, typical of large research groups like LTER sites), this element will not work for most associations (example 2, below). Those are better done externally. Mainly, it’s a chicken-egg problem: if metadata are immutable, it is impossible to even add this element post-hoc without generating a new DOI and destroying the original linkage from the paper.

Examples:

1. Here is a dataset that could have used this element, had it existed: https://portal.edirepository.org/nis/mapbrowse?scope=knb-lter-sbc&identifier=100 The reference to the paper it supports is in the abstract. We waited till the last minute so that we could include the paper’s DOI.
2. Here is a paper that cites SBC LTER datasets by their DOIs, with the independent pathways for data and research described above. The dataset metadata is immutable, so cannot be augmented with a paper reference. http://dx.doi.org/10.1038/ncomms13757
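
For comparison, the DOI-only representation mentioned above can be sketched with a DataCite-style relatedIdentifier, where the relation type is spelled `IsCitedBy` and the identifier type is `DOI`. This is an illustration only: the helper name `is_cited_by` is hypothetical, the element is built without the DataCite namespace, and nothing in this thread says EML will map to it this way. The DOI is the paper from the second example.

```python
import xml.etree.ElementTree as ET


def is_cited_by(paper_doi: str) -> ET.Element:
    """Build a DataCite-style relatedIdentifier: 'this dataset is cited by paper_doi'."""
    element = ET.Element(
        "relatedIdentifier",
        {"relatedIdentifierType": "DOI", "relationType": "IsCitedBy"},
    )
    # The paper's DOI is the only bibliographic detail that is recorded.
    element.text = paper_doi
    return element


link = is_cited_by("10.1038/ncomms13757")
print(ET.tostring(link, encoding="unicode"))
```

A record like this carries far less than a full EML citation tree, which is the trade-off being weighed: it is easy to create, but it still has to live somewhere that can be updated after the dataset metadata is frozen.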

cboettig commented 6 years ago

I agree with @mobb that this is a very specific use case that should be clearly identified, and that this is not really useful (or even compatible) with data management workflows that are not tightly coupled with particular research. I also agree that EML makes a somewhat cumbersome bibliographic platform, particularly without support for DOIs (personally I'd love to see EML match DataCite schema.org descriptions, guess this could be done via the semantics extension, but that's neither here nor there).

I don't think isCitedBy is nearly specific enough to distinguish between a related paper that contains fundamental metadata about how and why the data were collected (i.e. the use case here), and any other of the myriad reasons one might cite the data (which sounds like way too dynamic a notion to be of any use, as basically everyone in this thread has already said). Of course this kind of information could belong in the EML itself (particularly in the appropriate Methods section, abstract, etc), but for many records -- e.g. all of Dryad -- it isn't, and researchers may feel they have good reasons to prefer to put that material in a "paper", just as EML allows them to keep all of the raw "data" files as separate files which it merely describes in a hands-off, neutral manner, without any attempt to insist those files are in some normalized or standard format. Maybe it would be more explicit to introduce such an element as an external resource that is part of methods (which can pretty much already be done in EML, if somewhat opaquely).

mbjones commented 6 years ago

@cboettig wrote:

I don't think isCitedBy is nearly specific enough to distinguish between a related paper that contains fundamental metadata about how and why the data were collected (i.e. the use case here)

That is not the use case we are trying to support here. That idea of a referencePublication is a use case covered in issue #277. Let's please discuss issues around 'related' and 'reference' publications, such as how Dryad lists a reference publication, in that issue, and not here.

This ticket is to discuss usageCitation, which is explicitly intended to allow a non-comprehensive list of citations in which the data were explicitly used. There should be no ambiguity about the semantics of the field, in that only works in which the data were actually used should be included as a usageCitation.

Also, I agree that this list will never be comprehensive. Groups like Make Data Count are working on building services to collate lists of citations to data sets. We will always need external services like that. However, it's also reasonable to allow a dataset author to explicitly indicate one or more usageCitation examples, especially given that they were likely the first to use the data and the papers probably have particular relevance to understanding the data set. The ESA committee focused on data citation thinks this is critical metadata to understand how to use a data set, and so I am strongly inclined to include this element to facilitate that community application of EML.