FAIR: New definition of authorship in MassBank Part 1: Enhance the presentation of author names in MassBank records to achieve a machine-readable format #194

Open meowcat opened 4 years ago

meowcat commented 4 years ago

Could we devise a better specification for the AUTHOR field in the MassBank record? Currently:

Authors and Affiliations of MassBank Record. Mandatory Example: AUTHORS: Akimoto N, Grad Sch Pharm Sci, Kyoto Univ and Maoka T, Res Inst Prod Dev.

In particular, can we specify who should be in the author list, or differentiate who contributed the data and who made the records?

In "Eawag additional specs", we have situations where the record creator and uploader is not an author on the PUBLICATION. (example) My understanding is that we just added the record creator to a subset of the paper authors, somewhere between first and last author. I don't see this as a really clear and transparent solution, since as the record creator I wouldn't want to steal authority from the actual paper authors.

On the other hand, for the MetaboLights records (example), which I created from publicly available data, I am not listed at all. This is also not ideal, since the listed AUTHORS may not even know about this record existing and should not be held responsible for problems in it, e.g. if processing went wrong.

I suggest allowing the use of MARC relator terms, for example [dtc] for where the data comes from and [com] for who made the record.

This is how the terms are specifically defined for R packages:

 Data contributor [dtc]
    A person or organization that submits data for inclusion in a database or other collection of data 
 Compiler [com]
    A person, family, or organization responsible for creating a new work (e.g., a bibliography, a directory) through the act of compilation, e.g., selecting, arranging, aggregating, and editing data, information, etc 

(This is already a loose application of [dtc] in the MetaboLights case since they did not actively submit the data; but it's the one that makes most sense, it seems)
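As a rough illustration of how such inline tags could be read back by a tool, here is a minimal Python sketch. The syntax it assumes (a three-letter relator code in square brackets after a name, entries separated by commas) is hypothetical and not part of the MassBank record format, and the names in the example are made up.

```python
import re

# Assumption: a MARC relator code is written as three lowercase letters in
# square brackets directly after a name, e.g. "Doe J [com]".
ROLE_RE = re.compile(r"\[([a-z]{3})\]")


def parse_tagged_authors(value):
    """Split an AUTHORS value on commas and collect relator codes per entry.

    Returns a list of (text, roles) tuples. Entries without a tag, such as
    an affiliation, get an empty role list.
    """
    entries = []
    for part in value.split(","):
        part = part.strip()
        if not part:
            continue
        roles = ROLE_RE.findall(part)
        text = ROLE_RE.sub("", part).strip()
        entries.append((text, roles))
    return entries


# Made-up example value, for illustration only.
example = "Doe J [com], Roe R [dtc], Example Institute"
for text, roles in parse_tagged_authors(example):
    print(text, roles)
```

Because untagged entries are simply returned with no roles, existing AUTHORS values would pass through such a parser unchanged.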

There is also e.g.

 Annotator [ann]
    A person who makes manuscript annotations on an item
 Curator [cur]
    A person, family, or organization conceiving, aggregating, and/or organizing an exhibition, collection, or other item 
 Abridger [abr]
    A person, family, or organization contributing to a resource by shortening or condensing the original work but leaving the nature and content of the original work substantially unchanged. For substantial modifications that result in the creation of a new work, see author 

In R usage, [cre] is the package maintainer, but the MARC definition is

 Creator [cre]
    A person or organization responsible for the intellectual or artistic content of a resource 

so I would leave this out probably...

meowcat commented 4 years ago

Another one:

 Editor [edt]
    A person, family, or organization contributing to a resource by revising or elucidating the content, e.g., adding an introduction, notes, or other critical matter. An editor may also prepare a resource for production, publication, or distribution. For major revisions, adaptations, etc., that substantially change the nature and content of the original work, resulting in a new work, see author 
meowcat commented 4 years ago

Last one, then I leave you alone

 Metadata contact [mdc]
    A person or organization primarily responsible for compiling and maintaining the original description of a metadata set (e.g., geospatial metadata set) 
tsufz commented 4 years ago

I agree with @meowcat.

meowcat commented 4 years ago

Should I propose a PR to the record format description, proposing the use of [dtc] and [com] if appropriate?

meier-rene commented 4 years ago

You are welcome to propose a PR for the record format description. But, if I understood it correctly, this will be a change that breaks compatibility. This means it will take some time until it's included in all needed places. With the formal description I can implement it in the codebase. Once the code is working, I need to convert all existing data to the new format as well. Finally, we need to incorporate the changes into RMassBank, which I can't do.

meowcat commented 4 years ago

I don't believe this would break compatibility; it would merely explicitly allow putting MARC relator tags after names. Right now, there is actually no specification at all for how the author names should be presented. So this might require a small adaptation in the validator, but otherwise I don't think a change is required...

sneumann commented 4 years ago

Do we have a pointer on how R package descriptions do that? There is a machine-readable Authors@R: c(person(given = "John Doe", email = "...", role=c("cre"), ...) and the "compiled" version Author: Jon Doe, Jane Austen, ... we have in I'd like to not change our AUTHOR, but add a machine-readable (and compile/renderable) extended version. Yours, Steffen

meier-rene commented 4 years ago

@meowcat Then I probably misunderstood your proposed extension/changes. Could you please give an example?

meowcat commented 4 years ago

@sneumann: I would argue that there is currently no syntax that would be changed, since there is a variety of formats used for the AUTHORS field, with more or less the same info in slightly different iterations:

    AUTHORS: Stravs M, Schymanski E, Singer H, Department of Environmental Chemistry, Eawag
    AUTHORS: KOGA M, UNIV. OF OCCUPATIONAL AND ENVIRONMENTAL HEALTH
    AUTHORS: Markus Kohlhoff, Natural Product Chemistry Lab (CPqRR/FIOCRUZ, Brazil)
    AUTHORS: Matsuura F, Ohta M, Kittaka M, Faculty of Life Science and Biotechnology, Fukuyama University
    AUTHORS: Tobias Schulze, Hubert Schupke, Martin Krauss, Department of Effect-Directed Analysis, Helmholtz Centre for Environmental Research GmbH - UFZ, Leipzig, Germany
    AUTHORS: Mark Earll, Stephan Beisken, EMBL-EBI
    AUTHORS: Krauss M, Schymanski EL, Weidauer C, Schupke H, UFZ and Eawag
    AUTHORS: Nikiforos Alygizakis, Anna Bletsou, Nikolaos Thomaidis, University of Athens
    AUTHORS: C. Gallampois (Umea), E. Schymanski (Eawag), W. Brack (UFZ)
    AUTHORS: Cuthbertson DJ, Johnson SR, Lange BM, Institute of Biological Chemistry, Washington State University
    AUTHORS: Evans A M, Mitchell M, DeHaven C D, Barrett T, Milgram E, Metabolon Inc.
    AUTHORS: Ales Svatos, Ravi Kumar Maddula, MPI for Chemical Ecology, Jena, Germany
    AUTHORS: Nils Hoffmann, Dominik Kopczynski, Bing Peng
    AUTHORS: Parejo I, et al.
    AUTHORS: K.A. Wilkinson & S.N. Miranda

Different order of first and last name, use of brackets, spelling of double initials, use of punctuation, different specifications for the institutes both in format and detail etc. So whatever anyone chooses to put in their AUTHORS doesn't really contradict any existing format or rule. I could go on endlessly; to my dismay no one is apparently using the semicolon, which I would have wanted since I find it convenient.

My current PR, as a basis for discussion, is mostly a suggestion for how people might want to specify authorship, since this would fit in with any scheme people are currently using, and would be neither more nor less machine-readable than before.

I agree that a thought-out new version of AUTHORS (or a substitute) should be machine-readable.

(Ideally we would incorporate ORCID.)

meowcat commented 4 years ago

Ah, I found a few more interesting schemes. Including my semicolon! Note that many of these are actually by "us" as in "the people discussing here".

    AUTHORS: S. Neumann: IPB-Halle, Germany & E. Schymanski: Eawag, Switzerland
    AUTHORS: E. Schymanski; retrieved from M. Castillo et al. 2000
    AUTHORS: Chandler, C. and Habig, J. Boise State University
    AUTHORS: Plant Biology, The Noble Foundation, Ardmore, OK, US/Dennis Fine, Daniel Wherritt, and Lloyd Sumner

Institute first, and even a slash!

This is not meant to criticise any of the formats that were used, only pointing out the complete absence of anything systematic, even among high-quality and involved contributors.
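To make the lack of a shared convention concrete, the following Python sketch tallies which separator characters appear in a handful of the AUTHORS values quoted in this thread. The choice of characters to look for is an assumption made for illustration, and the sketch only counts occurrences; it does not attempt to parse the values.

```python
from collections import Counter

# A few AUTHORS values copied from the examples in this thread.
samples = [
    "Stravs M, Schymanski E, Singer H, Department of Environmental Chemistry, Eawag",
    "C. Gallampois (Umea), E. Schymanski (Eawag), W. Brack (UFZ)",
    "K.A. Wilkinson & S.N. Miranda",
    "S. Neumann: IPB-Halle, Germany & E. Schymanski: Eawag, Switzerland",
    "E. Schymanski; retrieved from M. Castillo et al. 2000",
]

# Assumption: these are the characters worth tracking as possible delimiters.
SEPARATORS = [",", ";", "&", ":", "/", "("]


def separator_usage(values):
    """Count, per separator, how many values contain it at least once."""
    usage = Counter()
    for value in values:
        for sep in SEPARATORS:
            if sep in value:
                usage[sep] += 1
    return usage


usage = separator_usage(samples)
for sep in SEPARATORS:
    print(repr(sep), usage[sep])
```

Even in this small sample, no single delimiter is present in every value, which is why a simple split on one character cannot recover the individual authors reliably.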

schymane commented 4 years ago

Yes, I totally agree it's a good time to start using some conventions; @MaliRemorker and I were discussing how to write the author statement for the hopefully soon-to-be-coming LCSB records, and he's looking into your suggestions on our side. I agree with @sneumann that we should retain a plain text AUTHOR field (but add some recommendations for use into the documentation to avoid this in the future) and add a machine-readable one as an extra, to retain backwards compatibility and ease of use for users. We would then have to decide with @meier-rene whether we standardize this information in already-existing records, at least in the plain text field?

ORCIDs ... maybe a separate field?

meowcat commented 4 years ago

While we are on the topic of updating the record format #200 , any further input on this one? #195

tsufz commented 4 years ago

@meowcat, well I integrated your commits in #195 into my file.

tsufz commented 4 years ago

However, for the next Record Format release we should think again about authorship, ORCID ... This should also be mapped into RMassBank and MassBank-web before we update the record format. @meowcat and @sneumann, could you take the lead on this action?

tsufz commented 3 years ago

Well, coming back to this discussion: we should sort out what the final goal of the changes to the name schema in MassBank is. As shown above, this would need major curation efforts in order to translate the existing author lists to the new schema. For me, a major goal is to make the datasets FAIR and searchable; a minor goal is the visibility of contributions. The first is a must, the latter is a nice-to-have and maybe a provision for future developments.

So far, the MassBank author is a similar representation to the creator in Bioschemas or Google. The creator consists of a Person and an Organization, including possible identifiers such as ORCID etc.

An example scheme of the author in schemas is:

"creator": [
        {
                "@type": "Person",
                "sameAs": "",
                "givenName": "Jane",
                "familyName": "Foo",
                "name": "Jane Foo"
        },
        {
                "@type": "Person",
                "sameAs": "",
                "givenName": "Jo",
                "familyName": "Bar",
                "name": "Jo Bar"
        },
        {
                "@type": "Organization",
                "sameAs": "",
                "name": "Fictitious Research Consortium"
        }
]
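A creator list like the one above could be assembled programmatically rather than written by hand. The following Python sketch is only an illustration: the helper names `person` and `organization` are hypothetical, and how such a list would be stored in or derived from a MassBank record is not decided in this thread.

```python
import json


def person(given_name, family_name, same_as=""):
    """Build a schema.org-style Person entry (same_as could hold an ORCID URL)."""
    return {
        "@type": "Person",
        "sameAs": same_as,
        "givenName": given_name,
        "familyName": family_name,
        "name": f"{given_name} {family_name}",
    }


def organization(name, same_as=""):
    """Build a schema.org-style Organization entry."""
    return {"@type": "Organization", "sameAs": same_as, "name": name}


# Same fictitious people and organization as in the example above.
creator = [
    person("Jane", "Foo"),
    person("Jo", "Bar"),
    organization("Fictitious Research Consortium"),
]

print(json.dumps({"creator": creator}, indent=2))
```

Keeping given and family names as separate fields avoids the name-order ambiguity seen in the free-text AUTHORS values.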

However, let us dig deeper into the Dataset scheme. Everything we want to integrate is already there:

Furthermore, we should consider whether to add MassBank as a dataset editor, since we curate data, impute it where it is missing, and so on. We do not mark that at all. We should do so for the sake of transparency.

IMHO, we should strictly follow the schemas types and not add a list of new (unrecognized) types as mentioned by @meowcat. For me, a data provider is a contributor or even an author of a dataset. Without the raw mass spec data provided, the MassBank record would not exist at all; that is the point. A compiler or curator may also fall into the role of creator, maintainer, editor...

I also don't think it is useful to add contact persons, emails etc. The affiliations and roles of people can change quickly and the data will become outdated. Furthermore, this would require a data agreement with all contributors. With respect to curation efforts and data minimization, I am strictly against the integration of too much personal data.

In most cases, the contributions are related to single (PhD) projects without follow-up. This will hopefully change with NFDI4Chem. NFDI4Chem is also the framework in which we will think about the necessary granularity of roles and their implementation in Schemas etc.

I suggest the following roadmap for the implementation of the new authorship model in MassBank:

  1. Draft and discuss a new version of the MassBank Record format including a Schemas compliant creator entity.
  2. Implement and test the new creator entity in RMassBank
  3. Implement and test the new creator entity in MassBank #287
meowcat commented 3 years ago

I unfortunately don't understand enough about how these Schemas work together, i.e. what kind of change needs to be made in the MassBank format to make this work. But clearly, using the MARC relator tags was somewhat of a quick fix that could be done inline with minimal (or really, no) modifications. If there is a good way to make a Schemas-compliant tag, I'm all for it.

I'm more worried (like @meier-rene in #292, if I understand correctly) about how well the generally simple, line-based record format of MassBank is suited for changes towards more highly structured content, or what we have to do to make this easier.

tsufz commented 3 years ago

MARC is a librarian standard, not an internet standard. Schema is a "project to create, maintain, and promote schemas for structured data on the Internet, on web pages, in email messages, and beyond" (see We are data providers and not publishers. Furthermore, I still think that the MARC categories are too broad. We need to provide simple categories and not too many. Most people will not care about that anyway and are quickly overloaded with metadata requirements. Furthermore, harvesting web services (Google, Wikidata, etc.) will not care about MARC standards, because they follow and expect the Schema format. Our format is for example:

[
  {
    "identifier": "AC000372",
    "url": "",
    "name": "15-Acetyldeoxynivalenol",
    "alternateName": ["15-Acetyldeoxynivalenol", "15-monoacetyldeoxynivalenol"],
    "description": "This MassBank record with Accession AC000372 contains the MS2 mass spectrum of '15-Acetyldeoxynivalenol'.",
    "molecularFormula": "C17H22O7",
    "monoisotopicMolecularWeight": "338.13654",
    "inChI": "InChI=1S/C17H22O7/c1-8-4-11-16(6-22-9(2)18,13(21)12(8)20)15(3)5-10(19)14(24-11)17(15)7-23-17/h4,10-11,13-14,19,21H,5-7H2,1-3H3/t10-,11-,13-,14-,15-,16-,17+/m1/s1",
    "smiles": "CC1=C[C@@H]2[C@]([C@@H](C1=O)O)([C@]3(C[C@H]([C@H]([C@@]34CO4)O2)O)C)COC(=O)C",
    "@context": "",
    "@type": "MolecularEntity"
  },
  {
    "identifier": "AC000372",
    "url": "",
    "headline": "15-Acetyldeoxynivalenol; LC-APCI-ITFT; MS2; CE: 10; R=17500; [M+H]+",
    "name": "15-Acetyldeoxynivalenol",
    "description": "This MassBank record with Accession AC000372 contains the MS2 mass spectrum of '15-Acetyldeoxynivalenol'.",
    "measurementTechnique": "mass spectrometry",
    "datePublished": "2017-07-07",
    "license": "",
    "citation": "null",
    "comment": "CONFIDENCE Reference Standard (Level 1)",
    "alternateName": ["15-Acetyldeoxynivalenol", "15-monoacetyldeoxynivalenol"],
    "@context": "",
    "@type": "Dataset"
  }
]
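Note that the Dataset entry above carries no creator yet. As a hedged sketch of where a creator list could be attached, the Python snippet below adds one to a trimmed copy of that entry. The helper `with_creator` is hypothetical, only a few of the fields are reproduced, and the creator shown is the fictitious organization used earlier in this thread.

```python
import json

# Trimmed copy of the Dataset entry shown above (only a few fields kept).
dataset = {
    "identifier": "AC000372",
    "name": "15-Acetyldeoxynivalenol",
    "measurementTechnique": "mass spectrometry",
    "@type": "Dataset",
}


def with_creator(record, creators):
    """Return a copy of a Dataset entry with a 'creator' list attached.

    The input record is left unchanged. In this sketch, anything that is
    not typed as a Dataset is rejected.
    """
    if record.get("@type") != "Dataset":
        raise ValueError("this sketch only attaches creators to Dataset entries")
    updated = dict(record)
    updated["creator"] = list(creators)
    return updated


# Fictitious creator, for illustration only.
creators = [{"@type": "Organization", "sameAs": "", "name": "Fictitious Research Consortium"}]

print(json.dumps(with_creator(dataset, creators), indent=2))
```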

Best, Tobias