comments on pre-print - Githubissues

Jens Kattge • July 2018 on biorxiv.org

I have read the preprint with great interest and discussed it with the Steering Committee of the TRY Plant Trait Database (www.try-db.org). We decided to provide a concerted comment, because we are concerned that this preprint has the potential to cause considerable confusion within the plant trait community.

We very much appreciate the proposed movement towards a more standardized representation of trait data in ecology. We especially appreciate the two principles you suggest for trait measurements: (1) keeping the original information while adding standardized terms; and (2) using an agreed vocabulary, i.e. standardized trait definitions, whenever possible.

However, we are afraid that the organization of metadata in the two extensions for occurrence and measurement may lead to a loss of information and to data structures that are neither easily machine-readable nor easy to integrate into larger databases. We see two problems: (1) It seems that the principle of keeping original terms and data while adding standardized ones is not explicitly foreseen in the extensions: original data seem to be replaced by standardized ones. If there are mistakes in the standardization, original data cannot be reproduced. (2) The extensions are two-dimensional tables with metadata in columns. Plant trait data are often accompanied by a substantial amount of metadata to characterize plant growth and measurement conditions. In these cases data providers would either have to fit their metadata to terms that exist in the standard but may not fit 100%, or skip part of the metadata, or add new columns to the tables. In the first two cases some of the original information would be modified or lost, which is a problem for data integrity. The last case would result in dataset-specific extensions with an unsupervised, unstructured and potentially large number of columns. In combination with the organization of trait data along observations, we are afraid that the resulting datasets may not be easily machine-readable and thus would be difficult to integrate into larger databases.

An additional disadvantage of the proposed standard is that the original measurement tables need to be transformed into the standard structure. Even if an R tool provides support, this transformation is an additional barrier to making data publicly available. There is wide consensus that such barriers should be avoided.

In the context of plant trait data we strongly support the use of agreed vocabularies, but emphasize the importance of always providing the original terms and values for traits and metadata. According to our experience, common formats of plant trait datasets can easily be imported into an integrated database like TRY. We thus recommend publication of datasets without transformation to the suggested standard format.

In TRY, however, we represent measurements of traits and metadata in a long-table format, always keeping the original values while adding standardized terms (see Kattge et al. 2011, Methods in Ecology and Evolution).
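As a rough illustration of the long-table idea described here (not TRY's actual schema), the sketch below melts one wide row into long records that keep the original term, value and unit, and adds standardized fields only where a mapping exists. The field names, the `STANDARD_TERMS` lookup and the example values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical lookup: (original term, original unit) ->
# (standard term, standard unit, conversion factor)
STANDARD_TERMS = {
    ("Leaf area", "cm2"): ("leaf_area", "mm2", 100.0),
    ("Height", "cm"): ("plant_height", "m", 0.01),
}


@dataclass
class LongRecord:
    observation_id: str
    orig_name: str   # term exactly as supplied by the data provider
    orig_value: str  # value exactly as supplied, kept as text
    orig_unit: str
    std_name: Optional[str] = None
    std_value: Optional[float] = None
    std_unit: Optional[str] = None


def to_long(wide_row, id_column, units):
    """Turn one wide row (a dict) into a list of long-format records."""
    records = []
    for column, value in wide_row.items():
        if column == id_column:
            continue
        unit = units.get(column, "")
        rec = LongRecord(str(wide_row[id_column]), column, str(value), unit)
        mapping = STANDARD_TERMS.get((column, unit))
        if mapping is not None:
            # Standardized fields are added next to the originals, never instead of them.
            rec.std_name, rec.std_unit, factor = mapping
            rec.std_value = float(value) * factor
        records.append(rec)
    return records


row = {"ID": "obs1", "Leaf area": "12.5", "Height": "80", "Soil pH": "6.4"}
units = {"Leaf area": "cm2", "Height": "cm"}
records = to_long(row, "ID", units)
```

A metadata column without a standard term (here "Soil pH") still yields a record; its standardized fields simply stay empty, so no extra table column is needed and nothing is dropped.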

The preprint lacks a chapter ‘Discussion’, which would allow discussing advantages and disadvantages of the suggested ecological trait-data standard.

The Steering Committee of the TRY initiative

Jens Kattge Sandra Díaz Sandra Lavorel Colin Prentice Paul Leadley Gerhard Boenisch Christian Wirth

our reply, August 2018

flo schneider Jens Kattge • 10 days ago

Dear Jens and the TRY Steering Committee,

We very much appreciate receiving feedback on our preprint from you. Our intention in posting a pre-print of this paper was to prompt qualified criticism and engage a wider community of trait-based researchers before releasing an initial version (v1.0) of what could be considered a consensus standard vocabulary. In revising the manuscript and the trait-data standard, we are going to address these comments. Specifically, we will clarify our intentions and, more broadly, add a new section in which we discuss the advantages and disadvantages of a standard vocabulary for trait data.

We agree that, above all, trait data should be published, regardless of format. Our proposal addresses the handling of a rapidly increasing number of trait datasheets that are published in non-standardised, open-access file-hosting services. The heterogeneity of their format makes re-using data from various sources very time-consuming and, if this situation remains unresolved, would inhibit large-scale synthesis research on organism traits. While for plant traits and a few other taxa these issues have been largely resolved, thanks to the effort of curated databases such as TRY, data heterogeneity remains an impediment in all other organismic groups which lack such infrastructure. In our article we propose incentivising the use of data standards and ontologies in distributed publications, e.g. by providing an R-package, to enable these data to find their way into a semantic web and to be used in future analyses.

Concerning your specific comments: We, too, want to avoid building barriers and do not demand a specific data structure or set constraints over the number of data columns. We discuss two-dimensional data tables for providing trait co-variates/metadata (implicitly with a variable number of columns) simply because it is most common practice in standalone data publications. The use of our standard vocabulary does not preclude the use of additional, non-standard columns when publishing primary data, e.g. to keep the original data or to add further, unforeseen co-variates. But to support the preservation of these data, we will consider implementing explicit terms for user-defined/original values along with standardized values throughout the extensions (as we already do for the core terms).

We hope to maintain this open exchange and welcome any suggestions on improving compatibility with existing trait-data initiatives. We again want to stress that the current vocabulary is open to submissions of new and refinements of existing terms, to cover practices in specific research communities (e.g. plant-trait growth and measurement co-variates).

Best regards,

Florian D. Schneider, Malte Jochum, Gaëtane Le Provost, Andreas Ostrowski, Caterina Penone, David Fichtmüller, Anton Güntsch, Martin M. Gossner, Birgitta König-Ries, Pete Manning, Nadja K. Simons

Pete brought the following notes back from the OpenTraits workshop

Dear Flo et al.

These notes are a combination of my own thoughts while at the Open Traits meeting and comments made. Please excuse me if some of this is covered in the paper - from what I remember it is but it’s a while since I read it and data science is not really my area. Even if so it might be good to check the clarity of the argument on these points.

The paper would do well to have a discussion section that presents a few caveats

Merging of merged datasets: this could be dangerous. The method, although standard, is not ‘idiot-proof’, as it requires some understanding of the vocabulary and of the biology and methodology of the traits, and some merged datasets it produces could be of low quality if the creator is not an expert in the traits compiled. It might also be possible that data from the same source could be repeated/doubled if not properly labelled by the users when these merged sets are combined.
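The doubling risk mentioned above can be sketched as follows, assuming every record carries a label for its original source plus a measurement identifier that is stable within that source. The field names `source_id` and `measurement_id` are hypothetical, not terms from the standard or the package.

```python
def combine_merged_sets(merged_sets):
    """Combine already-merged datasets without counting a source record twice.

    Returns (combined, unlabelled): records missing a source label cannot be
    checked for duplication, so they are set aside for manual review.
    """
    seen = set()
    combined = []
    unlabelled = []
    for dataset in merged_sets:
        for rec in dataset:
            source = rec.get("source_id")
            if not source:
                unlabelled.append(rec)
                continue
            key = (source, rec["measurement_id"])
            if key in seen:
                continue  # same measurement already arrived via another merged set
            seen.add(key)
            combined.append(rec)
    return combined, unlabelled


set_a = [
    {"source_id": "study1", "measurement_id": 1, "value": 3.2},
    {"source_id": "study2", "measurement_id": 1, "value": 5.0},
]
set_b = [
    {"source_id": "study1", "measurement_id": 1, "value": 3.2},  # repeat of set_a
    {"source_id": "", "measurement_id": 7, "value": 1.1},        # no source label
]
combined, unlabelled = combine_merged_sets([set_a, set_b])
```

Without a source label the repeat in `set_b` would be indistinguishable from an independent measurement, which is why proper labelling by users matters when merged sets are combined.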

There was some confusion (mostly from Jens) about what the standard was for. It’d be good to say it is best suited (or particularly needed) for taxa where there is no centralized database (i.e. most taxa) and for projects where specific traits not present in existing databases are needed.

There was some concern that some users might not understand the vocabulary, need for ontologies etc. (I must admit I do struggle with this aspect of the work myself). So, anything more that can help here and make it simpler is welcome.

There was also some confusion/discussion about what the 3 different tables were for and why you needed several. My basic understanding, and therefore a basic explanation if correct, is that the first contains elements that are common to all trait datasets, and the latter are project-specific extensions/covariates. Make it clear that these are flexible and can be integrated and exchanged - not really too different to TRY.

Will Pearse (Utah, and a collaborator of mine - good guy) had created a package for integrating trait data. MAD is its current name. NATDB was the previous name. He said it was inferior to yours but might be stronger on a couple of aspects (the reading of metadata etc.). He was happy for you to have his code on this. I would definitely recommend getting in touch with him and bringing him into the team. The other person present who I think would be really good is Josh Madin, again seems a good guy and is well respected in the traits world. Works a lot on corals, so would highlight the work to the marine world https://jmadinlab.org/

In general, a few more co-authors from a wider range of backgrounds would help say this is a community consensus standard and not a BE-specific project.

There was some questioning over whether the use of ‘github only’ is a good idea and if this could limit uptake and modification by the community. A video tutorial was suggested but it was generally argued that most likely users would be using it already and this was not a major issue.

There was curiosity about whether the package would work for their particular case and as a result some examples in the paper might be good. One concern was that it might not work for traits that are in rate units e.g. mm2 per year. Not sure if this is true or not.
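Independent of how the package itself treats such units, a rate like mm2 per year can be converted by handling the numerator and the denominator separately. The following is only a minimal sketch: the unit tables, the `area/time` string format and the 365.25-day year are assumptions made for this example.

```python
# Assumed conversion tables for this sketch only.
AREA_TO_M2 = {"mm2": 1e-6, "cm2": 1e-4, "m2": 1.0}
TIME_TO_DAY = {"day": 1.0, "year": 365.25}  # assumes a mean year of 365.25 days


def convert_rate(value, from_unit, to_unit):
    """Convert an area-per-time rate, e.g. from 'mm2/year' to 'cm2/day'.

    Raises KeyError if a unit is not listed in the tables above.
    """
    from_area, from_time = from_unit.split("/")
    to_area, to_time = to_unit.split("/")
    # The numerator scales directly with the ratio of area units.
    area_factor = AREA_TO_M2[from_area] / AREA_TO_M2[to_area]
    # The denominator scales inversely: a longer time unit means a larger rate.
    time_factor = TIME_TO_DAY[to_time] / TIME_TO_DAY[from_time]
    return value * area_factor * time_factor


per_day = convert_rate(365.25, "mm2/year", "mm2/day")
per_year = convert_rate(1.0, "cm2/day", "mm2/year")
```

A worked example of this kind in the paper, using a real rate trait, would answer the concern directly.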

Overall, my feeling was that the traits community felt your work was totally leading and super timely - so it’d be good to get it out asap - I suggest submitting it almost right away. Avoid TRY people as reviewers, and maybe even argue for them to be avoided when submitting. Small bugs can be ironed out as you go I think.

All the best,

Pete

Caterina brought the following notes back from the Open Traits workshop

Hi,

I finally found my notes from the OpenTraits workshop. I told you most of the things, but here are some more:

as we refer to URIs for species and traits, we can do the same for geographic occurrences, based on coordinates. We could work with the BIEN example. Brian Maitner was the one curating this aspect.
apparently GBIF backbone taxonomy and Catalogue of Life will merge. EOL uses Catalogue already. We might need to migrate too. But I guess this will be tackled by taxize. We can chat with Scott Chamberlain about it. Not urgent anyway.
write a function like "taxize" to put TOP into as.traitdata using API: Brian Maitner was interested in helping (I think I told you but wasn't sure)
we should test the units conversion for complicated units, apparently the plant people measure some weird stuff with unusual units.
we should have a column for the authorities in the taxon extension: some species names are the same for plants/insects/bacteria - good point
in the basisOfRecord factors we have: ‘LivingSpecimen’; ‘PreservedSpecimen’; ‘FossilSpecimen’; ‘literatureData’; ‘traitDatabase’; ‘expertKnowledge’. We also need "modelDerived" for values that come from models (e.g. dispersion kernels) - good point
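The basisOfRecord point in the list above amounts to extending a controlled vocabulary. A minimal sketch of checking values against it, with and without the proposed "modelDerived" level, could look like this; the function name and the example values are hypothetical.

```python
# Levels currently listed for basisOfRecord in the notes above.
BASIS_OF_RECORD = {
    "LivingSpecimen", "PreservedSpecimen", "FossilSpecimen",
    "literatureData", "traitDatabase", "expertKnowledge",
}
# Level proposed at the workshop for values that come from models.
PROPOSED = {"modelDerived"}


def unknown_levels(values, allowed):
    """Return the sorted set of values that are not in the allowed vocabulary."""
    return sorted({v for v in values if v not in allowed})


column = ["LivingSpecimen", "modelDerived", "literature", "expertKnowledge"]
rejected_now = unknown_levels(column, BASIS_OF_RECORD)
rejected_after = unknown_levels(column, BASIS_OF_RECORD | PROPOSED)
```

With the current levels, model-derived values are rejected along with the genuinely misspelled "literature"; after adding the proposed level only the misspelling remains.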

Specifically for the paper:

we should encourage people that use new column names to use the Darwin Core structure to define them
put more emphasis on the fact that it's a living document (on Git)
add a table of pros/cons of different structures and justify more why we chose that one - or just insist more on the fact that our structure is compatible with others (e.g. TRY, CoralTraits).
several people suggested adding some examples of cases with simple and complicated data. We could use Nadja's (or Fons') datasets as a simple case and maybe some ant data from Heike or Dani's that has multiple leaves per tree, multiple seasons - I think it's a good idea and we can add that in a supplementary material if we don't have enough space in the main text
someone proposed to make a video - but I have zero skills for this and not really the time either

Let me know if something is not clear or if you need help for something.

cheers

caterina

fdschneider / bexis_traits

comments on pre-print #28
