NCEAS / metacatui

MetacatUI: A client-side web interface for DataONE data repositories
https://nceas.github.io/metacatui
Apache License 2.0

Add taxonId fields to the taxonomic editor #2101

Open robyngit opened 1 year ago

robyngit commented 1 year ago

EML 2.2 provides a taxonId field. A taxon ID is an identifier for a taxon from an authority. For example, AphiaID is the unique identifier of WoRMS; TSN is the unique identifier of ITIS. It is important that we provide a space to enter these IDs in the editor, since taxonomic names do not uniquely identify a taxon.

The field is repeatable and contains subfields for both the identifier and the URI of the taxon database (such as ITIS), so you can list multiple taxonId values from different data systems where that makes sense (being cautious not to equate different taxa).
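
As a rough illustration of the structure described above, here is a minimal Python sketch that serializes one taxonomic classification with two taxonId entries. It assumes, based on my reading of the EML 2.2 schema, that the database URI is carried in a `provider` attribute on `taxonId`. The helper name `build_classification`, the provider URIs, and the identifier values are placeholders for illustration only, not real TSN or NatureServe IDs.

```python
import xml.etree.ElementTree as ET


def build_classification(rank, name, taxon_ids):
    """Build a taxonomicClassification element with repeatable taxonId children.

    taxon_ids is a list of (provider_uri, identifier) pairs.
    """
    classification = ET.Element("taxonomicClassification")
    ET.SubElement(classification, "taxonRankName").text = rank
    ET.SubElement(classification, "taxonRankValue").text = name
    # One taxonId element per authority; the provider URI is an attribute
    # (assumed from the EML 2.2 schema) and the identifier is the element text.
    for provider_uri, identifier in taxon_ids:
        taxon_id = ET.SubElement(classification, "taxonId", provider=provider_uri)
        taxon_id.text = identifier
    return classification


# Placeholder provider URIs and identifiers, for illustration only.
element = build_classification(
    "Species",
    "Puma concolor",
    [
        ("https://www.itis.gov", "EXAMPLE-TSN"),
        ("https://www.natureserve.org", "EXAMPLE-NATURESERVE-ID"),
    ],
)
xml_text = ET.tostring(element, encoding="unicode")
print(xml_text)
```

In the editor, this would probably translate to a repeatable row with two inputs (provider and ID) under each classification.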


Here is some interesting background from @mbjones about why these identifiers are so important:

Taxonomic names do not uniquely identify a taxon concept. For example, the mountain lion used to be placed in the genus Felis (as Felis concolor), but now is in the genus Puma (as Puma concolor). So the name Felis represents different taxon circumscriptions at different points in time. And these two taxon concepts of the mountain lion are not congruent (the old Felis is not equal to the new Felis, and neither is equal to the new Puma concept). Searching for Felis might miss metadata records labeled with Puma. You can reference both the ITIS and NatureServe concept identifiers for this taxon. Search systems should deal with this, but it is complicated, constantly changing, and hard to do correctly. For background, see these papers:

Kennedy, J.B., Kukla, R., Paterson, T. (2005). Scientific Names Are Ambiguous as Identifiers for Biological Taxa: Their Context and Definition Are Required for Accurate Data Integration. In: Ludäscher, B., Raschid, L. (eds) Data Integration in the Life Sciences. DILS 2005. Lecture Notes in Computer Science, vol 3615. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11530084_8

Remsen D (2016) The use and limits of scientific names in biological informatics. In: Michel E (Ed.) Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. ZooKeys 550: 207–223. https://doi.org/10.3897/zookeys.550.9546

Various groups have worked on taxonomic concept standards that help clarify and resolve naming ambiguity (such as the TDWG Taxon Concept Schema (TCS) standard) by being more explicit about the many-to-many relationships between taxa and names. The whole Linnaean system of re-using names of a holotype for new concepts (e.g., a subset of a previous taxon concept) guarantees that name ambiguity will persist.

Practically speaking, what this means is that references to taxonomic names in a metadata record today will be ambiguous in the future unless further detail is provided about the specific taxon concept that was meant by using that name label. This includes rearrangements of taxa in various parent taxa, which are subject to the same issues. So, our best practices are to: 1) unambiguously reference both a taxon name and the identifier of the taxon concept you are referring to; and 2) only directly reference taxa that you directly identified or determined (and not their parent taxa, which differ according to different classifications through time).
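
To make best practice 1 concrete, here is a hedged Python sketch of a check that flags classifications that carry a name but no taxon concept identifier. The sample XML, the function name `names_without_ids`, and the identifier value are made up for illustration, and the `provider` attribute follows my reading of the EML 2.2 schema. This is not how MetacatUI validates anything today.

```python
import xml.etree.ElementTree as ET

# Hypothetical coverage fragment; the identifier is a placeholder, not a real TSN.
SAMPLE = """
<taxonomicCoverage>
  <taxonomicClassification>
    <taxonRankName>Species</taxonRankName>
    <taxonRankValue>Puma concolor</taxonRankValue>
    <taxonId provider="https://www.itis.gov">EXAMPLE-TSN</taxonId>
  </taxonomicClassification>
  <taxonomicClassification>
    <taxonRankName>Genus</taxonRankName>
    <taxonRankValue>Felis</taxonRankValue>
  </taxonomicClassification>
</taxonomicCoverage>
"""


def names_without_ids(coverage_xml):
    """Return the taxon names whose classification has no taxonId child."""
    root = ET.fromstring(coverage_xml.strip())
    missing = []
    # iter() also visits nested classifications, so parent/child ranks are covered.
    for classification in root.iter("taxonomicClassification"):
        if classification.find("taxonId") is None:
            missing.append(classification.findtext("taxonRankValue", default=""))
    return missing


missing = names_without_ids(SAMPLE)
print(missing)
```

An editor could surface the returned names as a gentle warning rather than a hard validation error, since the field is optional in EML.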

robyngit commented 1 year ago

Related: https://github.com/NCEAS/metacatui/issues/200, https://github.com/NCEAS/metacatui/issues/322

yvanlebras commented 1 year ago

Yes, another important feature! Would it also be possible to offer automatic filling of the taxon ID and name, plus inference of all upper-level taxon IDs and names, as we are doing in EML assembly line / MetaShARK? This could draw on authorities such as ITIS, WoRMS, and GBIF and their related APIs.

robyngit commented 1 year ago

Definitely, @yvanlebras, we would like to have a feature that connects to a taxon database API in the taxon editor (planned here https://github.com/NCEAS/metacatui/issues/2090).

Though it's important to note that for the reasons outlined by Matt above, our best practice recommendation is not to fill in all of the upper level taxon ranks:

only directly reference taxa that you directly identified or determined (and not their parent taxa, which differ according to different classifications through time).

yvanlebras commented 1 year ago

Thank you for your feedback! I understand the point mentioned about including all the upper-level taxon ranks, BUT it seems to me that having them provides much better capabilities to discover / search / find datasets, doesn't it? One way to avoid having them in the metadata document while still being able to search on them would be to have, in metacat/metacatui, some "intelligence" that proposes such upper levels "on the fly"... Just my 1.5 cents ;) Moreover, if we have the upper-level taxon ranks from an authority with a date, we have provenance, so it is not such a big issue: it "just" reflects the knowledge at the time the metadata was generated, no?

robyngit commented 1 year ago

One way to avoid having them in the metadata document while still being able to search on them would be to have, in metacat/metacatui, some "intelligence" that proposes such upper levels "on the fly"...

@yvanlebras you are absolutely correct: in our current interface, filling in only the leaf taxa will limit the findability of datasets. @mbjones and I were very recently discussing the idea of incorporating a taxon-lookup service into the search, similar to what you described. It's been a long-term goal, but we haven't developed it yet. I made a new issue for this. Please feel free to add any and all ideas and feedback you may have!! :)
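
For discussion purposes, here is a minimal Python sketch of the "on the fly" idea: metadata records store only the leaf taxa that were directly identified, and the search expands them to parent ranks at query time. The `LINEAGE` table is a hand-written stand-in for a taxon-lookup service, and the dataset names and the `search` and `matches` functions are hypothetical. Nothing like this exists in metacat/metacatui yet.

```python
# Hand-written lineage table standing in for a taxon-lookup service.
# A real implementation would query an authority at search or index time.
LINEAGE = {
    "Puma concolor": ["Puma", "Felidae", "Carnivora"],
}

# Hypothetical datasets that record only directly identified (leaf) taxa.
DATASETS = {
    "dataset-1": ["Puma concolor"],
    "dataset-2": ["Macrocystis pyrifera"],
}


def matches(query, leaf_taxa):
    """True if the query equals a leaf taxon or one of its looked-up parents."""
    for leaf in leaf_taxa:
        if query == leaf or query in LINEAGE.get(leaf, []):
            return True
    return False


def search(query):
    """Return the sorted names of datasets whose taxa match the query."""
    return sorted(name for name, taxa in DATASETS.items() if matches(query, taxa))


print(search("Felidae"))
print(search("Felis"))
```

Because the lineage lives outside the metadata document, a reclassification only requires updating the lookup, not every affected record. Note that the search for Felis finds nothing here, which is the kind of gap that matching on taxon concept identifiers would need to address.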

Moreover, if we have the upper-level taxon ranks from an authority with a date, we have provenance, so it is not such a big issue: it "just" reflects the knowledge at the time the metadata was generated, no?

This is an interesting point, I wonder if @mbjones has feedback on this one.