cf-convention / cf-conventions


Alignment of new biological taxon standard names (Section 6.1.2) with the biological data standards community #309

Closed albenson-usgs closed 8 months ago

albenson-usgs commented 3 years ago

Title

Alignment of new biological taxon standard names (Section 6.1.2) with the biological data standards community

Moderator

@davidhassell

Moderator Status Review [last updated: YYYY-MM-DD]

Brief comment on current status, update periodically

Technical Proposal Summary

Opportunity to bring CF community and the Biological Data Standards Community (TDWG, https://www.tdwg.org/) into alignment

Benefits

Clarity for data users and data managers that concepts in the CF community are the same as those in the biodiversity information standards community.

Detailed Proposal

Complete proposal

The newly added CF standard names biological_taxon_name and biological_taxon_identifier represent the same information that is currently identified in the Darwin Core biological data standard scientificName and scientificNameID. Since the concepts are the same and these standard names have been in use in the biological data community since 2012, I would like to see an enhancement that CF adopt these existing standard names (changing to follow CF conventions to be scientific_name and scientific_name_id) and link to the Darwin Core standard to promote interoperability between these two communities. Darwin Core is used by the Ocean Biodiversity Information System and the Global Biodiversity Information Facility, among others, and aligning the current CF standard names to existing biological data standards represents an opportunity for these two communities to work together and reduce redundancy throughout biological data systems. Moreover, implementing this enhancement will improve understanding for downstream users or managers of data to be more certain that these concepts are in fact representing the same information.

dblodgett-usgs commented 3 years ago

Hey @albenson-usgs - It's not really clear to me what the specific change to the specification would be. Can you boil your detailed proposal down to a problem / solution?

albenson-usgs commented 3 years ago

Instead of biological_taxon_name -> scientific_name (and link to Darwin Core in the documentation)

Instead of biological_taxon_identifier -> scientific_name_id (and link to Darwin Core in the documentation)

dblodgett-usgs commented 3 years ago

And this is in reference to https://github.com/cf-convention/cf-conventions/blob/master/ch06.adoc#taxon-names-and-identifiers

albenson-usgs commented 3 years ago

@davidhassell would you mind being moderator for this?

davidhassell commented 3 years ago

@albenson-usgs - I'm happy to moderate.

davidhassell commented 3 years ago

Is it right that this proposal is essentially for changing the standard names biological_taxon_name (or biological_taxon_lsid - see cf-convention/vocabularies#46) and biological_taxon_identifier to scientific_name and scientific_name_id respectively?

As CF is applicable to many areas of geoscience, standard names are more self-explanatory than would suffice for any one area because they answer the question, “What does this mean?”, rather than the question, “What do we call this?”. It seems that the use of "scientific" in the proposed names is not very informative in this context, as it doesn't really tell a third party anything about the data.

The existing names ("biological") seem very understandable from a lay perspective (i.e. mine!), and you say that they are not wrong, so I wonder if this change is required?

Perhaps a connection to Darwin Core could be made in the standard name descriptions (which currently mention WoRMS and ITIS) - would that be appropriate?

It would be very useful to hear from others with expertise in the use of this sort of data.

roy-lowry commented 3 years ago

First thing to say is that this ticket is intimately linked to cf-convention/vocabularies#46 which exposed that biological_taxon_identifier was set up in error - it should have been biological_taxon_lsid. The suggested fix is that biological_taxon_identifier be deprecated and aliased to biological_taxon_lsid.

This request resurrects a discussion on Trac that ran for considerable time and takes the position back at the beginning of that discussion. In a nutshell the initial name proposals were criticised as being too parochial for a multidisciplinary standard like CF. Whilst scientificName works in Darwin Core - a biological standard - there's nothing to tell a non-biologist that scientific_name relates to biology.

My suggestion as a compromise would be to include specific references to the Darwin Core labels in the description. For biological_taxon_name I would suggest:

"Biological taxon" is a name or other label identifying an organism or a group of organisms as belonging to a unit of classification in a hierarchical taxonomy. The quantity with standard name biological_taxon_name is the human-readable label for the taxon such as Calanus finmarchicus. The label should be registered in either WoRMS (http://www.marinespecies.org) or ITIS (https://www.itis.gov/) and spelled exactly as registered. See Section 6.1.2 of the CF convention (version 1.8 or later) for information about biological taxon auxiliary coordinate variables. This is equivalent to scientificName in the Darwin Core standard.

I'll do something similar for biological_taxon_lsid when I set that up.

@davidhassell You can compose messages quicker than me but we're saying the same thing!!!

roy-lowry commented 3 years ago

@albenson-usgs A question that's more related to other tickets, but I'll ask it here. How would Darwin Core deal with datasets that include a mixture of true taxa (e.g. Calanus finmarchicus) with functional/morphological groups such as 'prokaryotes', 'nanophytoplankton'? This relates to some concerns I'm having with some other Standard Name tickets (86 and 87). CF is heading down a road of using one convention for taxa (a multi-dimensional storage array with taxon identifiers as a co-ordinate under a single Standard Name) and another for functional/morphological groups (a separate array for each group each with its own Standard Name).

Ticket 86 proposes a further tranche of functional group Standard Names but it includes 'dinoflagellates' (a common name for a Class) and Prochlorococcus (a scientific name for a Genus) mixed in with things like 'cryptophytes' and 'haptophytes'. Would Darwin Core allow entries like 'cryptophytes' under scientificName? If not, how would they be stored under Darwin Core?

roy-lowry commented 3 years ago

@davidhassell How do I link this ticket to Standard Names 86 and 87?

davidhassell commented 3 years ago

Hi Roy - there may be other ways, but it does work if you just put the full URL of the issues in the other repo: cf-convention/vocabularies#29 and https://github.com/cf-convention/vocabularies/issues/105 You can even leave out the "https://github.com/" bit - it will render and link the same.

roy-lowry commented 3 years ago

@davidhassell Many thanks.

albenson-usgs commented 3 years ago

@roy-lowry Yes you can store any scientific name within the taxonomic hierarchy in Darwin Core scientificName and then there are more terms to specify the taxonomy as well as the rank. So for instance you could have:

scientificName = "Calanus finmarchicus"

and then have:

scientificNameID = "urn:lsid:marinespecies.org:taxname:104464"
taxonRank = "species"
kingdom = "Animalia"
phylum = "Arthropoda"
class = "Crustacea"
(and on down to...)
specificEpithet = "finmarchicus"

For something like dinoflagellates it would be:

scientificName = "Dinoflagellata"
[vernacularName](https://dwc.tdwg.org/terms/#dwc:vernacularName) = "dinoflagellates"
scientificNameID = "urn:lsid:marinespecies.org:taxname:146203"
taxonRank = "Infraphylum"
kingdom = "Chromista"
etc

But really you have all you need with just scientificName and scientificNameID since all the rest can be extracted from WoRMS using the LSID. Just showing that you can include a scientific name at any level of the taxonomic hierarchy in Darwin Core and the associated terms. Also GBIF is accepting BOLD and UNITE stable Operational Taxonomic Units (OTUs, eDNA) in scientificName as well (more info on that here). Realize that's more than what you're asking about but just in case it's relevant.
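A minimal sketch of how the identifier in scientificNameID can be taken apart, assuming the LSID follows the usual urn:lsid:authority:namespace:object layout seen in the examples above. The helper name `parse_lsid` and the `Lsid` type are hypothetical and exist only for illustration; the sketch does not contact WoRMS.

```python
from typing import NamedTuple


class Lsid(NamedTuple):
    """Hypothetical container for the parts of an LSID."""
    authority: str
    namespace: str
    object_id: str


def parse_lsid(lsid: str) -> Lsid:
    """Split an LSID of the form urn:lsid:<authority>:<namespace>:<object>.

    An optional trailing revision component, if present, is ignored.
    """
    parts = lsid.split(":")
    if len(parts) < 5 or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    return Lsid(authority=parts[2], namespace=parts[3], object_id=parts[4])


# Values copied from the Darwin Core example in this thread.
record = {
    "scientificName": "Calanus finmarchicus",
    "scientificNameID": "urn:lsid:marinespecies.org:taxname:104464",
}

lsid = parse_lsid(record["scientificNameID"])
print(lsid.authority, lsid.namespace, lsid.object_id)
```

The object part (104464 here) is the piece a client would hand to the registry named by the authority to look up rank and higher classification.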

MathewBiddle commented 3 years ago

I think this example DarwinCore Occurrence data file might help the discussion.

albenson-usgs commented 3 years ago

I totally understand what @roy-lowry and @davidhassell are suggesting about standard names being self-explanatory. What I'm ultimately trying to avoid is that people have to put the same information twice to make it clear that they are following both standards. Let's say we have a data manager that has plankton data and they want to make sure their data are CF compliant but that they can also be ingested by global biological data aggregators. They might feel they would need to implement the CF standard name biological_taxon_name = "Calanus finmarchicus" and then also include the Darwin Core scientificName = "Calanus finmarchicus". Also by adopting the Darwin Core term I'm hoping it would promote synergy and collaboration between the two communities.

roy-lowry commented 3 years ago

@albenson-usgs I TOTALLY understand that a scientificName can be of any taxonomic rank. My question that hasn't been answered is what if the dataset includes things that are morphological groups like 'microphytoplankton'. In my experience it is quite common to have these things mixed together with taxa in observational datasets.

albenson-usgs commented 3 years ago

Fair enough. Apologies I misunderstood your question. If there is no logical taxonomic classification for the grouping then that's definitely harder. There was recently discussion about morphospecies in the TDWG Github. I'm not sure that it's exactly the same but maybe it's analogous enough that it begins to address this? https://github.com/tdwg/dwc-qa/issues/162

roy-lowry commented 3 years ago

@albenson-usgs Thanks, that thread reinforces my perception that 'taxon' comes with a purity and that mixing morphological terms into 'biological_taxon_name' will cause a semantic divergence between CF and Darwin Core that is far more significant than a label like a Standard Name. Conversely, the approach I've taken with other communities like SeaDataNet is to match the standard to the data, allowing mixtures of taxa and groups under the umbrella concept of 'biological entity'. See http://vocab.nerc.ac.uk/collection/S25/current/accepted/ for a listing of what I mean. This suited the requirements of the community we were serving, which was to provide a semantic framework that would cope with any biological or biogeochemical dataset that they threw at it. Darwin Core interoperability wasn't top of the agenda and semantic crosswalks between what we've put together for SeaDataNet and Darwin Core would require work (e.g. building mappings).

CF is at the stage where I think that interoperability with both SeaDataNet and Darwin Core could be made much easier by making the correct decisions at this stage. I'll think on this further and maybe have some off-line discussions before responding again to cf-convention/vocabularies#29 next week.

roy-lowry commented 3 years ago

@albenson-usgs This is a separate answer to your response to @davidhassell and myself. Vocabulary search engines in the past decade have been designed to search by default both labels and descriptions. Therefore our suggestion to extend the Standard Name descriptions should draw the attention of your hypothetical data manager to the fact that biological_taxon_name and scientificName are the same thing.
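A minimal sketch of that search behaviour, assuming a vocabulary held as a plain list of label/description pairs. The `search` function and the shortened descriptions are hypothetical and do not reproduce any real vocabulary server.

```python
# Illustrative entries only; the descriptions are abbreviated stand-ins.
vocabulary = [
    {
        "label": "biological_taxon_name",
        "description": (
            "The human-readable label for the taxon. "
            "This is equivalent to scientificName in the Darwin Core standard."
        ),
    },
    {
        "label": "sea_water_temperature",
        "description": "Placeholder description for an unrelated entry.",
    },
]


def search(entries, term):
    """Return labels of entries whose label or description contains the term."""
    needle = term.casefold()
    return [
        entry["label"]
        for entry in entries
        if needle in entry["label"].casefold()
        or needle in entry["description"].casefold()
    ]


# A Darwin Core user searching for their own term lands on the CF name.
hits = search(vocabulary, "scientificName")
print(hits)
```

Because the description is searched as well as the label, the Darwin Core term leads to the CF name without the CF label itself having to change.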

I'm not sure who is watching this thread, but having been through the Trac 99 debate I know there are people who would voice strong opinions against your suggested Standard Name label change. CF is a standard based on physics that requires other domains to explain themselves with great clarity. As a specialist in biogeochemical semantics this is something I've learned over the years!

timvdstap commented 3 years ago

Interesting discussion. It would of course be great if CF and DwC can synergize where possible, allowing a data provider to be both compliant with CF and DwC when providing a single term/column (instead of having to duplicate information). Following this thread to see what comes out of it! 👍

JonathanGregory commented 3 years ago

Dear all

I appreciate this thoughtful discussion and I agree with @roy-lowry that it's important to make the right choices carefully. I am one of those who would object to this proposed standard name change on the grounds already stated! While I'm generally against redundancy, I think the bad sort of redundancy occurs when things are said in two rather different ways that can be inconsistent without its being obvious. It is a less dangerous sort of redundancy when a given piece of information is repeated exactly the same. Therefore I think it wouldn't be bad if the CF biological_taxon_name and Darwin Core scientificName were both attached to a data variable with exactly the same value. This is easy to check automatically for consistency, by eye or by machine.

Jonathan
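A minimal sketch of such a machine check, assuming the two sets of taxon labels have already been read from the file into equal-length Python lists. The function name `find_mismatches` and the sample values are illustrative assumptions, and reading the values from netCDF is left out.

```python
def find_mismatches(cf_names, dwc_names):
    """Return (index, cf_value, dwc_value) for every pair that is not identical.

    The comparison is exact, since the point is that both labels should be
    spelled exactly the same.
    """
    if len(cf_names) != len(dwc_names):
        raise ValueError("the two taxon lists differ in length")
    return [
        (i, cf, dwc)
        for i, (cf, dwc) in enumerate(zip(cf_names, dwc_names))
        if cf != dwc
    ]


# Hypothetical values for the CF and Darwin Core labels of the same data.
biological_taxon_name = ["Calanus finmarchicus", "Dinoflagellata"]
scientific_name = ["Calanus finmarchicus", "Dinoflagellata"]

mismatches = find_mismatches(biological_taxon_name, scientific_name)
print("consistent" if not mismatches else mismatches)
```

An empty result means the two labels agree entry by entry; anything else pinpoints which entries diverge.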

roy-lowry commented 8 months ago

From discussions at 2021 CF and a subsequent Zoom meeting it emerged that biological_taxon_name and Darwin Core scientificName are not exact synonyms (the latter is broader because it doesn't require association with an identifier). Likewise biological_taxon_lsid is broader than scientific_name_id because there are other identification schemes. This relationship has been documented in the Standard Name description.

It also became clear that CF - a standard developed for global climate model data and based on text-unfriendly NetCDF - might not be the most appropriate standard for low-volume biological datasets that are usually handled in spreadsheets. Should the use case for going into CF be strong enough then they can be accommodated, but not as easily as encoding into Darwin Core. The sorts of biological dataset well-matched to CF are high volume data like model output, satellite images and data syntheses.

Consequently, it is proposed that no further action be taken on this ticket.

JonathanGregory commented 8 months ago

Thank you, @roy-lowry, for this useful summary. If no-one disagrees within the next three weeks (before Tuesday 23rd January) this issue will be closed, and labelled agreement not to change.

Thanks to all who contributed to the discussion. In particular, thanks to Abby Benson @albenson-usgs for raising it. Abby will be added to the list of contributors to the conventions.

Happy New Year

albenson-usgs commented 8 months ago

From my perspective this ticket can be closed.

JonathanGregory commented 8 months ago

> From my perspective this ticket can be closed.

Thanks, Abby @albenson-usgs

davidhassell commented 8 months ago

Thanks for resurrecting this issue - I also agree with the resolution.