Merge taxon & usage for simpler sharing?

mdoering commented 5 years ago

For the vast majority of publishers it will be unnatural to separate names and taxa, give each their own IDs, and reference between the two.

Discuss whether it's worth sharing the data in 3 different files (names, taxon, synonym) or whether it's better to pool all properties into a single name usage file like DwC and ACEF do. It would still be possible to have a nameID in addition to the taxonID (or then better called usageID) and even provide a usage/taxon status of none that marks bare names without a taxonomic assertion.

Name relations would then better make use of taxon/usageID. Sharing would become a lot simpler, but do we lose anything?

mdoering commented 5 years ago

Difficult in this model are split concepts / pro parte names, when the same name has multiple usages, e.g. an accepted taxon and a synonym, or several synonyms with different accepted names because the species was split.

The unified usage model would create the same number of taxa/synonyms as currently, but duplicate the name attributes. Not sure if that matters a lot.
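To make the duplication concrete, here is a minimal sketch. The field names (`nameID`, `taxonID`, `usageID`, `acceptedID`) and the example names are assumptions for illustration, not actual ColDP columns or real data. It flattens a normalised pro parte case into usage records: the number of usages equals taxa plus synonyms, and the attributes of the shared name are copied once per usage.

```python
# Hypothetical normalised data: the name n1 is used twice after a split,
# once as an accepted taxon and once as a pro parte synonym of t2.
names = {
    "n1": {"scientificName": "Aus bus", "authorship": "Smith, 1900"},
    "n2": {"scientificName": "Aus cus", "authorship": "Jones, 1950"},
}
taxa = [{"ID": "t1", "nameID": "n1"}, {"ID": "t2", "nameID": "n2"}]
synonyms = [{"nameID": "n1", "taxonID": "t2"}]


def to_usages(names, taxa, synonyms):
    """Flatten names, taxa and synonyms into one list of usage records."""
    usages = []
    for taxon in taxa:
        usages.append({"usageID": taxon["ID"], "status": "accepted",
                       **names[taxon["nameID"]]})
    for i, syn in enumerate(synonyms, start=1):
        # Synonyms have no ID of their own in this sketch, so one is made up.
        usages.append({"usageID": f"syn{i}", "status": "synonym",
                       "acceptedID": syn["taxonID"],
                       **names[syn["nameID"]]})
    return usages


usages = to_usages(names, taxa, synonyms)
# Usages that carry a copy of the attributes of name n1.
copies = [u for u in usages if u["scientificName"] == "Aus bus"]
print(len(usages), len(copies))
```

The record count does not grow compared with the separate tables; only the name attributes of `n1` appear twice.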

gdower commented 5 years ago

I think we should keep the schema as is and solicit feedback from the GSDs. In particular, getting feedback from Species Files, WoRMS and ITIS might be really helpful if they are willing to work on producing CoLDP exports.

Retaining the GSDs' stable IDs is important. The same IDs can still be used for names and taxa when GSDs don't have separate IDs for taxa and names.

If anything, I would recommend modifying the CoLDP importer to allow a basic CoLDP submission format that would allow taxa, names and synonyms to be included in a flattened taxon table (while still allowing references to distribution, media, vernacular names, and descriptions). We could provide different templates for either basic or advanced CoLDP submission, and give the pros and cons for the two CoLDP submission options.
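As a rough sketch of what such a basic submission could look like on the importer side: the function below splits flat rows into name, taxon and synonym records. The column names (`ID`, `parentID`, `acceptedID`, `status`, `scientificName`) and the function name are assumptions for illustration and do not describe the actual CoLDP importer.

```python
def split_flat_rows(rows):
    """Split flat rows into name, taxon and synonym records.

    The same ID is reused for the name and the taxon, which only works
    when the source has no separate name and taxon IDs.
    """
    names, taxa, synonyms = [], [], []
    for row in rows:
        names.append({"ID": row["ID"],
                      "scientificName": row["scientificName"]})
        if row["status"] == "accepted":
            taxa.append({"ID": row["ID"], "nameID": row["ID"],
                         "parentID": row["parentID"]})
        elif row["status"] == "synonym":
            synonyms.append({"nameID": row["ID"],
                             "taxonID": row["acceptedID"]})
        # Rows with any other status stay bare names.
    return names, taxa, synonyms


# Made-up example rows from a hypothetical flattened taxon table.
flat = [
    {"ID": "1", "parentID": "", "acceptedID": "",
     "status": "accepted", "scientificName": "Aus"},
    {"ID": "2", "parentID": "1", "acceptedID": "",
     "status": "accepted", "scientificName": "Aus bus"},
    {"ID": "3", "parentID": "", "acceptedID": "2",
     "status": "synonym", "scientificName": "Aus cus"},
]
names, taxa, synonyms = split_flat_rows(flat)
print(len(names), len(taxa), len(synonyms))
```

Because the flat ID doubles as name ID and taxon ID, a GSD's stable identifiers survive the split unchanged.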

ayco-at-naturalis commented 5 years ago

Agree with Geoff and currently not much to add to it

mdoering commented 4 years ago

This is a key question to settle before we can release ColDP, and one that in my mind is still open. When thinking about offline editing, the split of names and taxa/synonyms is a real blocker. We lose all humans and make ColDP mostly a machine format even though all other parts are designed to be human friendly.

If we'd merge name, taxon and synonym into NameUsage as it is in DwC or Plazi treatments I do not see any critical downsides. We would:

- use the full list of all taxonomic status incl synonym status and make that required
- use parentID for synonyms to point to the accepted and for taxa to point to the next higher taxon in the classification
- allow and recommend multiple values in parentID for synonyms so we can model pro parte synonyms pointing to multiple accepted names without the need to duplicate records
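
The three points above could be sketched as follows. This is only an illustration under assumptions: the column names, the comma as separator for multiple `parentID` values, and the example names are all made up and not part of any ColDP specification.

```python
import csv
import io

# Hypothetical merged table; "s1" is a pro parte synonym of both t1 and t2.
NAME_USAGE_TSV = (
    "ID\tparentID\tstatus\tscientificName\n"
    "g1\t\taccepted\tAus\n"
    "t1\tg1\taccepted\tAus bus\n"
    "t2\tg1\taccepted\tAus cus\n"
    "s1\tt1,t2\tsynonym\tAus dus\n"
)


def read_usages(text):
    """Parse the TSV and turn parentID into a list of IDs."""
    usages = {}
    for row in csv.DictReader(io.StringIO(text), delimiter="\t"):
        row["parentID"] = [p for p in row["parentID"].split(",") if p]
        usages[row["ID"]] = row
    return usages


def accepted_for(usages, usage_id):
    """Return the accepted usage IDs that a usage resolves to."""
    usage = usages[usage_id]
    if usage["status"] == "accepted":
        return [usage_id]
    # For synonyms, parentID lists the accepted usages.
    return usage["parentID"]


usages = read_usages(NAME_USAGE_TSV)
print(accepted_for(usages, "s1"))
```

Note that in this sketch the meaning of `parentID` depends on `status`: for accepted usages it is the classification parent, for synonyms it is the list of accepted usages.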

What do we lose?

gdower commented 4 years ago

A big thing we'd lose at this point is the existing work to convert data sources to CoLDP, which might frustrate people that have already put in the effort to switch. It would be ideal if we can adapt the Taxon table to allow flatter submission, while still allowing the Name and Synonym tables, or possibly we should make a totally new format that is more human-friendly.

CoLDP might be a good exchange format between infrastructures (TaxonWorks :left_right_arrow: CoL Clearinghouse).

mdoering commented 4 years ago

It might be best to offer a name_usage.tsv file as an alternative option instead of Name, Synonym and Taxon. This would be a good candidate for offline editing. The fields would largely be just a union of the individual tables.

Not sure if the denormalized fields are needed in this case.
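
A minimal sketch of "a union of the individual tables", assuming made-up, shortened column lists for the three tables (the real tables have more fields): the merged header keeps every column once, in order of first appearance.

```python
def union_columns(*tables):
    """Return all column names once, in order of first appearance."""
    seen = {}
    for columns in tables:
        for column in columns:
            seen.setdefault(column, None)
    return list(seen)


# Hypothetical, shortened column lists for illustration only.
name_cols = ["ID", "scientificName", "authorship", "rank"]
taxon_cols = ["ID", "nameID", "parentID"]
synonym_cols = ["nameID", "taxonID", "status"]

header = union_columns(name_cols, taxon_cols, synonym_cols)
print("\t".join(header))
```

Columns that exist only to join the tables, such as `nameID` here, would still show up in a plain union and may turn out to be redundant in a merged file.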

mdoering commented 4 years ago

@gdower what if we go with just a merged NameUsage table in ColDP, but keep unofficial support for the 3 separate tables for some time? I feel we should rather avoid too many options.

gdower commented 4 years ago

@mdoering, in my opinion, hiding complexity doesn't really simplify things and instead it just makes it harder to use the software because people won't be able to find the info they need. I might be able to create a simple flowchart with questions about how a contributor's data is organized that would direct them to the CoLDP format and specific documentation that best suits their needs. For some contributors, the 3 table approach might work best for their data model.

mdoering commented 4 years ago

@gdower thanks Geoff, but why keep the separation of name & usage if no one makes use of it? Can you point me to one resource where it's truly better than having a single name usage entity? I feel we were way too theoretical or idealistic when designing it that way.

gdower commented 4 years ago

The problem is that the 3-table approach is already in use, and it will take additional work to switch to the NameUsage table. If we make CoLDP too much of a moving target by drastically changing it and not maintaining backwards compatibility, it might also discourage adoption.

TaxonWorks' and ITIS' data models are pretty similar.

jliljeblad commented 4 years ago

In Dyntaxa, we're theoretically distinguishing between a name and nameUsage table, but in practice they are merged in a 1:1 relationship. Instead, if the exact same name is used differently in another instance, we keep track of that by having a name relationship (is the same) that tells you that the two instances are using an identical name (but with different usages and potentially tied to another taxon). We did this partly because in our old system we already had the same name duplicated this way when tied to different taxa before and after a split. So because Dyntaxa is built with the taxon at its basis we needed to deal with this legacy "shortcoming".

mdoering commented 4 years ago

Thanks @jliljeblad, I found the same in GBIF datasets. Even in nomenclatural data it is based on usages of a name and the spelling might even slightly vary between the "same" name. Keeping the exact version of a name per usage therefore is important.

I also believe it is better to relate name usages, instead of forcing a name instance to be reused. The Clearinghouse has dataset "silos". Technically names cannot be shared between datasets and instead we relate them, saying 2 names are the same. Sharing the same name record is really only possible within a single dataset. And that is rather rare. The only real cases I know of are pro parte synonyms in which the same name has several accepted names. But even that could be dealt with in a merged NameUsage model. The same name usage simply has 2 accepted name usages.

The other case I know of is when sharing true taxon concept data where there are multiple, concurrent concepts for the same name. Or concurrent classifications. We do not really have those datasets in the Clearinghouse though, but might get more into this in the future. Still, I'm not convinced we would a) easily get data with normalized names and b) benefit much from normalized names. We can always just have multiple name usages based on the same name. That part worked fine in DwC I think.

@gdower as for stability we always said ColDP is not released yet. And when we do so in June it will indeed have to be frozen. So the upcoming 2 months are pretty much our only time to still adjust.

And we definitely want a merged NameUsage table so we can use ColDP for offline editing.

jliljeblad commented 4 years ago

@mdoering Oh, just realized that there's potential for a misunderstanding here. With nameUsage I mean just that: how a name is used in relation to a taxon concept, rather similar to the table you call Synonym but also including accepted, such as: accepted, synonym, ambiguous synonym or misapplied.

mjy commented 4 years ago

I'm not worried about what the formats ultimately are, we can update what we export in TaxonWorks (where one instance currently includes > 400k taxa, names, name relationships, name statuses, and citations all in different tables, just for reference). If time is being taken to change format, and we are serious about justifying what format is landed on, then I think 2 things are needed, otherwise it's all she-said-they-said:

1) Formalized use cases that are shared with all, and that ultimately include details as to how the ColDP format answers the use case. I'm thinking of a format with three elements: as an X I need Y to do Z. I think Z is important (I caught myself not providing it below): it's the outcome that we need to focus on.
2) Unit tests driven in a CI environment. Given a ColDP format, and a script, I can answer a use case.
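
A minimal sketch of what point 2 could look like, written as a pytest-style check for the "list of Taxa, not names" use case described below. The record layout, the `status` values and the function name `list_taxa` are assumptions for illustration, not an existing ColDP tool.

```python
def list_taxa(usages):
    """Return the IDs of accepted usages, i.e. the taxa rather than all names."""
    return [u["ID"] for u in usages if u["status"] == "accepted"]


def test_use_case_1():
    # Tiny made-up fixture standing in for a parsed data package.
    usages = [
        {"ID": "t1", "status": "accepted", "scientificName": "Aus bus"},
        {"ID": "s1", "status": "synonym", "scientificName": "Aus cus"},
        {"ID": "n1", "status": "bare name", "scientificName": "Aus dus"},
    ]
    # As a catalogue builder I need only the taxa to compile the list.
    assert list_taxa(usages) == ["t1"]


# A CI job would normally let a test runner collect this; calling it
# directly keeps the sketch self-contained.
test_use_case_1()
```

Each formalized use case would get one such script and check, so a format change that breaks a use case fails visibly.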

I suspect you won't have time to do 2, so I think the focus might be 1. The following is a little thought experiment, with only the first use case my personal concern. I am writing this under the assumption that the ColDP format, as I understand, is to serve the Catalog of Life specifically (otherwise why not just use DWCA, ITIS, or some other format). If this is not the case, then I have no stake one way or the other.

Use case 1 - As a Catalog of Life builder I need a list of Taxa (not names) to compile into the Catalog

This is critical, the Catalog of Life, as I understand it, is first and foremost intended to be a list of Taxa (biological entities, i.e. hypotheses of circumscription). It is important to note that there are many "nameless" taxa out there, so basing everything on NameUsage seems to (greatly IMO) limit ultimately addressing this point.

Consequence: To nail down these semantics, and emphasize this point to users, data providers, etc., I feel strongly that one of the CSV files should be "Taxa", not NameUsages, not Names, not Synonyms. This drives home the point that Names are not Taxa, even if we often treat them 1:1. If folks want NameUsages in the strict sense, why not ZooBank?

Use case 2 - As a Taxonomist I want to understand the nomenclatural synonymy of a Taxon so that I can track down the related original descriptions.

While I have heard at times that the CoL is "not for taxonomists", it seems clear that it should serve them.

Consequence - I need to know NameUsages and Synonymy for those names that are governed, not UUIDs, Common Names, temporary names, or any other type of string that might represent a "nameless" taxon. It seems some tables must be scoped or defined to address these distinctions.

Use case 3 - As a Researcher who does not understand Nomenclature I want to find the literature that treats my Taxon so that I can build a bibliography.

I can grok what a species list is, but what the heck is a "NameUsage"?

Consequence - I need to know my taxon, so I can start from one point, then trace outwards to find names, and citations (usages). It seems that providing end users "NameUsage" as the core may ultimately confuse, how will they know where to start? Of course, this question could possibly be addressed in post-processing/reports (unit test idea above) that provide the summary required.

etc. etc.

mjy commented 4 years ago

Caught this: "use parentID for synonyms to point to the accepted and for taxa to point to the next higher taxon in the classification". I'm not sure if I fully understand, but I suspect I do. This is a (very) bad idea in my experience. It leads to all sorts of persistence/interpretation confusion down the road. Use parentID as the classification hierarchy, use a separate edge/id for synonymy. Don't overload meaning in the values. Keep data used in presentation (I want to see this name under that name, so parentID, for reasons) separate from the meaning of data.
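
A minimal sketch of the separation argued for here, assuming a hypothetical `acceptedID` field as the synonymy edge next to `parentID` (both the field name and the example records are made up): the hierarchy walk only ever follows `parentID`, and synonymy is resolved through its own edge.

```python
# Hypothetical usage records with two distinct edges.
usages = {
    "g1": {"parentID": None, "acceptedID": None, "name": "Aus"},
    "t1": {"parentID": "g1", "acceptedID": None, "name": "Aus bus"},
    "s1": {"parentID": None, "acceptedID": "t1", "name": "Aus cus"},
}


def classification(usages, usage_id):
    """Walk parentID only; synonymy never leaks into the hierarchy."""
    path = []
    current = usage_id
    while current is not None:
        path.append(current)
        current = usages[current]["parentID"]
    return path


def accepted(usages, usage_id):
    """Follow the synonymy edge if present, else the usage itself."""
    return usages[usage_id]["acceptedID"] or usage_id


# Resolve the synonym first, then read the classification of the taxon.
print(classification(usages, accepted(usages, "s1")))
```

With two edges, neither function needs to inspect a status value to decide what an ID means.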
