AAFC-BICoE / dina-planning

AAFC-DINA planning repository

DINA data management module must support taxonomy for hybrids, cultivars and varietals AND discussions on requirements for 'identification' fields #163

Open heathercole opened 3 years ago

heathercole commented 3 years ago

GIVEN I have accessed DINA as data manager

WHEN specimens have determinations/taxonomy which include hybrids, cultivars and varietals

THEN the taxonomy fields in the DINA data management module must support entering these values correctly into structured fields

AND support associated information (eg. parent species of hybrids)

Attached is a related screenshot from Specify, but @shannonasencio should be consulted for the full requirements.

dshorthouse commented 3 years ago

There's some jumbled terminology and concepts here. Determinations and taxonomy are two different data types and ought to be handled in distinct, non-overlapping ways. First thing to address here is why you would want structured fields (which are what exactly?) for the scientific names in determinations vs structured fields for the scientific names in a managed taxonomy or a classification.

michellelocke commented 3 years ago

I think this is mostly in relation to the taxonomy database and the taxon field. I presume determinations are going to be a text field that will not be linked to anything. I think it is just a misuse of the word determinations here. We are talking about structured fields and a managed taxonomy module.

For CNC we have some hybrids, but these are not common. We do not have varietals or cultivars. Hybrids are most commonly found in Lepidoptera (butterflies). It would be best to consult some of the experts who signed up to be consulted for Taxonomy in our original workshops in 2019 and ask them how to properly deal with taxonomy issues as well as Collections staff. I believe that Jean-Francois Landry signed up to be consulted on Taxonomy and as he is one of our Lepidopterists it would be good to ask him how best to deal with hybrids in the Entomology Taxonomy Database.

dshorthouse commented 3 years ago

+1 for determinations as unlinked textfields. Let's please not over-engineer a relationship between the components of a det and those of taxonomy as is often attempted (and fails miserably) in collections management.

The scientific names on a det. should be free-form, not in separate fields like Genus Specific Epithet, etc. & not directly linked to the scientific names in a consensus-based taxonomy/classification. As such, there is never any problem with inclusion of whatever var, f. or hybrid symbols required because this is what the determiner has written & the parts of the name are not parsed out for storage. The association between the concept of the taxon as represented by a det. to that of one in a taxonomy/classification is a secondary process much like what we've done with agents. If you want to see why a freeform textfield is useful for dets. & placement in a hierarchy a secondary process, look no further than older type specimens whose taxon has since been synonymized.

Structured fields are often useful when you're building a consensus-based classification & need to care about parent-child relationships, but these have no direct bearing on determinations. And so, this ticket needs to be split into multiple parts because it is over-reaching in its intent. We need a ticket for dets that rest within catalogued objects / physical entities and several more for scientific names & concepts with possibly multiple classifications that rest within Taxonomy.

rintoult commented 3 years ago

Fungal taxonomy ranks below species will need to be confirmed with the Mycology group: variety - var., subvariety - subvar., subspecies - subsp., forma - f., species complex, forma specialis - f.sp., conferre - c.f.

shannonasencio commented 3 years ago

@dshorthouse What is the purpose of building/maintaining a taxonomy table if not for linking to specimen records as determinations (and literature citations)? I have strong feelings against free-form entry for taxonomic determinations. It is slow, but most importantly allows for the avoidable introduction of human error (thus affecting the queryability of records based on identification, which is the most common search parameter for specimen records). Also of tremendous importance is the ability to capture synonymy in the taxonomy table, thus allowing for the return of all applicable specimen records in a query. If a determination does not exactly match a published scientific name, that should be captured in a determination remarks field in the specimen record.

As for handling infraspecific ranks, there are solutions that do not require direct data entry on specimen records as they should be handled in the taxonomy table. You just need fields for them. What should be handled in the determination portion of specimen records are qualifiers (e.g. species suffix = s.l.; species prefix = cf.) as those are specific to individual specimens.

This warrants a discussion at one of our meetings. The development team should not be making decisions of this nature on behalf of collection managers.

dshorthouse commented 3 years ago

@shannonasencio We're on the same page here, but we differ in implementation. This will be one of the more ropy aspects of DINA and does deserve much discussion. Recall that the Taxonomy working group during one of our workshops agreed that determinations ought to be drawn from a big bag of flat scientific names & that classifications/taxonomies are higher-order concepts. We could also state that collection managers should not be making decisions on behalf of the taxonomists that will be using DINA. There is a big difference here between managing a collection & faithfully representing the scientific names associated with determinations vs. the responsibility in managing and maintaining hierarchies and their taxon concepts. We'll need to reconvene our working group to nail down an operational plan that pleases all stakeholders.

heathercole commented 3 years ago

The priority of this data system is for the collection managers. It was indicated that modules for research requirements and species concepts would be evaluated later.

Determinations are just one of the types of specimen annotations that need to be present in the system. Species names associated to collections records absolutely must connect to a taxonomy resource and not be stand-alone text values in a form. The taxonomy must support complex species with multiple hierarchical levels which must be identifiable. When data is exported, the taxonomic ranks associated with a species' name must be clear.

heathercole commented 3 years ago

@michellelocke can you clarify "I presume determinations are going to be a text field that will not be linked to anything"? I thought you had indicated that when data entry happens, you search existing names, and use a 'working name' if the appropriate taxonomy is not already in the tree and needs to be added?

The idea is that (similar to geography), typos are reduced and data entry is optimized when users can pick from an existing list, rather than entering in manually each time.

dshorthouse commented 3 years ago

> the priority of this data system is for the collection managers

Exactly. And this is why we need clear delineation between determinations (collection manager responsibility) and Taxonomy (unclear yet whose responsibility this is to maintain, how many of them there are & from where they are drawn). There can be very tight integration between these two to the extent that changing hierarchies and nomenclature in Taxonomy likewise cascades to all determinations. Or, these two can be loosely coupled, at arms length from one another such that determinations remain as static, unchanging entities.

dshorthouse commented 3 years ago

> rather than entering in manually each time

There's no need for manual entry with determinations. They can be drawn from the names held within Taxonomy. The difference however is that having selected a name here, there is no hard, relational link.

heathercole commented 3 years ago

It is vital for effective collection management that connections between taxon names and specimen records can be made. If a species gets a new name, that information needs to be added to the system and connected to the data records to which the new name relates. There must be connectivity at the species name level, as it would not be possible to update every related record manually.

There certainly seem to be options for implementation, but this also relates to the requirements about tags and flags, where there is a clear requirement for species names/taxonomy to be linked to records so that species status on protected and controlled lists informs collection managers at the level of the specimen record to which it applies.

https://github.com/AAFC-BICoE/dina-planning/issues/90

It is not acceptable for the species names in specimen records to be text fields only, which do not connect to a taxonomy resource in some way.

heathercole commented 3 years ago

> rather than entering in manually each time

> There's no need for manual entry with determinations. They can be drawn from the names held within Taxonomy. The difference however is that having selected a name here, there is no hard, relational link.

There needs to be a link, so that if that species name is changed/synonymized, the specimen record is likewise updated.

An example would be that if a new synonym is added to Taxonomy, any record with either the old species name or new name would be returned in a query. This is needed functionality that currently exists for managers.
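As a rough illustration of the query behaviour described here, the following minimal Python sketch resolves a queried name to its preferred name and returns every record filed under any equivalent name. The `Taxonomy` class, the `find_specimens` function, the record layout, and the placeholder names "Aus bus" and "Xus bus" are all assumptions made for illustration; none of them come from DINA or Specify.

```python
class Taxonomy:
    """Hypothetical in-memory names table; not DINA's or Specify's actual schema."""

    def __init__(self):
        # Maps each known name to its preferred name (itself when accepted).
        self._preferred = {}

    def add_name(self, name):
        self._preferred.setdefault(name, name)

    def add_synonym(self, old_name, preferred_name):
        """Record that old_name is now a synonym of preferred_name."""
        self.add_name(preferred_name)
        self._preferred[old_name] = preferred_name

    def resolve(self, name):
        """Follow synonym links until an accepted name is reached."""
        seen = set()
        while name not in seen and self._preferred.get(name, name) != name:
            seen.add(name)
            name = self._preferred[name]
        return name

    def equivalent_names(self, name):
        """Every known name that resolves to the same accepted name."""
        target = self.resolve(name)
        return {n for n in self._preferred if self.resolve(n) == target} | {target}


def find_specimens(specimens, taxonomy, query_name):
    """Return catalogue numbers whose stored name matches the query or a synonym."""
    names = taxonomy.equivalent_names(query_name)
    return [s["catalog_number"] for s in specimens if s["taxon_name"] in names]


# "Aus bus" and "Xus bus" are placeholder names, not real taxa.
taxonomy = Taxonomy()
taxonomy.add_name("Aus bus")
taxonomy.add_synonym("Xus bus", "Aus bus")

specimens = [
    {"catalog_number": "S1", "taxon_name": "Xus bus"},
    {"catalog_number": "S2", "taxon_name": "Aus bus"},
    {"catalog_number": "S3", "taxon_name": "Aus cus"},
]
```

In this sketch, `find_specimens(specimens, taxonomy, "Aus bus")` and `find_specimens(specimens, taxonomy, "Xus bus")` both return S1 and S2, while the name stored on each record is left untouched.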

heathercole commented 3 years ago

"Taxonomy" cannot simply be a list. It must be able to maintain relationships (eg. synonymy and preferred names), it must be editable, and it must be able to maintain a hierarchy for taxonomic ranks.

There must be rules so that you can't add a species name (eg. species rank) without associating it to a genus. Similarly, you can't have a subspecies without a species, and formas may exist below different ranks.

eg. Taxonomy ranks below species that will need to be confirmed with the Mycology group: variety - var., subvariety - subvar., subspecies - subsp., forma - f., species complex, forma specialis - f.sp., conferre - c.f.

Several collections already make use of similar functionality, and would not be able to work efficiently in a data management module without some relationship between collection records and structured taxonomy. It would be a 'deal-breaker'.
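The parent-rank rules mentioned above could be sketched roughly as follows. This is only an illustration: the `ALLOWED_PARENTS` table, the `TaxonNode` class, and the `validate` function are hypothetical, and the rank list is deliberately incomplete and would need to be confirmed with each collection and the relevant nomenclatural Code.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical rule table: rank -> ranks allowed as its direct parent.
# None means "no parent required" (genus is the root in this small sketch).
ALLOWED_PARENTS = {
    "genus": {None},
    "species": {"genus"},
    "subspecies": {"species"},
    "variety": {"species", "subspecies"},
    # A forma may sit below different ranks.
    "forma": {"species", "subspecies", "variety"},
}


@dataclass
class TaxonNode:
    name: str
    rank: str
    parent: Optional["TaxonNode"] = None


def validate(node):
    """Raise ValueError if the node's parent rank is not allowed for its rank."""
    allowed = ALLOWED_PARENTS.get(node.rank)
    if allowed is None:
        raise ValueError(f"unknown rank: {node.rank}")
    parent_rank = node.parent.rank if node.parent else None
    if parent_rank not in allowed:
        raise ValueError(
            f"{node.rank} '{node.name}' cannot be placed under {parent_rank}"
        )


# Placeholder names only.
genus = TaxonNode("Aus", "genus")
species = TaxonNode("bus", "species", parent=genus)
forma = TaxonNode("dus", "forma", parent=species)
for node in (genus, species, forma):
    validate(node)  # all three pass
```

With this table, `validate(TaxonNode("bus", "species"))` raises a `ValueError` because a species has no genus, and so does a subspecies attached directly to a genus.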

michellelocke commented 3 years ago

@heathercole We are confusing two different fields/types of fields here. I'm using terminology that the CNC uses, but when I read @dshorthouse's first comment it sounds to me like he is using similar terminology.

1) Taxon: This indicates what taxon the specimen is and is linked to a Taxonomy Table/Taxonomic hierarchy and will help reduce spelling errors. In CNC we have a separate database or module to manage names, and they are linked to specimen records. If a name is synonymized in the taxonomy module/table/database then that will be automatically reflected in the specimen records in the Taxon field.

2) Determination: This is a text field that captures the verbatim determination information on labels (ex: Dasysyrphus limatus, Det. M.M. Locke, 2021). This can house spelling errors as it is verbatim and will indicate exactly what the expert wrote on the label. There can be multiple Determinations as there can be multiple determinations on a specimen.

I think this terminology is what is causing the majority of the confusion in this discussion and we are just using terms differently. I have faith that DINA will be able to handle both a Taxon field linked to a taxonomy tree and Determinations, which are basically verbatim of labels.

There are some unique issues with taxonomy for each of our groups, including ranks. I'm sure those will all be dealt with and it is good to highlight some of those needs. As our taxonomists are experts on the unique issues of each group's taxonomy, I would hope that they would be the ones consulted, or at least added into the consultation process. I'm sure Collections Managers are quite well versed in taxonomic concepts, some may even be experts as well, but we need to be including the taxonomists in this process. This database is for more than just Collections Managers and the Taxonomy component is a vital part of Taxonomists' work (they are the ones who put names on the specimens we are trying to manage).

dshorthouse commented 3 years ago

> There's no need for manual entry with determinations. They can be drawn from the names held within Taxonomy. The difference however is that having selected a name here, there is no hard, relational link.

> there needs to be a link, so that if that species name is changed/synonymized, the specimen record is likewise updated.

I think we're talking at cross-purposes here. Under no circumstances should a scientific name on a verbatim determination be changed on behalf of the determiner through a change to a scientific name held in Taxonomy. Scientific names in verbatim determinations must always remain static because they are versions of record. What we have to come to grips with here is what, if any, is the full suite of fields we want in verbatim determinations.

There are pros and cons to either a single field for the scientific names in verbatim determinations, as @michellelocke describes, or a set of fields to contain the parsed bits of those scientific names in "verbatim" determinations. If parsed into separate fields, there is an inclination to make hard, relational links to names (at whatever hierarchical level) held in Taxonomy as a way to facilitate data entry, and that's where we get ourselves into trouble. Pick-lists for genus, species, subspecies, etc., used while mentally deconstructing a scientific name on a verbatim determination, seem useful in the short term but almost always cause trouble because we confuse the version of record with the placement in a hierarchy. Hierarchies, synonymies, and concepts experience independent flux relative to specimen records and their verbatim determinations. And so, we need a mechanism to disentangle the two such that neither workflow (recording verbatim determinations or updating the Taxonomy) directly affects or impinges on the other in a way that would do harm. A deconstructed scientific name on a verbatim determination then also needs to be reconstructed if we want to share our data with GBIF while simultaneously representing what the determiner wrote, assuming we'd want to share determination histories as is commonly done. That logic can be a nightmare unless we accurately encode all the necessary rules of nomenclature from all the Codes. Not an easy task, requiring years of development, not months.

Clearly, we need a way to streamline data entry on verbatim determinations and reduce the error in transcription. But we also have to question whether or not it's the transcriber's responsibility to simultaneously make functional, asserted linkages to the concepts held within Taxonomy, which is what is actually taking place if parts of names are chosen from pick-lists while entering data about the scientific names in those same determinations.

Again, verbatim determinations and a specimen's relational links to one or more entries in Taxonomy ought to be distinct such that we have the freedom to update/change the latter without breaking what the determiner intended. We should have the capacity to change a specimen's link to Taxonomy without changing the content of any of its determinations. And, we should have the freedom to enter a new determination without having to first make new entries in Taxonomy (eg a det that has a vernacular name). There are also real-world examples where a specimen could have a single verbatim determination and yet can also be validly linked to more than one leaf node in Taxonomy (eg ambiregnal taxa).
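One way to picture this loose coupling is the minimal Python sketch below. It assumes a hypothetical `Specimen` record that holds immutable verbatim determinations alongside a separately editable set of links into Taxonomy. The class names, the field names, and the identifiers such as `taxon:101` are invented for illustration and are not a proposed DINA schema.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class Determination:
    """Verbatim version of record; frozen so it cannot be edited after entry."""
    verbatim_name: str
    determiner: str
    year: int


@dataclass
class Specimen:
    catalog_number: str
    determinations: List[Determination] = field(default_factory=list)
    # Links to Taxonomy entries; zero, one, or several (eg ambiregnal taxa).
    taxon_ids: Set[str] = field(default_factory=set)

    def add_determination(self, det):
        """A new determination does not require any Taxonomy entry to exist."""
        self.determinations.append(det)

    def relink(self, taxon_ids):
        """Change the Taxonomy links without touching any determination."""
        self.taxon_ids = set(taxon_ids)


# The determination text is the example given earlier in this thread;
# the catalogue number and taxon identifiers are hypothetical.
det = Determination("Dasysyrphus limatus", "M.M. Locke", 2021)
specimen = Specimen("EXAMPLE-0001")
specimen.add_determination(det)
specimen.relink({"taxon:101"})
specimen.relink({"taxon:101", "taxon:202"})
```

After both calls to `relink`, the specimen points at two Taxonomy entries while its single determination is unchanged; attempting to assign to `det.verbatim_name` raises an error because the dataclass is frozen.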

heathercole commented 3 years ago

What I meant by "determination" was the group of data fields which relate to the representation of a particular type of annotation. A determination includes species name, a determiner, date, remarks and several fields relating to taxonomy.

If it is clearer to say that it is the "taxon", rather than the "determination", associated with a specimen record that needs to link to a taxonomy resource (one that can be maintained with synonymy and the flags and tags requirements), then that is fine. But there need to be informative links between the taxonomy resource and specimen records, with more complex functionality than "point/select".

heathercole commented 3 years ago

image

heathercole commented 3 years ago

image

dshorthouse commented 3 years ago

> A determination includes species name, a determiner, date, remarks and several fields relating to taxonomy.

What are the "several fields relating to taxonomy", please? What you show here from Specify is an entry in Taxonomy and one for a determination, but I do not see any of the former embedded in the latter. They are nicely and cleanly separate, assuming the "Taxon" string in a determination (your "species name") does not arbitrarily and opaquely change with adjustments in Taxonomy. If it does, then this is a severe design flaw in Specify.

heathercole commented 3 years ago

What I meant by "several fields relating to taxonomy" was the type status, as well as the qualifier and addendum (related to confidence of the determination). These fields relate specifically to the species name/taxon being assigned with the determination. I perhaps should have just said "and other fields relating to the determination".

In the case of Specify, each species name has a record in the taxonomy "tree" as shown by the first screenshot. The "Taxon" field directly links to the structured taxonomy "tree", so if a record is changed/updated in the tree, all the records which "point" there ARE subsequently updated.

I am not sharing these screenshots saying they are the requirement, only as an example of related functionality.

(edit: this may not represent all the related requirements for all determination data fields; CMs will be happy to provide those whenever requested)

heathercole commented 3 years ago

image

dshorthouse commented 3 years ago

> What I meant by "several fields relating to taxonomy" are the type status, as well as qualifier, and addendum (related to confidence of the determination). These fields relate specifically to the species name/taxon being assigned with the determination. I perhaps should have just said "and other fields relating to the determination".

OK, phew! These are more nomenclatural than taxonomic. I also see here that a scientific name (called a "Taxon") in your screenshot of a determination is NOT parsed into bits. It's a full string and includes things like hybrid symbols. All good and exactly how we'd expect.

> In the case of Specify, each species name has a record in the taxonomy "tree" as shown by the first screenshot. The "Taxon" field directly links to the structured taxonomy "tree", so if a record is changed/updated in the tree, all the records which "point" there ARE subsequently updated.

This is where I'm shocked. Do you mean to say that change in a Taxon entry (eg suppression of a hybrid designation) will propagate to all determinations' "Taxon" entries such that it will newly appear as if Donald Britton applied this now edited scientific name? What happened then to the scientific name he originally applied to the specimen?

heathercole commented 3 years ago

"Do you mean to say that change in a Taxon entry (eg suppression of a hybrid designation) will propagate to all determinations' "Taxon" entries such that it will newly appear that Donald Britton made that determination? What happened then to the scientific name he originally applied to the specimen?"

This would depend on many different things. "Determiners" are notoriously lazy when it comes to including the appropriate information associated with a scientific name (eg. species name authority). On a label, they may abbreviate it or leave it out altogether. In this case, many managers would opt to make an assertion relating to this (similar to correcting typos). Or, as another use case: if a student entered 100 records all with the same taxonomy, and that included a typo, you need to be able to fix them all at once (not manually one-by-one).

Alternately, this relates to the requirement for links/relationships to maintained taxonomy. If there was a "brand new name" applied to an old name, that name could be added to the tree, and the "old name" linked to it, identifying the "new name" as the preferred. It would NOT change the data record, but would then enable the functionality where a query on either name returns the related record(s).

This also relates to how different collection managers choose the most efficient ways of working for each of their collections. CNC maintains a very strict "tree" focusing on "accepted names" (but needs a field to capture/record names that aren't in there yet). In other cases, the name on the label may be the "truth", but perhaps only until association/assertion to "accepted names" is possible. This may also relate to which permissions govern who can add names, which may vary by collection as well.

heathercole commented 3 years ago

> OK, phew! These are more nomenclatural than taxonomic. I also see here that a scientific name (called a "Taxon") in your screenshot of a determination is NOT parsed into bits. It's a full string and includes things like hybrid symbols. All good and exactly how we'd expect.

Specify has each 'piece' in its own structure, where the standard 'view' is called "full-name". The way the "full-name" is displayed is based on system defined (but user-customizable) "rules" relating to proper display of the full-name (eg. Genus species ssp. subspecies)

With current settings, Specify is 'told' to NOT show family (or higher) names with the 'full-name' display, although the ranks are part of the 'tree'

image

Also, you can see an issue: 'cultivar' had to be included in 2 spots because, nomenclaturally, the rank can be used in different positions within the hierarchy.
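To make the idea of a rule-driven "full-name" display concrete, here is a minimal Python sketch of assembling a display string from separately stored pieces. It is not Specify's implementation: the `DISPLAY_RULE` list, the `full_name` function, and the placeholder name parts are assumptions for illustration only.

```python
# Hypothetical display rule: the ranks to show, in order, each with the
# connecting term printed before it (empty string for none). Ranks that are
# not listed, such as family, are stored but never displayed.
DISPLAY_RULE = [
    ("genus", ""),
    ("species", ""),
    ("subspecies", "ssp."),
    ("variety", "var."),
    ("forma", "f."),
]


def full_name(parts):
    """Build a display name from a dict of rank -> stored name piece."""
    tokens = []
    for rank, marker in DISPLAY_RULE:
        value = parts.get(rank)
        if not value:
            continue  # rank not present for this taxon
        if marker:
            tokens.append(marker)
        tokens.append(value)
    return " ".join(tokens)


# Placeholder pieces, not a real taxon.
example = {"family": "Aidae", "genus": "Aus", "species": "bus", "subspecies": "cus"}
```

Here `full_name(example)` gives "Aus bus ssp. cus": the family is held in the record but left out of the display because the rule does not list it. A rank such as cultivar, which can occur at more than one position, would need more than one entry or a richer rule than this flat list.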

michellelocke commented 3 years ago

@heathercole I would respectfully disagree. I do not want to change typos, or expand short forms in my determinations. These were written by a professional and the idea is to capture exactly what they wrote, even if incorrect. A transcriber may think that they have come across incorrect information (this is not only students, but techs and CMs) and want to "correct" it, but only an expert who is doing a revision on that group may understand that the thing the transcriber perceived as a mistake is actually important data. It is vital that we preserve the determinations as is. This is also important to show a specific species concept at a specific time. As concepts change we need to preserve what it was determined to be at that time. This means that if a species is moved to a new genus in 2020, all older determinations would retain the genus that species was in at the time they were made. No linking is done here (this is done elsewhere in a Taxon field). These point to very specific concepts of species and by updating the taxonomy of determinations we are erasing that information. I rarely make any assertion when transcribing a determination label into a record.

I'm open to seeing other ways of handling Taxon and Determinations other than how the CNC handles this but the linked dynamic names (Taxon) are very different from static determinations.

heathercole commented 3 years ago

@michellelocke, as noted above, different collections will have different approaches to this. It seems like there will be some in-depth requirements gathering on this topic.

It will be worth clarifying usages between 'verbatim species name from the label' and/or 'related taxon-names' and if CNC wants to export this type of data with known typos vs. accepted names.

I think this is a case of the same terms being used with different meanings. It will certainly be worth clarifying the related functionality requirements for all collections.

heathercole commented 3 years ago

It will perhaps be relevant to use the DwC terms here, referring to these as 'identifications' (instead of determinations).

With DwC, "identificationRemarks" may be the most relevant field to record corrected typos or missing info (like authority name) where someone has made an assertion.

https://dwc.tdwg.org/terms/#identification

DwC also differentiates between the identification fields which include the 'other' fields I mentioned above vs. the "taxon" fields relating to management of taxonomy https://dwc.tdwg.org/terms/#taxon
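As a small illustration of that split, the Python sketch below sorts a flat record into Darwin Core Identification terms and Taxon terms. The term names are real Darwin Core terms, but the two sets are only a partial selection, the `split_record` helper is hypothetical, and the record values are illustrative (the name, determiner, and year reuse the label example given earlier in this thread; the remark text is invented).

```python
# Partial selections of Darwin Core terms; see the DwC term list for the full sets.
IDENTIFICATION_TERMS = {
    "verbatimIdentification",
    "identifiedBy",
    "dateIdentified",
    "identificationQualifier",
    "identificationRemarks",
    "typeStatus",
}
TAXON_TERMS = {
    "scientificName",
    "taxonRank",
    "genus",
    "specificEpithet",
    "infraspecificEpithet",
    "taxonomicStatus",
    "acceptedNameUsage",
}


def split_record(record):
    """Separate a flat record into (identification fields, taxon fields)."""
    identification = {k: v for k, v in record.items() if k in IDENTIFICATION_TERMS}
    taxon = {k: v for k, v in record.items() if k in TAXON_TERMS}
    return identification, taxon


# Illustrative values only.
record = {
    "verbatimIdentification": "Dasysyrphus limatus",
    "identifiedBy": "M.M. Locke",
    "dateIdentified": "2021",
    "identificationRemarks": "Authority not given on label.",
    "scientificName": "Dasysyrphus limatus",
    "taxonRank": "species",
    "genus": "Dasysyrphus",
    "specificEpithet": "limatus",
}
identification, taxon = split_record(record)
```

In this sketch the label transcription and any asserted corrections stay on the identification side, while the parsed name and its rank sit on the taxon side.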

dshorthouse commented 3 years ago

I'd also add to @michellelocke's responses that these determinations are often hand-written & so transcription into a verbatim determination field is itself an assertion – all the more reason not to constrain entry through a choice (or choices) from pre-existing scientific names in a managed classification. There are countless examples of specific epithets with slightly variant spellings under the same generic vehicle & if our managed classification is not comprehensive, we're obfuscating identity for convenience-sake and we're the cause of bad science. What we need to aim for here is as much transparency as we can muster while faithfully representing what we see. If what we enter is different from what we see or what we've been given (or later opaquely & destructively drifts away from these), this is a show-stopper.

heathercole commented 3 years ago

@dshorthouse I think this one went off track. I reviewed it, but I think the discussion here is covered elsewhere, and it could be closed.

heathercole commented 2 years ago

How will the current 'auto-complete' for association to scientific names handle hybrids and cultivars? I don't see a place to test this in the current test implementation. Where will those names be stored to be re-used?