preliminary CM requirements for the DINA 'taxonomy resource'

heathercole commented 3 years ago

CMs met for over 2 hours this week to discuss the 'taxonomy resource' requirements. As development is moving so quickly, I wanted to share the overall outcome as soon as possible. I would be more than happy to break this up into more actionable pieces when we know better which direction this is going.

@dshorthouse @cgendreau @ssbilkhu

Please find a Specify demo video (link) and a CNCdb screenshot attached.

Here is the link; I shared it through a Teams app (sorry for the emails, it was weird). Please let me know if there is any issue accessing it: https://web.microsoftstream.com/video/5f102a95-5113-43ea-850e-562ecc3d69a1 (it is not in KW; KW won't stream this format, sorry).

In the video I demonstrate some of the core functionality required in the DINA Taxonomy 'resource', specifically the requirement for users to have very efficient access to view, edit and manipulate taxon records:

merge
move
synonymize
delete
edit
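As a rough illustration of what two of these operations involve at the data level, here is a minimal sketch of a parent-child taxon tree with `move` and `synonymize`. The class and field names (`Taxon`, `TaxonTree`, `accepted_id`) and the placeholder taxon names are assumptions made for this example; they are not taken from DINA or Specify.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class Taxon:
    """One node in a hypothetical parent-child taxonomy tree."""
    id: int
    name: str
    rank: str
    parent_id: Optional[int] = None
    accepted_id: Optional[int] = None  # set when this name is a synonym


class TaxonTree:
    def __init__(self) -> None:
        self.nodes: Dict[int, Taxon] = {}

    def add(self, taxon: Taxon) -> None:
        self.nodes[taxon.id] = taxon

    def children(self, taxon_id: int) -> List[Taxon]:
        return [t for t in self.nodes.values() if t.parent_id == taxon_id]

    def ancestors(self, taxon_id: int) -> List[int]:
        """Return ancestor ids from the immediate parent up to the root."""
        out: List[int] = []
        current = self.nodes[taxon_id].parent_id
        while current is not None:
            out.append(current)
            current = self.nodes[current].parent_id
        return out

    def move(self, taxon_id: int, new_parent_id: int) -> None:
        """Re-parent a taxon, refusing moves that would create a cycle."""
        if new_parent_id == taxon_id or taxon_id in self.ancestors(new_parent_id):
            raise ValueError("cannot move a taxon under itself or its own descendant")
        self.nodes[taxon_id].parent_id = new_parent_id

    def synonymize(self, synonym_id: int, accepted_id: int) -> None:
        """Mark one name as a synonym of another and re-home its children."""
        if synonym_id == accepted_id:
            raise ValueError("a name cannot be a synonym of itself")
        for child in self.children(synonym_id):
            child.parent_id = accepted_id
        self.nodes[synonym_id].accepted_id = accepted_id


# Placeholder names only; no real taxonomy is implied.
tree = TaxonTree()
tree.add(Taxon(1, "Family A", "family"))
tree.add(Taxon(2, "Genus A", "genus", parent_id=1))
tree.add(Taxon(3, "Genus B", "genus", parent_id=1))
tree.add(Taxon(4, "Genus A species 1", "species", parent_id=2))
tree.synonymize(2, 3)  # Genus A becomes a synonym of Genus B
```

Even this small sketch has to make a policy decision (children of a synonym are moved to the accepted name), which is the kind of rule each collection would need to confirm.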

In addition to the functionality demonstrated in the video, CMs require an informative view of their collection holdings as they relate to particular taxa. This includes a breakdown of holdings by country, an indication of whether there are type specimens, and other data that is relevant to particular collections (eg. the month-collected bar charts in the CNC screenshot).

As noted in the demo video, it would be optimal to have informative fonts/icons in the higher tree "view" which indicate the collection holdings without having to open the taxon record. For example, perhaps bold font to indicate there are physical holdings for that taxon, or small icons that could indicate:

images available
Type material
Canadian material
a count of related holdings.
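The indicators listed above could be derived from the specimen records linked to a taxon. The following is a minimal sketch under assumed names: the `Specimen` fields and `summarize_holdings` are hypothetical and do not reflect the DINA or CNCdb data model. It only counts specimens determined directly to the taxon, not to its descendants.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Union


@dataclass
class Specimen:
    """Hypothetical specimen record; field names are assumptions."""
    taxon_id: int
    country: str
    is_type: bool = False
    has_image: bool = False


def summarize_holdings(
    specimens: Iterable[Specimen], taxon_id: int
) -> Dict[str, Union[int, bool, Dict[str, int]]]:
    """Collect the per-taxon indicators a tree view could render as icons."""
    held = [s for s in specimens if s.taxon_id == taxon_id]
    return {
        "count": len(held),
        "by_country": dict(Counter(s.country for s in held)),
        "has_type": any(s.is_type for s in held),
        "has_images": any(s.has_image for s in held),
        "has_canadian": any(s.country == "Canada" for s in held),
    }


specimens = [
    Specimen(taxon_id=10, country="Canada", is_type=True),
    Specimen(taxon_id=10, country="Canada", has_image=True),
    Specimen(taxon_id=10, country="Mexico"),
    Specimen(taxon_id=11, country="Canada"),
]
summary = summarize_holdings(specimens, 10)
```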

CMs believe that it would be most effective to have one tree, so that the same taxonomic record could be linked to an identification in DAO, to a CNC record (insect found on plant), or to a host in CCFC/CPVC/GINCO.

However, no one will want to see the whole tree all the time, so different views may be required: managers would choose their default view of the tree but be able to select an alternate view if needed (eg. only show me taxonomy for which I have material).
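A view such as "only show me taxonomy for which I have material" amounts to pruning the tree to held taxa plus their ancestors. Here is a minimal sketch, assuming the tree is available as a simple child-to-parent mapping; `parent_of` and `held` are hypothetical inputs, not DINA structures.

```python
from typing import Dict, Optional, Set


def taxa_with_material(
    parent_of: Dict[int, Optional[int]], held: Set[int]
) -> Set[int]:
    """Return held taxa plus all their ancestors, so the pruned tree stays connected."""
    visible: Set[int] = set()
    for taxon_id in held:
        current: Optional[int] = taxon_id
        # Walk upward; stop early once we reach a branch already marked visible.
        while current is not None and current not in visible:
            visible.add(current)
            current = parent_of[current]
    return visible


# 1 is the root; 2 and 3 are its children; 4 sits under 2 and 5 under 3.
parent_of = {1: None, 2: 1, 3: 1, 4: 2, 5: 3}
visible = taxa_with_material(parent_of, held={4})
```

With only taxon 4 held, the view keeps 4 and its ancestors (2 and 1) and hides the branch through 3.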

Relatedly, there will need to be different permissions on different sections of the tree, so that a CNC student is not editing DAO data. This also relates to the requirement communicated in the demo for a flag on taxa added by certain user groups (eg. students).
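One simple reading of "permissions on sections of the tree" is that a collection owns a subtree, and a node inherits ownership from its nearest explicitly assigned ancestor. The sketch below assumes exactly that rule; `parent_of`, `owner_of` and both functions are hypothetical, and real requirements (shared taxa, moves between subtrees) would complicate it.

```python
from typing import Dict, Optional


def owning_collection(
    taxon_id: int,
    parent_of: Dict[int, Optional[int]],
    owner_of: Dict[int, str],
) -> Optional[str]:
    """Return the owner of the nearest explicitly assigned node, walking upward."""
    current: Optional[int] = taxon_id
    while current is not None:
        if current in owner_of:
            return owner_of[current]
        current = parent_of[current]
    return None  # no owner assigned anywhere on the path to the root


def can_edit(
    user_collection: str,
    taxon_id: int,
    parent_of: Dict[int, Optional[int]],
    owner_of: Dict[int, str],
) -> bool:
    """Allow edits only inside subtrees owned by the user's collection."""
    return owning_collection(taxon_id, parent_of, owner_of) == user_collection


parent_of = {1: None, 2: 1, 3: 1, 4: 2}
owner_of = {2: "DAO", 3: "CNC"}  # subtree roots assigned to collections
```

Note that under this rule, moving node 4 from under 2 to under 3 silently changes who may edit it, which is one reason the rule would need careful review.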

As noted in the video, taxonomy at all ranks will need to be supported, including hybrids and indications for 'cultivars'.

There is also a requirement to be able to use flags/tags/checkboxes to identify particular categories at the taxon level. In some cases, these will need to be visible on the MaterialSample/CollectionObject data views. Some are documented in #90; some are visible in the CNC screenshot. Each manager will need to declare what is needed.

CNC Taxonomy

dshorthouse commented 3 years ago

As a cautionary note, I once built a tree editor with two of the top developers in our community while we were in Woods Hole, MA, employed on a GlobalNames NSF grant. We also engaged professional developers in Boston for parts of it. The code still exists & if need be, I can walk people through it: https://github.com/GlobalNamesArchitecture/GNITE. Although that was 10+ years ago now & front-end technologies have advanced considerably, the logic to execute functions like drag/drop/edit/delete for nodes in a tree in a browser has not changed much since then. The processes & the PUT/POST/DELETE API calls + javascript libraries are much the same.

THE MOST challenging parts of this, & I cannot emphasize it enough, are merge & maintenance of synonymy relationships in the face of editing. Merge is the hardest nut to crack, & synonymy presents its own challenges. Linnean? Phylogenetic? Fuzzy matching? Scoring likelihoods? Uninomials/binomials/trinomials? Sensitivity to ranks? Authority? Name parsing? Botanical/Zoological vocabulary of synonymic relationships? Nested sets or just parent-child (i.e. performance, important if hybrids have multiple parents)?

We accommodated multiple trees in this utility through imports of Darwin Core Archive classifications because we recognized immediately that a single, consensus-based hierarchy is a nice ideal but challenging in practice; it requires a staging area to draw in comparator tree(s) so that you can then perform drag/drop from source to destination with confidence. Doing this solely in a single tree opens the door to irreparable catastrophe. There is nearly always a need for more than one tree. A storage hierarchy vs. a taxonomic hierarchy is one example. Then you need mechanisms to align the two, eg. "filed as this, determined as that vs known as this". We also had undo and redo. That was fun.

And, we also recognized that per-node roles & permissions were near impossible to execute with any fidelity because these restrictions must propagate down the tree, reaching nodes whose asserted roles/permissions are then in possible conflict with other traversals in ancestry. The taxon concepts implicit in the name of a node whose labels can be edited also mess with the dream of having roles & permissions.

The development of this took 2.5 years of solid work & it was cutting edge at the time (i.e. web sockets to eliminate race conditions between the server & all simultaneously active front-ends, plus message queues). Two and a half years. The bulk of that work was on the logic for merge functions & their inevitable need to resolve conflicts when nested child nodes risk being misaligned.

You could naturally argue that a tree editing interface for classification(s) is overkill. And it is. However, it was merely a means to an end, the end being maintenance of a classification in the face of external uncertainty, however presented: flattened-out, partial or an entire view. Any view is always the easy part, as are flags/tags on nodes.
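To make the child-alignment problem in merges concrete, here is a deliberately small sketch of a recursive merge that reports conflicts instead of resolving them. It is not GNITE's logic; the `Node` class, the matching rule (same name and same rank) and the placeholder names are all assumptions for illustration, and it ignores synonymy, authorship and undo entirely.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical tree node keyed by child name."""
    name: str
    rank: str
    children: Dict[str, "Node"] = field(default_factory=dict)


def merge(source: Node, target: Node, path: str = "") -> List[str]:
    """Merge source's children into target in place and return conflict paths."""
    conflicts: List[str] = []
    for name, child in source.children.items():
        here = f"{path}/{name}"
        match = target.children.get(name)
        if match is None:
            # No counterpart in the target: adopt the whole subtree.
            target.children[name] = child
        elif match.rank != child.rank:
            # Same name but different rank: leave for a human decision.
            conflicts.append(here)
        else:
            # Same name and rank: the subtrees must be aligned recursively.
            conflicts.extend(merge(child, match, here))
    return conflicts


source = Node("Genus A", "genus", {
    "sp1": Node("sp1", "species"),
    "sp2": Node("sp2", "species", {"x": Node("x", "subspecies")}),
})
target = Node("Genus B", "genus", {
    "sp2": Node("sp2", "species", {"x": Node("x", "variety")}),
    "sp3": Node("sp3", "species"),
})
conflicts = merge(source, target)
```

Here `sp1` is adopted cleanly, but the nested child `x` is a subspecies on one side and a variety on the other, so the merge stops at `/sp2/x` and reports it. Every such case needs a rule or a person to resolve it.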

So, knowing that we have limited time left to complete ALL of DINA & not just this taxonomy module, we will have to find compromises. Merge & node-level permissions are exceedingly complex, prone to a spectacular array of trickle-down issues. The more you pick at these particular wishes, the more they become really, really scary and hard to implement.

dshorthouse commented 3 years ago

Here is a more recent Ebbe Nielsen Challenge winner from a few years ago with the basics of what my team and I had 10+ years ago: https://gitlab.com/tomarashish/taxonomyCompare

heathercole commented 3 years ago

This requirement does NOT include the need to manage multiple taxon concepts for any given taxon. It was identified as important that the same taxon not be represented in multiple 'places', in order to ensure relevant relationships between taxa among collections. If there are alternate ways to accomplish that requirement, they are certainly worth discussion/review.

It is understood that there are lots of different ways that the required functionality can be met. The only functionality described in the overview above that is not already available to collection managers is a set of tools to more effectively search/review the taxonomy resource for accepted names and other taxon statuses (eg. restricted/CITES).

Here is how this requirement has been formally documented as a DINA requirement:

Collection managers must have the ability to manage taxonomy entries related to their holdings with a robust taxonomy database/tree/resource which:

a. connects to specimen records (eg. species name determinations/host/substrate)
b. supports bulk name imports
c. is editable (spelling AND hierarchy; eg. drag/drop options for move/merge)
d. allows bulk edits
e. supports synonymy
f. maintains correct nomenclature rules
g. can be exported
h. can be checked (not necessarily edited) against external lists to flag name biocontainment, status, restricted
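Item h can be pictured as a read-only lookup that attaches a flag to matching names without editing the taxonomy. This is a minimal sketch: the external list is stood in for by a local dictionary, the names are placeholders, and matching is exact after whitespace and case normalization only (no authorship parsing or fuzzy matching). `normalize` and `flag_names` are hypothetical helpers.

```python
from typing import Dict, Iterable


def normalize(name: str) -> str:
    """Collapse whitespace and case so trivially different spellings compare equal."""
    return " ".join(name.split()).casefold()


def flag_names(names: Iterable[str], external: Dict[str, str]) -> Dict[str, str]:
    """Return the flag for each local name found on the external list."""
    lookup = {normalize(listed): flag for listed, flag in external.items()}
    return {n: lookup[normalize(n)] for n in names if normalize(n) in lookup}


# Placeholder data standing in for an external restricted/CITES-style list.
external_list = {"Aus bus": "restricted"}
local_names = ["aus  bus", "Cus dus"]
flags = flag_names(local_names, external_list)
```

How a real external list is accessed, and whether it performs any reconciliation of its own, would still need to be established for each list.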

CMs will be happy to discuss and provide feedback if there are alternate implementations of these requirements to review.

Related, in case it isn't known, Specify is open-source:
Hard-install (Specify6): https://www.specifysoftware.org/products/specify-6/
Soft/browser implementation (Specify7): https://www.specifysoftware.org/products/specify-7/

dshorthouse commented 3 years ago

a. connects to specimen records (eg. species name determinations/host/substrate)
b. support bulk name imports

What format? Flat list? Hierarchical? Inclusive of synonymy? Linnean? Phylogenetic? With named ranks?

c. editable (spelling AND hierarchy; eg drag/drop options for move/merge)
d. bulk edits must be possible

Can you elaborate what a bulk edit in a taxonomy might look like?

e. support synonymy

Can you please enumerate the vocabulary of relationships, which might be constrained by Kingdom, and which might require special handling or additional logic?

f. maintains correct nomenclature rules

Can you please enumerate these for all Codes?

g. can be exported

In what format & what structure?

h. can be checked (not necessarily edited) against external lists to flag name biocontainment, status, restricted

Do those external lists have web services? What are the inputs? Do they do any reconciliation at their end?

heathercole commented 3 years ago

Yes, exactly; using the list as a starting point, more discussion is needed.

heathercole commented 3 years ago

The requirements can broadly be summarized as a hybrid between the functionality available in Specify (eg. review the demo video linked above) and the informative display features in the CNCdb.

Different managers will likely apply different rules/codes/standards to the parts of the resource relevant to their collection. A single source will not be possible.

As there are so many factors to review, a meeting on this topic would most likely be the most efficient way forward.

Related to:
https://github.com/AAFC-BICoE/dina-planning/issues/163
https://github.com/AAFC-BICoE/dina-planning/issues/90 (eg. external checks based on taxonomy and special status (sensitive/restricted/CITES))
https://github.com/AAFC-BICoE/dina-planning/issues/89

cgendreau commented 2 years ago

CNC Checklist will be published to GBIF to make it available in DINA. We are only waiting on metadata from management.

All the other collections can use taxonomic information for determination within https://data.catalogueoflife.org/ that is already available in DINA. The missing parts are:

an optional dataset selector for a specific dataset such as CNC #229
search to find all records according to X (e.g. all records from a specific family according to a specific Checklist) #230

If some taxa/names are missing, a checklist can be published with the missing/wanted content as CSV (assistance will be provided). Catalogue of Life understands the nomenclature rules, supports synonymy and allows collection managers to browse the taxonomy tree.
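For illustration, a small checklist CSV could be assembled as sketched below. The column names are standard Darwin Core taxon terms, but whether this exact layout is what ChecklistBank expects is an assumption that would need to be confirmed; the rows use placeholder names and `checklist_csv` is a hypothetical helper.

```python
import csv
import io
from typing import Dict, Iterable

# Darwin Core taxon terms; the choice and order of columns is an assumption.
FIELDS = [
    "taxonID",
    "scientificName",
    "taxonRank",
    "parentNameUsageID",
    "taxonomicStatus",
    "acceptedNameUsageID",
]


def checklist_csv(rows: Iterable[Dict[str, str]]) -> str:
    """Serialize rows to CSV text; keys missing from a row become empty cells."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


rows = [
    {"taxonID": "1", "scientificName": "Aus", "taxonRank": "genus",
     "taxonomicStatus": "accepted"},
    {"taxonID": "2", "scientificName": "Aus bus", "taxonRank": "species",
     "parentNameUsageID": "1", "taxonomicStatus": "accepted"},
    {"taxonID": "3", "scientificName": "Aus cus", "taxonRank": "species",
     "taxonomicStatus": "synonym", "acceptedNameUsageID": "2"},
]
text = checklist_csv(rows)
```

The third row shows how a synonym would point at its accepted name through `acceptedNameUsageID`.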

Browse plant taxonomy in Catalogue of Life:

https://data.catalogueoflife.org/dataset/2344/taxon/P

Browse a specific checklist:

VASCAN: https://data.catalogueoflife.org/dataset/2012/classification

ChecklistBank (the part of Catalogue of Life that accepts all of GBIF’s datasets as checklists), with the addition of specific checklists when required, fulfills all the requirements related to Determination (the collection management part of taxonomy).

heathercole commented 2 years ago

It isn't clear that the above update addresses the functionality requirements documented by CMs. Among other functions, there is the requirement for CMs to curate their own taxonomy resource with 'live' (immediate) addition/removal of species names. It is not feasible to have to do separate submissions of names to an external service whenever changes are needed, and there needs to be the capacity to remove names to avoid errors and clarify expected names. Tests with some of the current functionality do not address this.

It isn't clear how this approach will address bulk upload of data with new taxonomy, or with taxonomy with multiple occurrences in the resources being queried.

The core/preliminary requirements documented in this issue reflect current mandatory functionality available in current management systems that needs to be available to collection managers for effective management of their data.

Hopefully these required functions can be demonstrated soon.

Thanks.

AAFC-BICoE / dina-planning

preliminary CM requirements for the DINA 'taxonomy resource' #188