CatalogueOfLife / coldp


Add Taxon.sequenceIndex #42

Closed mdoering closed 3 years ago

mdoering commented 3 years ago

Add an optional Taxon.sequenceIndex field to support a taxonomic ordering of children with the same parent. Values should be integers only, sorted in natural order, i.e. smaller integers come first.

See https://github.com/CatalogueOfLife/backend/issues/906

mdoering commented 3 years ago

In ITIS there is apparently a similar field called taxonomic_units.phylo_sort_seq. Maybe it would be good for the name itself to indicate the taxonomic/phylogenetic sort ordering intention?

jliljeblad commented 3 years ago

In Dyntaxa we've added a comment in case you want to say in what way the children are sorted. I find a lot of non-experts think there is a nature-given systematic sort order, when in reality there are several ways to go about it.

mdoering commented 3 years ago

Is sequenceIndex the best name for such a term? It could be confused with gene sequences. Maybe siblingSortingIndex, as the sequence index only has to be sorted among siblings. The index does not need to be globally unique, i.e. the families F1-10 under order O1 can have indices 1-10, and the families F11-20 under order O2 can have the same indices 1-10. They are only used for sorting the children that share the same parent. Which brings me to childrenSequenceIndex, childrenSeqIndex or childrenSortingIndex?
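To illustrate the sibling-only scope described above, here is a minimal Python sketch. The row layout and the taxon IDs are hypothetical and not part of any ColDP definition; it only shows that indices may repeat across parents as long as they are compared within one parent's children.

```python
from collections import defaultdict

# Hypothetical rows: (taxon ID, parent ID, sibling sorting index).
# The indices 1 and 2 are reused under both parents on purpose.
taxa = [
    ("F2", "O1", 2),
    ("F11", "O2", 1),
    ("F1", "O1", 1),
    ("F12", "O2", 2),
]


def children_in_order(rows):
    """Group taxa by parent and sort each group by its sibling index."""
    by_parent = defaultdict(list)
    for taxon_id, parent_id, index in rows:
        by_parent[parent_id].append((index, taxon_id))
    return {
        parent: [taxon_id for _, taxon_id in sorted(children)]
        for parent, children in by_parent.items()
    }


ordered = children_in_order(taxa)
print(ordered)
```

The index is never compared across parents, so a source can number each group of children independently.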

dhobern commented 3 years ago

If the scope is purely below a single parent node, siblingSortingIndex is unambiguous. I would take sequenceIndex to refer to a numerical labeling of all nodes in the classification of the kind that allows any subtree to be retrieved using just the start and end indices associated with the chosen root node.
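A global labeling of this kind can be sketched in Python as follows. This is only an illustration under assumptions: the small classification, the function names and the choice to number nodes by depth-first traversal are mine, not something specified in this thread.

```python
# Hypothetical classification: parent -> children, already in display order.
tree = {
    "Lepidoptera": ["Pterophoridae", "Alucitidae"],
    "Pterophoridae": ["Stangeia", "Pterophorus"],
    "Alucitidae": ["Alucita"],
}


def label(root, tree):
    """Assign (start, end) to every node by depth-first traversal.

    start is the node's own index; end is the highest index used inside
    its subtree, so a node without children has start == end.
    """
    spans = {}
    counter = 0

    def visit(node):
        nonlocal counter
        counter += 1
        start = counter
        for child in tree.get(node, []):
            visit(child)
        spans[node] = (start, counter)

    visit(root)
    return spans


def subtree(node, spans):
    """Return the node and all its descendants, in sequence order."""
    start, end = spans[node]
    members = [n for n, (s, _) in spans.items() if start <= s <= end]
    return sorted(members, key=lambda n: spans[n][0])


spans = label("Lepidoptera", tree)
print(spans["Pterophoridae"])
print(subtree("Pterophoridae", spans))
```

Retrieving a subtree needs only the start and end values of the chosen root, but every index depends on the position of the node in the whole tree.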

mdoering commented 3 years ago

Do you think it is feasible to provide a global sequence index? It seems more difficult to produce, but nested set indices, for example, potentially provide that out of the box. A global index can also be used just for siblings, but the reverse is not true. A simple spreadsheet does give you a global index through its native row order, though. Maybe it is not that difficult to produce after all, and it becomes more useful if you can sort any descendants, not just the immediate children.

I am inclined to require a global index that allows NULL for unsorted records, e.g. if you only sort the higher classification but keep species out of the game.
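A minimal Python sketch of sorting with a nullable global index is shown below. Placing unsorted records after the indexed ones and falling back to the ID among them are assumptions made for the example; the thread does not say where NULL records should appear.

```python
# Hypothetical records: higher ranks carry an index, species are left unsorted.
records = [
    {"id": "sp2", "sequenceIndex": None},
    {"id": "fam2", "sequenceIndex": 20},
    {"id": "ord1", "sequenceIndex": 5},
    {"id": "sp1", "sequenceIndex": None},
    {"id": "fam1", "sequenceIndex": 10},
]


def sort_key(record):
    """Sort indexed records first by index, then unindexed records by ID."""
    index = record["sequenceIndex"]
    # False sorts before True, so records with an index come first.
    # The 0 is only a placeholder so that None is never compared to an int.
    return (index is None, index if index is not None else 0, record["id"])


ordered_ids = [r["id"] for r in sorted(records, key=sort_key)]
print(ordered_ids)
```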

dhobern commented 3 years ago

I'm quite a big fan of simple approaches like this to optimise use of trees. In my own taxonomic tree, which I use to manage images of specimens (https://stangeia.hobern.net/araba-bioscan-specimens), I have the taxon nodes associated with a start and end id (which are identical for nodes with no children). This makes data lookup so easy ... E.g.

mdoering commented 3 years ago

Well, one disadvantage of a single, global sequence is that when we assemble the tree from various sources we cannot simply copy the sequence of the sources. In fact it will become rather difficult to merge sequences from hundreds of sectors into a single one, and it would force us to recalculate and update the sequence for the entire dataset whenever we add just a single new name. That makes it very impractical to use for COL.

It seems we would be better off requiring unique keys only among siblings. If someone has a global sequence, that's still fine. And we can probably find ways to allow skipping of intermediate ranks while preserving the ordering of the skipped ranks, e.g. by adding a certain start value to each group of children that are now siblings, to make sure they sort like their skipped parents.
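One way the start-value idea could work is sketched below in Python. The function name, the data layout and the fixed `stride` of 1000 are assumptions for illustration only; any stride larger than the biggest child index would serve the same purpose.

```python
def flatten_indices(parent_index, children, stride=1000):
    """Compute sibling indices for children whose parents are skipped.

    parent_index: {parent ID: sibling index of the skipped parent}
    children:     list of (child ID, parent ID, sibling index under that parent)
    Each child gets parent_index * stride + own index, so children of an
    earlier parent always sort before children of a later one.
    """
    result = {}
    for child_id, parent_id, index in children:
        if not 0 <= index < stride:
            raise ValueError(f"index {index} of {child_id} does not fit the stride")
        result[child_id] = parent_index[parent_id] * stride + index
    return result


# Hypothetical example: families F1 and F2 are skipped, so their genera
# become direct children of the order.
families = {"F1": 1, "F2": 2}
genera = [("G3", "F2", 1), ("G1", "F1", 1), ("G4", "F2", 2), ("G2", "F1", 2)]

merged = flatten_indices(families, genera)
print(sorted(merged, key=merged.get))
```

The resulting values are still only meaningful among the new siblings, so no dataset-wide renumbering is needed.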