classification bulkloader specifications (and hierarchical editor, maybe)

dustymc commented 4 years ago

The taxonomy classification loader is non-functional and needs to be rebuilt. How should it work, and what should it be capable of?

https://github.com/ArctosDB/arctos/issues/2592 is a request for multiple terms of the same type (or "rank"), which isn't compatible with CSV.

Phylocode and cultural classifications may both deal in unranked terms, which isn't very compatible with CSV.

The old/dead loader used CSV based on https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxon_term, and then died every time someone changed a term (which was often).

The hierarchical editor exports classification-loader CSV (a human review step was deemed necessary), which imposes "must have type/rank" limitations on the tool. Given the ability for a collection to prefer smaller "sub-classifications" it might be possible to wire the hierarchical editor directly into certain classifications, which might change the big picture.

I don't know how to reconcile the various needs into a single tool, or even precisely what the scope of that discussion should be. I think group discussion would be useful; can we get this on the AWG agenda?

Jegelewicz commented 4 years ago

Maybe we should start off discussing this in Taxonomy Committee? We are meeting on Wednesday at 2PM MDT and we could focus on this issue.

dustymc commented 4 years ago

> Committee

I can make it if that would be useful.

Jegelewicz commented 4 years ago

Yes, that would be awesome!

dustymc commented 4 years ago

Meeting: for now, try to make something stable-ish that uses https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxon_term as headers

Unranked-ish taxa (eg mineral) will fake it with terms like rock_thing_1, rock_thing_2

Crazy idea: how hard would it be to get data into

scientific_name, noclass_term_1,noclass_term_type_1,...,noclass_term_10,noclass_term_type_10,class_term_1,class_term_type_1,...class_term_12,class_term_type_12

where

scientific_name is the key

noclass_term_1,noclass_term_type_1,...,noclass_term_10,noclass_term_type_10

are classification=no terms from the code table, for example:

noclass_term_1,noclass_term_type_1==>some value for preferred name, preferred_name
noclass_term_2,noclass_term_type_2==>some value for author, author_text

and class_term_1,class_term_type_1,...class_term_12,class_term_type_12

are classification=yes terms from the code table, for example:

class_term_1,class_term_type_1==>Animalia, kingdom
class_term_6,class_term_type_6==>SomeUnrankedTermInTheMiddleOfTheHierarchy, NULL
class_term_7,class_term_type_7==>SomeOtherUnrankedTermInTheMiddleOfTheHierarchy, NULL
class_term_10,class_term_type_10==>Some sub species, subspecies

The only real limitation would be on the depth of the hierarchy (or number of columns in CSV format), and that would be fairly simple to increase. This would handle

all ranked classification terms
all unranked classification terms
a mix of ranked and unranked classification terms

and provide a mechanism for multiple non-classification terms - 3 different values for taxon_status (https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxon_status), for example.

It would be possible to skip say xx_3 for a few records without superfamily data in order to keep all family in xx_4, which isn't QUITE as convenient as just having a column named "family" but perhaps close enough, given who uses this tool (very few) and how often (not much, and hopefully less given the new flexibility in source<-->collection links). The only important information in the column names would be "lower numbers are more parentish in the hierarchy."

Does that seem usable?
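
A minimal sketch of how one row in this layout might be read, assuming the column names proposed above. The helper name `parse_row` and the sample values are made up for illustration; this is not the actual loader code. It shows that blank slots can be skipped and that only the column number decides the parent-to-child order.

```python
import csv
import io

# Hypothetical sample: only three classification slots are present, with gaps
# in the numbering, and slot 6 has no term type (an unranked term).
SAMPLE = (
    "scientific_name,noclass_term_1,noclass_term_type_1,"
    "class_term_1,class_term_type_1,class_term_6,class_term_type_6,"
    "class_term_10,class_term_type_10\n"
    "Some name,some value for author,author_text,"
    "Animalia,kingdom,SomeUnrankedTerm,,Some sub species,subspecies\n"
)


def parse_row(row, prefix):
    """Collect (position, term, term_type) triples for one column family.

    `prefix` is "noclass" or "class". Blank terms are skipped, so gaps in the
    numbering are fine; a blank term type is returned as None (unranked).
    """
    found = []
    for column, value in row.items():
        head, _, number = column.rpartition("_")
        if head != f"{prefix}_term" or not number.isdigit():
            continue  # not a "<prefix>_term_<n>" column
        term = (value or "").strip()
        if not term:
            continue
        term_type = (row.get(f"{prefix}_term_type_{number}") or "").strip() or None
        found.append((int(number), term, term_type))
    # Lower numbers are more parentish in the hierarchy.
    return sorted(found, key=lambda triple: triple[0])


row = next(csv.DictReader(io.StringIO(SAMPLE)))
classification = parse_row(row, "class")
non_classification = parse_row(row, "noclass")
```

Here `classification` holds the kingdom, the unranked term, and the subspecies in that order, even though slots 2 to 5 and 7 to 9 are absent.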

campmlc commented 4 years ago

Sounds interesting! Let's try it?

Jegelewicz commented 4 years ago

> It would be possible to skip say xx_3 for a few records without superfamily data in order to keep all family in xx_4, which isn't QUITE as convenient as just having a column named "family" but perhaps close enough, given who uses this tool (very few) and how often (not much, and hopefully less given the new flexibility in source<-->collection links). The only important information in the column names would be "lower numbers are more parentish in the hierarchy."

Would you HAVE to skip anything? If we are assigning a term type (Family), then does it matter what level of the hierarchy it is in? Searches could then be across the whole damn thing for any term types = Family plus whatever "family" you wanted? Is that a thing?

Jegelewicz commented 4 years ago

Could we do this in test any time soon?

dustymc commented 4 years ago

> HAVE to skip anything?

No. I believe this would support the full Arctos model of "whatever you want wherever you want it" but at limited depth. Fill in every cell, every cell but the ranks, just xx_86 (which will end up at the top of the hierarchy) and xx_100 (child of xx_86), whatever.

> does it matter what level of the hierarchy it is in

Very much no. Family-->kingdom-->family-->NULL-->family-->subspecies-->family fits in both the core Arctos model and this. That's necessary in the core model - I'm no taxonomist, if one says it's valid (eg, by loading their data to GlobalNames), who am I to argue? I'm not so sure it's desirable in a local management tool, but we're careful who has manage_taxonomy so....
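
To illustrate that rank and position are independent, here is a small sketch. The `(position, term, term_type)` layout, the function name `terms_of_type`, and every value are hypothetical. Finding "all family terms" only needs the term type, wherever in the hierarchy it sits.

```python
# Hypothetical hierarchy: position fixes parent-to-child order only; the same
# term type may appear more than once, and None marks an unranked term.
hierarchy = [
    (4, "SecondFamily", "family"),
    (1, "Animalia", "kingdom"),
    (3, "UnrankedThing", None),
    (2, "FirstFamily", "family"),
]


def terms_of_type(hierarchy, term_type):
    """Return every term with the given type, ordered parent to child."""
    ordered = sorted(hierarchy, key=lambda triple: triple[0])
    return [term for _, term, found_type in ordered if found_type == term_type]


families = terms_of_type(hierarchy, "family")
unranked = terms_of_type(hierarchy, None)
```

Nothing in this sketch rejects the second family assertion, which matches the "feed Arctos, don't enforce consistency" scope described in this thread.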

> test any time soon

I think so - as long as this is just feeding Arctos and not trying to be something it can't (eg, enforce consistency or avoid 12 family assertions in a single classification) it should be pretty fast to develop.

dustymc commented 4 years ago

Sounds like "go" to me....

I added https://handbook.arctosdb.org/documentation/taxonomy.html#usage-of-sources in part to see if I could find a reason that this approach won't work. I can't find that - as long as WHATEVER can be set to export in this format, or something that can be transformed to this format, I think it'll work. (That could turn into a Phase Two "transmogrify SOMETOOL exports to Arctos import format" tool.)
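
A rough sketch of what such a "transmogrify" step could look like, assuming the other tool exports one column per rank. The function name `to_loader_row`, the input layout, and the rank order passed in are all assumptions for illustration, not an existing tool.

```python
def to_loader_row(record, rank_order):
    """Turn a rank-per-column record into positional term/term_type pairs.

    `record` is a dict with "scientific_name" plus one key per rank;
    `rank_order` lists those rank keys from parent to child. Empty ranks are
    dropped and the remaining terms are numbered consecutively.
    """
    out = {"scientific_name": record["scientific_name"]}
    position = 0
    for rank in rank_order:
        value = (record.get(rank) or "").strip()
        if not value:
            continue  # no data for this rank; do not emit an empty pair
        position += 1
        out[f"class_term_{position}"] = value
        out[f"class_term_type_{position}"] = rank
    return out


converted = to_loader_row(
    {"scientific_name": "Some name", "kingdom": "Animalia", "family": "", "genus": "Somegenus"},
    ["kingdom", "family", "genus"],
)
```

Numbering consecutively rather than reserving a slot per rank relies on the point made earlier in the thread that nothing has to be skipped; only the relative order matters.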

dustymc commented 4 years ago

The "component" (this isn't a 'component,' naming stuff is HARD) loader template is amazing; I just used the new classification loader to create http://test.arctos.database.museum/name/Sorex%20cinereus#Arctos

Demonstrated:

table-level rules are unavoidable; nomenclatural_code=imadeitup (and similar) successfully failed
this will create multiple non-classification terms of identical "rank", which is sometimes necessary
this will create mixed (ranked and unranked) hierarchies, which is necessary for things like phylocode
this will create utter nonsense at very large scales, which may be evidence that we need to make source functional so that we can control what can be loaded (eg, this tool used against the "Arctos" classification could turn catastrophic very quickly)
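
The first point (table-level rules reject bad values) can be sketched as a pre-load check. This is only an illustration: the real rules are enforced by the database, and the `CONTROLLED` dictionary, its two example codes, and the function name are assumptions standing in for the actual code tables.

```python
# Hypothetical stand-in for controlled vocabularies; the real accepted values
# live in the Arctos code tables, not in this dictionary.
CONTROLLED = {"nomenclatural_code": {"ICZN", "ICBN"}}


def check_noclass_terms(terms):
    """Return an error message for each (term, term_type) that breaks a rule.

    Term types without a controlled vocabulary are accepted as-is.
    """
    errors = []
    for term, term_type in terms:
        allowed = CONTROLLED.get(term_type)
        if allowed is not None and term not in allowed:
            errors.append(f"{term_type}={term} is not an accepted value")
    return errors


problems = check_noclass_terms([
    ("imadeitup", "nomenclatural_code"),
    ("some value for author", "author_text"),
])
```

A check like this only mirrors rules the table already enforces; it would not prevent the "utter nonsense at very large scales" case, which is about what gets loaded into which source.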

This tool, like the old one, replaces source-at-name data; http://test.arctos.database.museum/name/Copper#ArctosMinerals (multiple classifications in a single source-at-name) is not possible.

It's currently capable of handling 6 non-classification term-pairs and 10 classification; I can extend that relatively easily, but doing so "requires" (is much easier when, anyway) the table being empty. If anyone has data requiring more terms, this is a good time to implement.
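
For anyone building a file by hand, a header for the current limits could be generated along these lines. This assumes the loader's columns follow the naming proposed earlier in this thread, which should be checked against the real template; `loader_header` is a made-up helper.

```python
def loader_header(n_noclass=6, n_class=10):
    """Build the column list: the key, then term/term_type pairs per family.

    Defaults match the limits stated above (6 non-classification pairs and
    10 classification pairs); column names are assumed from the proposal.
    """
    columns = ["scientific_name"]
    for prefix, count in (("noclass", n_noclass), ("class", n_class)):
        for i in range(1, count + 1):
            columns.append(f"{prefix}_term_{i}")
            columns.append(f"{prefix}_term_type_{i}")
    return columns


header = loader_header()
header_line = ",".join(header)
```

With the defaults this gives 33 columns: the key plus two columns for each of the 16 pairs.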

Someone break it, please.

Jegelewicz commented 4 years ago

I will test this - but it might be tomorrow....

/remind me to do this tomorrow

reminders[bot] commented 4 years ago

@Jegelewicz set a reminder for Sep 22nd 2020

reminders[bot] commented 4 years ago

:wave: @Jegelewicz, do this

ArctosDB / arctos

classification bulkloader specifications (and hierarchical editor, maybe) #3110