Closed (dustymc closed 4 years ago)
Maybe we should start off discussing this in Taxonomy Committee? We are meeting on Wednesday at 2PM MDT and we could focus on this issue.
I can make it if that would be useful.
Yes, that would be awesome!
Meeting: for now, try to make something stable-ish that uses https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxon_term as headers
Unranked-ish taxa (eg mineral) will fake it with terms like rock_thing_1, rock_thing_2
Crazy idea: how hard would it be to get data into
scientific_name,noclass_term_1,noclass_term_type_1,...,noclass_term_10,noclass_term_type_10,class_term_1,class_term_type_1,...,class_term_12,class_term_type_12
where scientific_name is the key,
noclass_term_1,noclass_term_type_1,...,noclass_term_10,noclass_term_type_10
are classification=no terms from the code table, for example:
noclass_term_1,noclass_term_type_1 ==> preferred_name, some value for preferred name
noclass_term_2,noclass_term_type_2 ==> author_text, some value for author
and class_term_1,class_term_type_1,...,class_term_12,class_term_type_12
are classification=yes terms from the code table, for example:
class_term_1,class_term_type_1 ==> kingdom, Animalia
class_term_6,class_term_type_6 ==> NULL, SomeUnrankedTermInTheMiddleOfTheHierarchy
class_term_7,class_term_type_7 ==> NULL, SomeOtherUnrankedTermInTheMiddleOfTheHierarchy
class_term_10,class_term_type_10 ==> subspecies, Some sub species
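To make the proposed layout concrete, here is a minimal Python sketch of flattening one name into that wide row. It is an illustration, not the Arctos loader: the function names are hypothetical, and it assumes the column named `*_term_type_N` holds the type (or nothing, for an unranked term) and `*_term_N` holds the value, which is a guess since the examples above list the type first.

```python
import csv
import io

N_NOCLASS = 10  # pair counts taken from the proposal in this thread
N_CLASS = 12


def header():
    """Column names: the key first, then numbered (term, term_type) pairs."""
    cols = ["scientific_name"]
    for prefix, n in (("noclass", N_NOCLASS), ("class", N_CLASS)):
        for i in range(1, n + 1):
            cols += [f"{prefix}_term_{i}", f"{prefix}_term_type_{i}"]
    return cols


def to_row(scientific_name, noclass, classification):
    """Flatten lists of (term_type, value) pairs into one wide row.

    A term_type of None stands for an unranked term (NULL in the thread).
    Assumption: *_term_N holds the value and *_term_type_N holds the type.
    """
    if len(noclass) > N_NOCLASS or len(classification) > N_CLASS:
        raise ValueError("more terms than available columns")
    row = {"scientific_name": scientific_name}
    for prefix, pairs in (("noclass", noclass), ("class", classification)):
        for i, (term_type, value) in enumerate(pairs, start=1):
            row[f"{prefix}_term_{i}"] = value
            row[f"{prefix}_term_type_{i}"] = term_type or ""
    return row


def to_csv(rows):
    """Serialize rows; columns a row does not use are written as empty cells."""
    buf = io.StringIO()
    writer = csv.DictWriter(
        buf, fieldnames=header(), restval="", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()
```

Because the header is generated from the two pair counts, deepening the hierarchy in this sketch only means changing `N_CLASS`.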
The only real limitation would be on the depth of the hierarchy (or number of columns in CSV format), and that would be fairly simple to increase. This would handle
- all ranked classification terms
- all unranked classification terms
- a mix of ranked and unranked classification terms
and provide a mechanism for multiple non-classification terms - 3 different values for taxon_status (https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxon_status), for example.
It would be possible to skip, say, xx_3 for a few records without superfamily data in order to keep all family in xx_4, which isn't QUITE as convenient as just having a column named "family" but perhaps close enough, given who uses this tool (very few) and how often (not much, and hopefully less given the new flexibility in source<-->collection links). The only important information in the column names would be "lower numbers are more parentish in the hierarchy."
Does that seem usable?
Sounds interesting! Let's try it?
It would be possible to skip say xx_3 for a few records without superfamily data in order to keep all family in xx_4, which isn't QUITE as convenient as just having a column named "family" but perhaps close enough, given who uses this tool (very few) and how often (not much, and hopefully less given the new flexibility in source<-->collection links). The only important information in the column names would be "lower numbers are more parentish in the hierarchy."
Would you HAVE to skip anything? If we are assigning a term type (Family), then does it matter what level of the hierarchy it is in? Searches could then be across the whole damn thing for any term types = Family plus whatever "family" you wanted? Is that a thing?
Could we do this in test any time soon?
HAVE to skip anything?
No. I believe this would support the full Arctos model of "whatever you want wherever you want it" but at limited depth. Fill in every cell, every cell but the ranks, just xx_86 (which will end up at the top of the hierarchy) and xx_100 (child of xx_86), whatever.
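A rough Python sketch of the "fill in whatever cells you want" behaviour described above: only the numeric suffix decides order, and empty cells are ignored. The function name and the dict-per-row input are assumptions for illustration, as is the convention that `class_term_N` holds the value and `class_term_type_N` the type.

```python
import re

# Matches class_term_<n> but not class_term_type_<n>.
_VALUE_COL = re.compile(r"^class_term_(\d+)$")


def ordered_classification(row):
    """Return [(position, term_type_or_None, value)], parent first.

    Empty cells are skipped, so gaps in the numbering are harmless.
    """
    terms = []
    for col, value in row.items():
        m = _VALUE_COL.match(col)
        if not m or not (value or "").strip():
            continue
        pos = int(m.group(1))
        term_type = (row.get(f"class_term_type_{pos}") or "").strip() or None
        terms.append((pos, term_type, value.strip()))
    # Lower numbers are "more parentish", so sort on the position alone.
    return sorted(terms, key=lambda t: t[0])
```

With only `class_term_86` and `class_term_100` filled in, this yields the xx_86 term first and the xx_100 term as its child, whatever order the columns appear in.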
does it matter what level of the hierarchy it is in
Very much no. Family-->kingdom-->family-->NULL-->family-->subspecies-->family fits in both the core Arctos model and this. That's necessary in the core model - I'm no taxonomist, if one says it's valid (eg, by loading their data to GlobalNames), who am I to argue? I'm not so sure it's desirable in a local management tool, but we're careful who has manage_taxonomy so....
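The "search by term type, wherever it sits" idea could look something like the Python sketch below. It is only an illustration under assumptions: the function name is made up, and a classification is represented as a parent-first list of (term_type, value) pairs with None for unranked terms.

```python
def terms_of_type(pairs, wanted_type):
    """Return every (position, value) whose term type matches, at any depth.

    `pairs` is a parent-first list of (term_type_or_None, value).
    Matching is case-insensitive, so "Family" finds "family".
    """
    wanted = wanted_type.lower()
    return [
        (pos, value)
        for pos, (term_type, value) in enumerate(pairs, start=1)
        if term_type is not None and term_type.lower() == wanted
    ]
```

For the family-->kingdom-->family-->NULL-->family-->subspecies-->family example, asking for "Family" returns the terms at positions 1, 3, 5 and 7; nothing in the sketch cares how many there are or where they sit.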
test any time soon
I think so - as long as this is just feeding Arctos and not trying to be something it can't (eg, enforce consistency or avoid 12 family assertions in a single classification) it should be pretty fast to develop.
Sounds like "go" to me....
I added https://handbook.arctosdb.org/documentation/taxonomy.html#usage-of-sources in part to see if I could find a reason that this approach won't work. I can't find that - as long as WHATEVER can be set to export in this format, or something that can be transformed to this format, I think it'll work. (That could turn into a Phase Two "transmogrify SOMETOOL exports to Arctos import format" tool.)
The "component" (this isn't a 'component,' naming stuff is HARD) loader template is amazing, I just used the new classification loader to create http://test.arctos.database.museum/name/Sorex%20cinereus#Arctos
Demonstrated:
This tool, like the old one, replaces source-at-name data; http://test.arctos.database.museum/name/Copper#ArctosMinerals (multiple classifications in a single source-at-name) is not possible.
It's currently capable of handling 6 non-classification term-pairs and 10 classification; I can extend that relatively easily, but doing so "requires" (is much easier when, anyway) the table being empty. If anyone has data requiring more terms, this is a good time to implement.
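For anyone checking whether their data fits before loading, a pre-flight check along these lines might help. This is a hypothetical Python helper, not part of the loader; the two limits are simply the numbers quoted above.

```python
MAX_NOCLASS_PAIRS = 6  # limits quoted in the comment above
MAX_CLASS_PAIRS = 10


def overflow(noclass_terms, class_terms):
    """Return how many term-pairs of each kind exceed the current limits."""
    return {
        "noclass": max(0, len(noclass_terms) - MAX_NOCLASS_PAIRS),
        "class": max(0, len(class_terms) - MAX_CLASS_PAIRS),
    }
```

Any non-zero count would be a reason to ask for more columns while the table is still empty.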
Someone break it, please.
I will test this - but it might be tomorrow....
/remind me to do this tomorrow
@Jegelewicz set a reminder for Sep 22nd 2020
:wave: @Jegelewicz, do this
The taxonomy classification loader is non-functional and needs to be rebuilt. How should it work, and what should it be capable of?
https://github.com/ArctosDB/arctos/issues/2592 is a request for multiple terms of the same type (or "rank"), which isn't compatible with CSV.
Phylocode and cultural classifications may both deal in unranked terms, which isn't very compatible with CSV.
The old/dead loader used CSV based on https://arctos.database.museum/info/ctDocumentation.cfm?table=cttaxon_term, and then died every time someone changed a term (which was often).
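One way to see why term-named headers struggle with repeated types: Python's `csv.DictReader` keys each row by header name, so two "family" columns collapse into one. This is only an illustration of the general problem; the sample data is invented, and nothing here is a claim about how the old loader was implemented.

```python
import csv
import io


def read_rank_headed(text):
    """Read CSV whose headers are the term types themselves."""
    return list(csv.DictReader(io.StringIO(text)))


# Invented sample: one name with two terms of type "family".
sample = "scientific_name,family,family\nSome name,FirstFamily,SecondFamily\n"
rows = read_rank_headed(sample)
# rows[0]["family"] is "SecondFamily"; "FirstFamily" is silently lost.
```

An unranked term has the mirror-image problem, since it has no type to use as a header in the first place.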
The hierarchical editor exports classification-loader CSV (a human review step was deemed necessary), which imposes "must have type/rank" limitations on the tool. Given the ability for a collection to prefer smaller "sub-classifications" it might be possible to wire the hierarchical editor directly into certain classifications, which might change the big picture.
I don't know how to reconcile the various needs into a single tool, or even precisely what the scope of that discussion should be. I think group discussion would be useful; can we get this on the AWG agenda?