ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Allow variably ranked acceptedness of identifications (was: order identifications on catalog record) #3540

Closed Jegelewicz closed 1 year ago

Jegelewicz commented 3 years ago

Issue Documentation is http://handbook.arctosdb.org/how_to/How-to-Use-Issues-in-Arctos.html

Is your feature request related to a problem? Please describe. https://arctos.database.museum/guid/UAM:EH:0610-5898 has a number of IDs that mix cultural and biological taxonomic designations; I’d like to have the cultural ones go first, then biological

Describe what you're trying to accomplish Organize various identifications on a catalog record

Describe the solution you'd like Allow collections to create an order in manage collection (cultural, biological, mineral for example)

Describe alternatives you've considered Allow manual ordering of each set of ids - might be nice, but also might be time intensive as opposed to a general rule

Additional context Add any other context or screenshots about the feature request here.

Priority Please assign a priority-label. Unprioritized issues get sent into a black hole of despair.

AJLinn commented 3 years ago

Thanks for creating this issue @Jegelewicz. Here's what I have listed in my 'manage collection' section for taxonomy ordering also - seems like there should be some connection between the two?

Screen Shot 2021-03-25 at 6 55 51 AM

I wonder if it's possible to have these with a "drag row here" box like with agents in the catalog record?

Screen Shot 2021-03-25 at 6 59 41 AM

This is helpful for ordering the sequence of ownership for objects. I'd like to do the same with the taxonomy related to the biological materials our items are made from, based on the abundance of the material on the piece. For this fish skin parka, I'd want "Parka" first, then "salmon" "caribou" and lastly "alder".

dustymc commented 3 years ago

Allow collections to create an order in manage collection (cultural, biological, mineral for example)

"Impossible" isn't the right word, but it might be close enough - that would be incredibly complicated, and I don't think there's any way, no matter what resources we had, to make it predictable, at least not without completely breaking our taxonomy model.

Allow manual ordering of each set of ids -

That's not much problem, but it's significant development. It would support things like "based on the abundance of the material on the piece." It would also provide a nonmagical path to "cultural first" (even when the cultural item uses a name that was created for biology and is managed in some mineral-centric classification) - users could just put it there.

It might be a workload increase - that would need discussed-->planned-->understood before proceeding.

It also provides an elegant solution to the "more than one accepted" thing that comes up every now and then. For things that expect binary, order=1 could be treated as accepted and everything else as unaccepted. For those willing to embrace the gradient, you could have

  1. Parka (what we currently think is most relevant)
  2. fish (relevant, just less-so because we're not really fish-people)
  3. weeds
  4. socks (someone thought this at some point, we think they were wrong so they're at the bottom of the list, but maybe they were actually on to something and we just can't grok it yet so here it is, not "unaccepted" just down here out of the way)
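The two readings of an ordered list described above (binary for consumers that expect exactly one accepted ID, gradient for everyone else) could be sketched roughly as follows. This is only an illustration: the `Identification` class, its `order` field, and both helper functions are hypothetical names, not Arctos code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Identification:
    name: str
    order: int  # hypothetical per-record position; 1 sorts first


def gradient_view(ids):
    """Return identification names in curator-assigned order."""
    return [i.name for i in sorted(ids, key=lambda i: i.order)]


def binary_view(ids):
    """Collapse the ordering for consumers that expect accepted/unaccepted."""
    return {i.name: ("accepted" if i.order == 1 else "unaccepted") for i in ids}


parka_ids = [
    Identification("socks", 4),
    Identification("Parka", 1),
    Identification("weeds", 3),
    Identification("fish", 2),
]
```

Under these assumptions, `gradient_view(parka_ids)` lists Parka, fish, weeds, socks, while `binary_view(parka_ids)` marks only Parka as accepted.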

I think I like it, but I'm not sure I'd want to use it if I just had a dead rat (complicated model, simple data), nor if I had https://arctos.database.museum/guid/MSB:Mamm:55245 (lots of about-equally-important IDs). That's possibly "just UI" but I'm not sure I see how just yet.

AJLinn commented 3 years ago

I think I like it,

Cool. I guess let's put it on the "wish list" and think/talk about how it might be developed and/or used by various collections. The "more than one accepted" thing is an interesting one for cultural collections when we have the classic "unclassifiable object" and the associated "unidentified object" where we could add some options that people have guessed about but have not been confirmed.

Screen Shot 2021-03-25 at 8 06 26 AM

campmlc commented 3 years ago

"More than one accepted ID" is also needed for environmental samples, and ordering them by abundance or GenBank ID confidence, for example, would be useful.

dustymc commented 3 years ago

also needed for environmental samples

You can use A {string} with identification="some goo" and associated taxa={list of stuff that PCRed out}, but....

ordering them by...

That does need something more than the A {string} can provide - I think it's about the same as "have guessed about but have not been confirmed" - the full ID (confidence, sensu, determiners, etc.) for each "entity" would definitely be more informative.

I suppose one option for things like https://arctos.database.museum/guid/MSB:Mamm:55245 would just be to make the ordering a non-unique integer, so you could...

1. Something (we like this one for some reason)
2. Something Else (we like this one a bit less than (1))
2. More Something Else (we like this one a bit less than (1))
2. Even More Something Else (we like this one a bit less than (1))
2. So Much Something Else (we like this one a bit less than (1))
3. This is third, it could have unsorted friends
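A non-unique integer like this amounts to grouping identifications into tiers. A minimal Python sketch, assuming each ID is a hypothetical `(rank, name)` pair, might look like the following. The alphabetical tiebreak inside a tier is an assumption made only to keep the output deterministic; the proposal itself leaves same-rank IDs unsorted.

```python
from itertools import groupby


def group_by_rank(ids):
    """Group (rank, name) pairs into tiers; IDs with equal rank share a tier."""
    ordered = sorted(ids)  # sorts by rank, then by name
    return [
        (rank, [name for _, name in tier])
        for rank, tier in groupby(ordered, key=lambda pair: pair[0])
    ]


ids = [
    (2, "Something Else"),
    (1, "Something"),
    (3, "This is third"),
    (2, "More Something Else"),
]
tiers = group_by_rank(ids)
```

Here `tiers` has three entries: one ID at rank 1, two sharing rank 2, and one at rank 3.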

Still plenty of complexity to sort out (I'm working on the citation bulkloader, which creates identifications, at the moment - it would be affected by this) but this is starting to feel realistic.

Nicole-Ridgwell-NMMNHS commented 3 years ago

This could also be useful in paleo where you have a slab of rock with, say, 200 fossils from 10 different taxa, where it would be very complicated to individually catalog things.

campmlc commented 3 years ago

I'd like to request that this be functional in the next couple of weeks, to test with a possible environmental sample for an SPNHC talk.

dustymc commented 3 years ago

slab of rock with, say, 200 fossils from 10 different taxa

I'm not saying I wouldn't use it that way, but that's not cataloging the item of scientific interest either; those are really distinct THINGS (that just happen to be difficult to separate). Cataloging them is pretty trivial - similar things happen in insect collections on a regular basis - but associating the correct catalog record with a tiny speck on the rock might be another story. I don't have any better ideas (maybe the paleo pollen folks or similar do??), but the implications (eg, poor links between THAT teensy fossil and the literature) should be understood before using this approach.

couple of weeks

I'm not sure that's at all realistic. Most obviously, there are a ton of forms (adding IDs, bulkloading citations, etc.) where inserting an "accepted" record makes everything else "unaccepted." If that's no longer binary then all of those forms have to be redesigned. (So would everything that uses the binary to find the "best" ID, but that should mostly be behind the scenes.) I have no idea at all what that might look like; I need input from The Community.

The simplest use case is probably adding an identification to the "detail page" - what should happen there? Maybe the answer to that can help guide more complex and less interactive situations.

I suspect the most likely problem (if we go with https://github.com/ArctosDB/arctos/issues/3540#issuecomment-807135914) will be the intersection of multiple equally-highest-ranked IDs and flattening - given 12 "most-preferred" IDs, what gets stashed in FLAT or sent to GBIF?

And all that said, I'm not sure I understood the initial request. After this afternoon's webinar, I think maybe we just need to order the taxa used in A {string} IDs? Are we talking about these - three IDs are in the screenshot...

Screen Shot 2021-04-13 at 1 26 10 PM

...or these 5 taxa used in a single identification?

Screen Shot 2021-04-13 at 1 28 29 PM

Jegelewicz commented 3 years ago

Unfortunately, I think we are talking about both....but this issue was started based upon @AJLinn 's request in the webinar so it should focus on

these 5 taxa used in a single identification?

Screen Shot 2021-04-13 at 1 28 29 PM

Jegelewicz commented 3 years ago

Angie wants the cultural name(s) to appear first as they are more important to her.

campmlc commented 3 years ago

Yes, both would be highly useful but solve Angie's first.

dustymc commented 3 years ago

we are talking about both

We could:

"A and B and C and ...." have one place for metadata, and would require new formulae for every number of involved taxa.

I introduced the 'many taxa on A {string} IDs' thing because it was easy (just drop a unique composite key), we didn't have cultural taxonomy at the time, and we were/are limited to one accepted. It gets the important bits across, but it has pretty severe limitations - eg does the ID metadata apply to all of the taxa, or some average of everything involved, or ???? Is the parka expert who identified the thing really equally comfortable with phocids and plants and minerals and whatever else is included? There's no structured way to say, even if we do somehow figure out a way to add ordering without conflicting with the "core" mechanism for ordering (taxa formula).

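Multiple IDs has none of those restrictions. The parka's IDs would include

  • parka: id by {parka expert} using {whatever method they use} sensu {some publication}, on {date}, confidence high, rank 1
  • Erignathus barbatus x Phoca hispida (it happens!) by {seal person} using {PCR} confidence high, rank 2
  • Alnus rubra by {botanist} using {plant magic}....., rank 2
  • Copper by {student} using "looks like copper" confidence low, rank 3
  • etc.

That should be interpreted as "mostly parka, then seal and alder, then copper." It contains the same taxa information as the A {string} ID (assuming the seal isn't a hybrid &etc., which isn't possible in the A {string} multitaxa approach), but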

  1. there's ordering, and
  2. there's a place for metadata with the "component IDs"

Potentially not-so-great (or really great, depending on your viewpoint!) features include:

  1. It would be a bit more "verbose" to use - but I don't think it would need to REPLACE the multitaxa A {string}, so you'd only need to deal with the complexity when you have complex data.
  2. It would be a bit more complex to display/understand; it's multiple complex data objects which probably can't readily be squished into a nice compact human-readable string.

We'd first need to get past the hurdles outlined above - eg, what happens when you {do thing that currently depends on the binary nature of acceptedness}, how exactly does this get displayed in tabular format, etc.?

@AJLinn does that approach work for you?

Jegelewicz commented 2 years ago

Create two "accepted" identifications - one cultural, one biological and we need to figure out how to get the cultural ID at the top. We could use this as a test case - @AJLinn want to play?

@Nicole-Ridgwell-NMMNHS maybe you have an example or two that we can play around with?

Nicole-Ridgwell-NMMNHS commented 2 years ago

maybe you have an example.

I don't know if I have anything right off the bat where ordering matters much, but for greater than 2 accepted IDs needed, here is a good example: https://arctos.database.museum/guid/NMMNH:Paleo:14152 needs four accepted IDs, Evazoum sirigui, Grallator cursorius, Tetrapoda, and Theropoda

Jegelewicz commented 2 years ago

@dustymc can we work with Nicole's example above?

dustymc commented 2 years ago

needs four accepted IDs

Five: don't forget the rock!

I think https://github.com/ArctosDB/arctos/issues/3540#issuecomment-819673955 starting at "Multiple IDs.. " does that.

(So does the first part of that comment, a new taxa formula, as long as you're OK with the identification being formulaic - "Evazoum sirigui and Grallator cursorius and Tetrapoda and Theropoda and rock." That would take a code table request and a fair bit of code to deal with more than 2 names in an ID.)

"rank" (aka order) doesn't necessarily have to be unique, so 4 unsorted (NULL rank, same rank, IDK, needs ironed out if we proceed) accepted IDs (and/or 54 sorted unaccepted, or any crazy mix thereof) should work well enough.

An Entity-ish alternative would be to call the current record (or a new one in the entity collection or whatever) "rock with lots of tracks" and then catalog the tracks individually and link them to the "parent." I think that's probably a bit more sciencey and a bit less representative of how things are actually cataloged and stored, and I don't think multiple accepted IDs can cause the things we created Entities to avoid, but that possibility deserves a very close look before we write any code.

AJLinn commented 2 years ago

Sorry to miss all of this... guess I zoned out for a year...

Multiple IDs has none of those restrictions. The parka's IDs would include

  • parka: id by {parka expert} using {whatever method they use} sensu {some publication}, on {date}, confidence high, rank 1
  • Erignathus barbatus x Phoca hispida (it happens!) by {seal person} using {PCR} confidence high, rank 2
  • Alnus rubra by {botanist} using {plant magic}....., rank 2
  • Copper by {student} using "looks like copper" confidence low, rank 3
  • etc.

That should be interpreted as "mostly parka, then seal and alder, then copper." It contains the same taxa information as the A {string} ID (assuming the seal isn't a hybrid &etc., which isn't possible in the A {string} multitaxa approach), but

  1. there's ordering, and
  2. there's a place for metadata with the "component IDs"

So that means on this record instead of having this list with one set of metadata:

Screen Shot 2022-02-16 at 1 49 11 PM

I'd add multiple determinations, each with their own metadata. And the order in which I enter them is their "rank"? What happens if you need to correct one and add something new, like a correction... what happens to the order then?

dustymc commented 2 years ago

zoned out for a year..

It happens!

multiple determinations, each with their own metadata.

Yep.

order in which I enter them is their "rank"

No, rank would be some explicit new thing.

what happens to the order

You tell me, but I think whatever you want - I'm envisioning this as some sort of (probably nonunique) integer (which could be drag-n-drop or whatever in the UI) that you'd have complete control over. If you want some particular order then you can do that, if you don't care then it wouldn't add anything (other than the ability to have multiple accepted, which you could use or not as you wish).

I think I've flip-flopped a few times (at least in my head) on keeping the accepted flag around or replacing it with the rank, right now I'm leaning towards keeping it which seems like it would be a more explicit/less confusing way to say "nope, wasn't this at all" when necessary, even if it is almost-sorta redundant with the rank.

Jegelewicz commented 2 years ago

@AJLinn my idea was that you would have only two identifications - one with Parka and the other with all of the biological names using the A {string}, but then we would still need a way to prioritize them so that the cultural one showed up at the top.

dustymc commented 2 years ago

one with Parka and the other with all of the

I don't think this would be reason to get rid of the A {string} with lots of taxa thing, so yes, both

and

would work, and the first could incrementally be turned into the second as things get used and identifications get made.

AJLinn commented 2 years ago

I'll test it out and share my results.

dustymc commented 2 years ago

@DerekSikes the solution to the problem you brought up in the webinar is in here. Currently, you have to "accept" exactly one Identification in Arctos - this would allow you to accept both the old and your taxonomy update/split/whatever (which probably doesn't fit into DWC/GBIF in any way, but they'll eventually figure it out...).

dustymc commented 2 years ago

This keeps coming up, I can't see any insurmountable obstacles, suggest prioritizing.

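Suggest we continue to make the "accepted" distinction via

order>0: accepted (or "not rejected" or however you want to look at it), potentially one of many, order by identification_order (new field), and
order=0: not accepted ("is rejected," whatever), continue to treat as "different" in UIs and exports and such

And migrate with

accepted ==> order=1
unaccepted ==> order=0

which would be no functional change.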

Scrunching multiple accepted and ranked identification.scientific_name down into a single string for the UI/export/etc. that requires such a thing would still be a bit ugly, but that's "just UI." string_agg(identification.scientific_name,' | ' order by identification_order,identification.scientific_name) seems a useful starting point (it results in no immediate change with the migration path suggested above). This should be accompanied by making sure that full/complex identifications are available in some way from wherever the scrunched strings can be found.
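For readers who want to see what that `string_agg` expression would produce, here is a rough Python equivalent. It is a sketch under assumptions: rows are plain dicts, the `scrunch` helper is a hypothetical name, and rows with `identification_order` 0 are filtered out first (the thread does not say where that filter would live). The ranks in the sample data are invented for illustration.

```python
def scrunch(rows, separator=" | "):
    """Join accepted scientific names, ordered by (identification_order, scientific_name)."""
    accepted = [r for r in rows if r["identification_order"] > 0]  # assumed filter
    ordered = sorted(
        accepted,
        key=lambda r: (r["identification_order"], r["scientific_name"]),
    )
    return separator.join(r["scientific_name"] for r in ordered)


rows = [
    {"scientific_name": "Theropoda", "identification_order": 1},
    {"scientific_name": "Evazoum sirigui", "identification_order": 1},
    {"scientific_name": "Tetrapoda", "identification_order": 1},
    {"scientific_name": "Grallator cursorius", "identification_order": 1},
    {"scientific_name": "some rejected name", "identification_order": 0},
]
flat_name = scrunch(rows)
```

Because the scientific name is the secondary sort key, equally ranked IDs always come out in the same order, which keeps the flattened string stable between exports.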

Jegelewicz commented 2 years ago

We should be sending the Identification History extension along with occurrences for the aggregators.

dustymc commented 2 years ago

I suspect that's what's happening behind the scenes right now - that @tucotuco fella has been asking a lot of identification-related questions lately, anyway.

(And see https://github.com/ArctosDB/internal/issues/185, this is a good example of the sorts of things that I think maybe should be separated, both so they don't distract from the core issue and so they're not lost when this is closed as the core issue is addressed.)

Jegelewicz commented 2 years ago

Let's use a project and/or the dwc terms label for this stuff. I just don't need more repos to monitor if we can help it.

dustymc commented 2 years ago

Changing title to better reflect proposed implementation.

Jegelewicz commented 2 years ago

Will this mean there can only be one accepted ==> order=1?

dustymc commented 2 years ago

For migration, yes (because that's what's in the source/current data). Post-migration, see https://github.com/ArctosDB/arctos/issues/3540#issuecomment-1146081349 - any number of IDs in any order would be acceptable.

campmlc commented 2 years ago

This sounds neat.

Jegelewicz commented 2 years ago

@dustymc just want to confirm that after the migration described here is done, we can assign 1 to multiple identifications? But can we also add other rankings (2, 3, etc.)?

@ArctosDB/taxonomy suggests we bring to the AWG for approval.

dustymc commented 2 years ago

Nothing's been written so it's all up for discussion, but my current thoughts are that IDs can have any nonunique value so yes "we like these 56 IDs equally" is completely valid. They could all be 'best' or 'worse' or somewhere in between, and there could be any number of groups of equally-ranked IDs.

The actual values (excepting zero) would be for sorting only - the only thing special about a "1" is that it sorts before a "2" in that particular record. Another record with a "9999" in the same position is identical, values don't imply anything other than within-record sort order. There would be no ability to say "this is 100% correct" or "this is 12% correct" just "this one then that one" among the accepted IDs of a single catalog record.

The only 'rank' with an absolute value would be 0, which would be sorted into its own pile (and treated about like we treat unaccepted IDs now - not necessarily wrong, just not something that's curatorially preferred for whatever reason).
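The semantics described above (zero is its own pile, every other value only sorts within a record) could be sketched like this. The `split_by_acceptedness` function and the `{name: rank}` dict shape are hypothetical, chosen only to make the point that a "1" and a "9999" in the same position mean the same thing.

```python
def split_by_acceptedness(ranked_names):
    """Split {name: rank} into (accepted names in sort order, not-accepted names)."""
    accepted = sorted(
        (name for name, rank in ranked_names.items() if rank > 0),
        key=lambda name: (ranked_names[name], name),
    )
    not_accepted = sorted(name for name, rank in ranked_names.items() if rank == 0)
    return accepted, not_accepted


# Two records using different numbers for the same relative order.
record_a = {"Parka": 1, "salmon": 2, "socks": 0}
record_b = {"Parka": 9999, "salmon": 10000, "socks": 0}
```

Both records yield the same result: Parka then salmon among the accepted IDs, with socks set aside.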

Jegelewicz commented 2 years ago

Issue summary for AWG

Currently only one accepted identification is allowed, but we have potential uses for more than one accepted ID (such as this and CHAS herbarium sheets with multiple specimens).

Dusty suggests:

IDs can have any nonunique value so yes "we like these 56 IDs equally" is completely valid. They could all be 'best' or 'worse' or somewhere in between, and there could be any number of groups of equally-ranked IDs.

The actual values (excepting zero) would be for sorting only - the only thing special about a "1" is that it sorts before a "2" in that particular record. Another record with a "9999" in the same position is identical, values don't imply anything other than within-record sort order. There would be no ability to say "this is 100% correct" or "this is 12% correct" just "this one then that one" among the accepted IDs of a single catalog record.

The only 'rank' with an absolute value would be 0, which would be sorted into its own pile (and treated about like we treat unaccepted IDs now - not necessarily wrong, just not something that's curatorially preferred for whatever reason).

and we would start by doing this:

we continue to make the "accepted" distinction via

order>0: accepted (or "not rejected" or however you want to look at it), potentially one of many, order by identification_order (new field), and
order=0: not accepted ("is rejected," whatever), continue to treat as "different" in UIs and exports and such

And migrate with

accepted ==> order=1
unaccepted ==> order=0

which would be no functional change.
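As a sanity check on the "no functional change" claim, the migration rule can be written out in a couple of lines. This is a sketch with hypothetical names; the existing accepted flag is modeled as a plain boolean, which is an assumption about how it is stored.

```python
def migrate_accepted_flag(is_accepted: bool) -> int:
    """Map the current binary flag onto the proposed identification_order."""
    return 1 if is_accepted else 0


legacy = [("Parka", True), ("socks", False), ("weeds", False)]
migrated = [(name, migrate_accepted_flag(flag)) for name, flag in legacy]

# After migration, "order > 0" selects exactly the IDs that were accepted before.
unchanged = all((order > 0) == flag for (_, order), (_, flag) in zip(migrated, legacy))
```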

Scrunching multiple accepted and ranked identification.scientific_name down into a single string for the UI/export/etc. that requires such a thing would still be a bit ugly, but that's "just UI." string_agg(identification.scientific_name,' | ' order by identification_order,identification.scientific_name) seems a useful starting point (it results in no immediate change with the migration path suggested above). This should be accompanied by making sure that full/complex identifications are available in some way from wherever the scrunched strings can be found.

After the initial migration, everyone would be free to re-order their identifications as they wish.

Jegelewicz commented 2 years ago

We will probably need to consider how we publish identification to aggregators AND publish the Darwin Core Identification History extension.

dustymc commented 2 years ago

how we publish identification to aggregators

As usual, we can put a lot of work into some arbitrary thing that won't much be used and can't possibly be a decent representation of the data, or we can just give them what we have and let the aggregators worry about how to cram it into whatever weird little pigeonholes they want to construct. I recommend focusing all efforts on the latter.

Nicole-Ridgwell-NMMNHS commented 2 years ago

Already, if you use an "and" identification, GBIF takes whatever is listed first and makes that the identification. If we concatenate for GBIF, if collections want to try to control what GBIF takes as the ID, they could make that ID 1, and all the others 2?

Identified by and Identified date might be more complicated?

campmlc commented 2 years ago

I support this.

Jegelewicz commented 1 year ago

AWG approved for implementation 2023-01-05

Setting to next task - do we need a milestone of AWG approved?

dustymc commented 1 year ago

For @DerekSikes : make sure a single 1 (highest-rank, whatever values turn out to be) and zero or more 0 (lowest rank) continue to function as current, and that that's documented.

dustymc commented 1 year ago

EDIT: remapped to 0-10 - IDK why 100 'ranks' would ever be necessary and it makes ugly dropdowns.

Tentatively mapped as

alter table identification add constraint ck_identification_order check (identification_order between 0 and 10);

I don't think the above ("highest-rank") is necessary, the '1' (currently 'accepted') will continue to work as it does, as long as it's accompanied by only '0' (current and future ~'not preferred').
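To illustrate what the 0-10 check constraint allows, here is a small sketch that uses Python's built-in `sqlite3` module as a stand-in for the real database. Everything except the constraint name, the `identification_order` column, and the 0-10 range is an assumption: the table is reduced to three columns, `identification_id` is a hypothetical key, and SQLite requires the constraint to be declared in `CREATE TABLE` rather than added with `ALTER TABLE`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE identification (
        identification_id INTEGER PRIMARY KEY,
        scientific_name TEXT NOT NULL,
        identification_order INTEGER
            CONSTRAINT ck_identification_order
            CHECK (identification_order BETWEEN 0 AND 10)
    )
    """
)


def try_insert(name, order):
    """Return True if the row satisfies the CHECK constraint, else False."""
    try:
        conn.execute(
            "INSERT INTO identification (scientific_name, identification_order) "
            "VALUES (?, ?)",
            (name, order),
        )
        return True
    except sqlite3.IntegrityError:
        return False


results = {order: try_insert("Parka", order) for order in (0, 1, 10, 11, -1)}
```

Values 0, 1, and 10 are accepted; 11 and -1 are rejected. A record with a single 1 and any number of 0s passes the constraint unchanged, which matches the backwards-compatible case described above.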