ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Feature Request - control what identifiers an agent may issue #7837

Closed dustymc closed 2 weeks ago

dustymc commented 3 months ago

Is your feature request related to a problem? Please describe.

See https://github.com/ArctosDB/arctos/issues/7808#issuecomment-2139730876

"Collection agents" (those who wear a collectionID) are issuing all sorts of identifiers. Most of this does not seem realistic to me. Probably other agents are also being recorded as issuing identifiers which they did not actually issue.

Describe what you're trying to accomplish

  1. Clean data.
  2. Make it easy for users to understand how to do things.

Describe the solution you'd like

Some mechanism to control what identifiers certain agents may issue.

Describe alternatives you've considered

Do nothing.

Additional context

I can dig around in the data if there's any interest in proceeding with this.

Priority

Not clear; preventing future tangled messes seems important to me, this has caused some problems already (https://github.com/ArctosDB/arctos/issues/7025), but both actionable (eg URLs) and local (eg, NK) identifiers can mostly function without this information so filtering is possibly of relatively little value.

Jegelewicz commented 3 months ago

I think this probably makes sense. It would be good to see an updated version of what the problems are currently.

dustymc commented 3 months ago

updated version

https://github.com/ArctosDB/arctos/issues/7808 will make it much easier to get what caught my eye, but I think it's probably not possible to know what the problems are until they become problematic. It's hard to imagine that random agents doing random stuff won't find a way to get weird, but IDK if that ==> "problem."

OK, dug around a bit, yea mess.

https://arctos.database.museum/search.cfm?id_issuedby=%3DWoodland%20Park%20Zoo&oidtype=collector%20number - collector number "refers to a person's field catalog" that's just wrong (but possibly/hopefully in some way which doesn't much change functionality).

https://arctos.database.museum/search.cfm?id_issuedby=%3DBeverly+J.+Witte&oidtype=field+number - don't think VP uses 'unique number assigned to a collecting event'

Lots of https://github.com/ArctosDB/arctos/issues/7836-involved messes (people are not institutions....)

Here's some data, maybe it'll lead somewhere, this seems to be enough to understand that the intersection of type and agent is at least sometimes arbitrary, which seems a bit sub-optimal to me:

https://docs.google.com/spreadsheets/d/1jdC08vXtbdNhVXDIUz2qwx8ZDkLwds0VpWr4mZrTvXk/edit#gid=714618271

From that I noticed 500+ malformed (==broken) GenBank links that are attributed to the correct agent (so controllable/detectable):

arctosprod@arctos>> select  count(*)   from coll_obj_other_id_num where issued_by_agent_id=21349032 and substr(display_value,0,37) != 'http://www.ncbi.nlm.nih.gov/nuccore/' ;
 count 
-------
   571
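
The query above is essentially a prefix test. Here is a minimal Python sketch of the same check, for illustration only: `has_nuccore_prefix` and `count_malformed` are hypothetical helpers, not Arctos code, and the sample values are made up. In PostgreSQL, `substr(display_value,0,37)` returns the first 36 characters, which is exactly the length of the prefix being compared.

```python
NUCCORE_PREFIX = "http://www.ncbi.nlm.nih.gov/nuccore/"  # 36 characters


def has_nuccore_prefix(display_value: str) -> bool:
    """Mirror the SQL test: compare the first 36 characters to the prefix."""
    return display_value[:len(NUCCORE_PREFIX)] == NUCCORE_PREFIX


def count_malformed(display_values) -> int:
    """Count values that do not start with the expected prefix."""
    return sum(1 for v in display_values if not has_nuccore_prefix(v))


# Hypothetical sample data, not taken from Arctos.
sample = [
    "http://www.ncbi.nlm.nih.gov/nuccore/EXAMPLE123",
    "SRR19593543",
    "https://www.ncbi.nlm.nih.gov/nuccore/EXAMPLE456",  # https: flagged by this strict test
]
print(count_malformed(sample))
```

A strict prefix test like this also flags `https` links, so some of the count may be links that are merely written differently rather than broken.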

Some of them don't seem to be that at all, BUT NCBI seems to have a pretty good 404 handler - I guessed that SRR19593543 (https://arctos.database.museum/search.cfm?id_issuedby=%3DNCBI%20Nucleotide%20(GenBank)&oidnum=%3DSRR19593543) should be http://www.ncbi.nlm.nih.gov/nuccore/SRR19593543 (it should not) and got magicked to https://www.ncbi.nlm.nih.gov/sra/SRR19593543 (which should have been issued by https://arctos.database.museum/agent/21349034, not https://arctos.database.museum/agent/21349032).

So yea, clearly a problem, at least partially involving our inability to complete the migration. I'll bump priority.

ewommack commented 3 months ago

collector number "refers to a person's field catalog"

That's how we use it here at the UWYMV, and since it refers to a person's individual catalog I would think you would want an Agent to be the one associated with that collector number. The list you provided seemed to have a lot of examples of how I thought the new system was supposed to be used?

dustymc commented 3 months ago

lot of examples of how I thought the new system was supposed to be used?

Yup, most of the list seems to be just fine, and yay us for that. There are no filters on my query, that's just every agent who's issued anything and the types they've issued, a first-step exploratory view.

ewommack commented 3 months ago

Oh phew. I thought things were changing again.

mkoo commented 3 months ago

Thanks-- I took a look at the google spreadsheet. It's helpful but also seems to include a lot of legit data too. Here's one way to view the data: Organizations vs People

In Orgs, these types (identifier and institutional catalog number) are redundant so merge/ cleanup seems simple.

For people, many have both collector number and preparator number. That's a curatorial distinction that I foresee we're not going to be able to solve here in the present day. Maybe in the future. Ideally we'd just call it catalog number or personal catalog or something else, since obviously it's usually the same series of numbers; it's just traditional to carry a little more distinction between someone who just prepped the specimen vs. someone who did field work (thus, do you bother looking for a field journal or not?). We do have better ways to distinguish of course (agent roles), but the identifier_type? Not the best place for that! But sure, I predict a fight to retain this ancient methodology.

So maybe we limit people to just two types moving forward, and propose a clean-up to remove the low-frequency ones like processing number, field number, etc. I suggest the CT committee review tomorrow.

dustymc commented 3 months ago

lot of legit data

Yes, https://github.com/ArctosDB/arctos/issues/7837#issuecomment-2148081522

(identifier and institutional catalog number) are redundant

Yes, these are used completely interchangeably/arbitrarily, see also https://github.com/ArctosDB/arctos/issues/7836, cleaning up any bit of this mess brings clarity to the rest.

people, many have both collector number and preparator number.

Yep, no problem.

field number

Yea, I have no idea what to do about that. 99.9999% of the usage outside fish collections is just wrong, but I'd not want to try to write code to catch that either. Call it small fry and ignore for now...

Organizations vs People

Interesting, and that would catch much of the most-obvious "this can't possibly be right...." usage. BUT....

See eg https://github.com/ArctosDB/arctos/issues/7649, there are a ton of clearly-not-people agents entered as people, our not-great data kinda always contaminates something else...

Jegelewicz commented 3 months ago

fish collections ... small fry

I see what you did there....

dustymc commented 2 months ago

https://github.com/ArctosDB/arctos/issues/7649#issuecomment-2195040148 + https://github.com/ArctosDB/arctos/issues/7836 ==> https://arctos.database.museum/guid/MSB:Mamm:145728

Screenshot 2024-06-27 at 14 30 34

A low-quality person-agent acting as an institution doesn't seem optimal.

dustymc commented 3 weeks ago

I'm going next task on this, I can't see any possible drawbacks, there are clear problems in existing data that this would prevent.

campmlc commented 3 weeks ago

This happened just this week in our collections, where a CM added themselves as both "issued by" and "assigned by" agent to a USGS record with the ID type as "identifier". The USGS ID is actually an "institutional catalog number" but we've been discouraged from using that. Had that been used instead of "identifier", it would have been clear that the USGS number was issued by the institution but assigned by a person. But with the use of "identifier" as the ID type, people are getting confused, and rightly so.

dustymc commented 3 weeks ago

This looks like it's going to work reasonably well, and can be expanded as needed. I'll get it integrated with the triggers for next release.

There's some first-pass data in https://docs.google.com/spreadsheets/d/1zJXr-UTYc5fyNpVp80z7x0cNHkie71I-HmQO028Fll4/edit?gid=1602420022#gid=1602420022, summary below. Please let me know of any problems in the check, and of course if I can help clean anything up. (No cleanup is necessary, data which doesn't follow appropriate patterns just won't be able to save after this goes through.)


 count | guid_prefix | check_status
-------+-------------+--------------
     2 | BSUNH:Mamm  | collector number and preparator number may only be issued by person agents
    37 | CHAS:Bird   | collector number and preparator number may only be issued by person agents
     1 | CHAS:Mamm   | identifiers which contain guid/MCZ:Mamm may only be issued by MCZ:Mamm (https://arctos.database.museum/agent/21355896)
     1 | CHAS:Teach  | collector number and preparator number may only be issued by person agents
     6 | CRCM:Mamm   | collector number and preparator number may only be issued by person agents
    30 | DMNS:Bird   | collector number and preparator number may only be issued by person agents
    32 | DMNS:Inv    | collector number and preparator number may only be issued by person agents
     2 | DMNS:Mamm   | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
     4 | DMNS:Mamm   | collector number and preparator number may only be issued by person agents
     1 | DMNS:Mamm   | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
    11 | DMNS:Mamm   | identifiers which contain ncbi.nlm.nih.gov/bioproject may only be issued by NCBI BioProject (https://arctos.database.museum/agent/21349072)
     1 | DMNS:Mamm   | identifiers which contain ncbi.nlm.nih.gov/bioproject may only be issued by NCBI BioProject (https://arctos.database.museum/agent/21349072) | NCBI BioSample may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/biosample/{code}
    10 | DMNS:Para   | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
    12 | HSUVM:Mamm  | collector number and preparator number may only be issued by person agents
   207 | KSB:Mamm    | collector number and preparator number may only be issued by person agents
     1 | KSB:Teach   | collector number and preparator number may only be issued by person agents
     1 | MLZ:Bird    | collector number and preparator number may only be issued by person agents
   180 | MMNH:Bird   | collector number and preparator number may only be issued by person agents
    57 | MMNH:Edu    | collector number and preparator number may only be issued by person agents
     6 | MMNH:Mamm   | collector number and preparator number may only be issued by person agents
     4 | MSB:Bird    | Local identifiers may not have issued_by_agent_id
   174 | MSB:Host    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
     1 | MSB:Host    | identifiers which contain guid/MCZ:Mamm may only be issued by MCZ:Mamm (https://arctos.database.museum/agent/21355896)
   130 | MSB:Mamm    | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
     2 | MSB:Mamm    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
     1 | MSB:Mamm    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code} | identifiers which contain ncbi.nlm.nih.gov/biosample may only be issued by NCBI BioSample (https://arctos.database.museum/agent/21348953)
     1 | MSB:Mamm    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code} | identifiers which contain ncbi.nlm.nih.gov/sra or ncbi.nlm.nih.gov/Traces/sra may only be issued by NCBI SRA (https://arctos.database.museum/agent/21349034)
     1 | MSB:Mamm    | identifiers which contain ncbi.nlm.nih.gov/nuccore may only be issued by GenBank (https://arctos.database.museum/agent/21349032)
   216 | MSB:Para    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
     1 | MSB:Para    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code} | identifiers which contain ncbi.nlm.nih.gov/sra or ncbi.nlm.nih.gov/Traces/sra may only be issued by NCBI SRA (https://arctos.database.museum/agent/21349034)
    24 | MSB:Para    | identifiers which contain guid/MCZ:Orn may only be issued by MCZ:Orn (https://arctos.database.museum/agent/21355897)
    70 | MVZ:Bird    | collector number and preparator number may only be issued by person agents
     2 | MVZ:Egg     | collector number and preparator number may only be issued by person agents
     3 | MVZ:Fish    | collector number and preparator number may only be issued by person agents
   253 | MVZ:Herp    | collector number and preparator number may only be issued by person agents
   805 | MVZ:Mamm    | collector number and preparator number may only be issued by person agents
     1 | MVZ:Mamm    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
     4 | NMMNH:Bird  | collector number and preparator number may only be issued by person agents
     2 | NMU:Para    | Local identifiers may not have issued_by_agent_id
     1 | OGL:Genomic | MCZ:Mala may only issue identifiers of the pattern http[s]://mczbase.mcz.harvard.edu/guid/MCZ:Mala:{code}
     1 | UAMb:Herb   | NCBI BioSample may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/biosample/{code}
   322 | UAM:Ento    | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
     1 | UAM:Herb    | Local identifiers may not have issued_by_agent_id
     1 | UAM:Mamm    | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
    41 | UAM:Mamm    | Local identifiers may not have issued_by_agent_id
     1 | UAMObs:Ento | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
   133 | UCM:Bird    | collector number and preparator number may only be issued by person agents
  3193 | UCM:Mamm    | collector number and preparator number may only be issued by person agents
     3 | UCM:Mamm    | identifiers which contain guid/MCZ:Mamm may only be issued by MCZ:Mamm (https://arctos.database.museum/agent/21355896)
     4 | UMNH:Herp   | Local identifiers may not have issued_by_agent_id
    51 | UMZM:Bird   | collector number and preparator number may only be issued by person agents
     6 | UMZM:Egg    | collector number and preparator number may only be issued by person agents
   164 | UMZM:Mamm   | collector number and preparator number may only be issued by person agents
     1 | UTEP:Ento   | collector number and preparator number may only be issued by person agents
     1 | UTEP:Herb   | identifiers which contain ncbi.nlm.nih.gov/nuccore may only be issued by GenBank (https://arctos.database.museum/agent/21349032)
     1 | UTEP:Herp   | identifiers which contain ncbi.nlm.nih.gov/nuccore may only be issued by GenBank (https://arctos.database.museum/agent/21349032)
     1 | UTEP:Inv    | MCZ:Mala may only issue identifiers of the pattern http[s]://mczbase.mcz.harvard.edu/guid/MCZ:Mala:{code}
    14 | UWBM:Mamm   | collector number and preparator number may only be issued by person agents
     1 | UWYMV:Fish  | collector number and preparator number may only be issued by person agents
     2 | UWYMV:Herp  | collector number and preparator number may only be issued by person agents
     3 | UWYMV:Mamm  | collector number and preparator number may only be issued by person agents
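
The check_status messages above suggest a handful of rule shapes: issuer-specific URL patterns, "contains X, so only agent Y may issue it", person-only types, and no issuer on local identifiers. The following Python sketch illustrates those shapes only. It is not the actual trigger code; the agent IDs are the ones quoted above, while `LOCAL_TYPES`, the `issuer_is_person` flag, the treatment of `{code}` as any non-whitespace run, and the message wording are assumptions.

```python
import re
from typing import List, Optional

# Agent IDs as quoted in the summary above.
GENBANK = 21349032
NCBI_SRA = 21349034

# Assumptions for illustration only.
LOCAL_TYPES = {"NK"}
PERSON_ONLY_TYPES = {"collector number", "preparator number"}

# Issuer -> (label, pattern that every identifier it issues must match).
ISSUER_PATTERNS = {
    GENBANK: ("GenBank", re.compile(r"https?://(www\.)?ncbi\.nlm\.nih\.gov/nuccore/\S+")),
}

# (substrings, required issuer, label): values containing a substring
# may only be issued by the required issuer.
CONTAINS_RULES = [
    (("ncbi.nlm.nih.gov/nuccore",), GENBANK, "GenBank"),
    (("ncbi.nlm.nih.gov/sra", "ncbi.nlm.nih.gov/Traces/sra"), NCBI_SRA, "NCBI SRA"),
]


def check_identifier(id_type: str, value: str, issuer_id: Optional[int],
                     issuer_is_person: bool = False) -> List[str]:
    """Return a list of rule violations; an empty list means the row may save."""
    problems = []
    if id_type in LOCAL_TYPES and issuer_id is not None:
        problems.append("Local identifiers may not have issued_by_agent_id")
    if id_type in PERSON_ONLY_TYPES and issuer_id is not None and not issuer_is_person:
        problems.append("collector number and preparator number may only be issued by person agents")
    if issuer_id in ISSUER_PATTERNS:
        label, pattern = ISSUER_PATTERNS[issuer_id]
        if not pattern.fullmatch(value):
            problems.append(f"{label} identifiers must match the expected URL pattern")
    for needles, required_issuer, label in CONTAINS_RULES:
        if any(n in value for n in needles) and issuer_id != required_issuer:
            problems.append(f"identifiers containing {needles[0]} may only be issued by {label}")
    return problems


# An SRA link attributed to GenBank trips two rules, like the combined rows above.
print(" | ".join(check_identifier(
    "identifier", "https://www.ncbi.nlm.nih.gov/sra/SRR19593543", GENBANK)))
```

Returning every violation rather than stopping at the first mirrors the rows above where two messages are joined with `|`.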

campmlc commented 2 weeks ago

So what happens to the identifiers in the attached file? I have tried several times to get rid of the issued by agent assigned to NK numbers, but I cannot delete the agent. I get the error below. It seems this solution prevents anyone from making any corrections to these data?

ERROR_ID | 4DB4125E-11C1-4BF3-8AC12707F81F0790
ERROR_TYPE | SQL
ERROR_MESSAGE | ERROR: Local identifiers may not have issued_by_agent_id Where: PL/pgSQL function trigger_fct_coll_obj_data_check() line 20 at RAISE
ERROR_DETAIL |
ERROR_SQL | UPDATE coll_obj_other_id_num SET other_id_type='NK', display_value='282832', id_references='self', issued_by_agent_id=21298561, remarks=null WHERE COLL_OBJ_OTHER_ID_NUM_ID=17754022

campmlc commented 2 weeks ago

With these restrictions in place it is no longer possible to make any additions or edits to any identifiers in these records without triggering the error. The only solution CMs have is to delete the entire identifier and re-enter, losing metadata. I suggest that going forward any restriction that would prevent management at the collection level be presented at the AWG and notification be sent to affected collections to allow cleanup prior to implementation. @mkoo

dustymc commented 2 weeks ago

It looks like the agent pick doesn't allow selecting NULL. I can add that, and there is always an offer to help clean data. Would you like me to remove issued by from the NKs that have it?
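
A rough Python sketch of why edits were blocked and why a NULL option unblocks them, under the assumption that the edit form always resubmits the stored agent. `save_identifier` and `IdentifierCheckError` are hypothetical stand-ins for the UPDATE and the trigger, and treating NK as the only local type is an assumption; the row values come from the error reported above.

```python
from typing import Optional

LOCAL_TYPES = {"NK"}  # assumption: which types count as "local"


class IdentifierCheckError(ValueError):
    """Stand-in for the error raised by the database trigger."""


def save_identifier(row: dict, issued_by_agent_id: Optional[int]) -> dict:
    """Return an updated copy of row, or raise if the check fails."""
    if row["other_id_type"] in LOCAL_TYPES and issued_by_agent_id is not None:
        raise IdentifierCheckError("Local identifiers may not have issued_by_agent_id")
    return {**row, "issued_by_agent_id": issued_by_agent_id}


# Row values taken from the error reported above.
existing = {"other_id_type": "NK", "display_value": "282832", "issued_by_agent_id": 21298561}

# Resubmitting the stored agent (as a form without a NULL option would) fails.
try:
    save_identifier(existing, existing["issued_by_agent_id"])
    resave_ok = True
except IdentifierCheckError:
    resave_ok = False

# Submitting no agent passes the check and clears the value.
cleaned = save_identifier(existing, None)
print(resave_ok, cleaned["issued_by_agent_id"])
```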

campmlc commented 2 weeks ago

Yes, please allow selecting NULL, and yes, please remove issued by from the NKs that have it, thanks. We still need to reach out to collections with broken links to give them the opportunity to make corrections, through a targeted email.

dustymc commented 2 weeks ago

Next release:

Screenshot 2024-08-28 at 08 27 44 Screenshot 2024-08-28 at 08 27 52 Screenshot 2024-08-28 at 08 28 00

dustymc commented 2 weeks ago

remove issued by from the NKs

temp_nk_ib.csv

UPDATE 4

campmlc commented 2 weeks ago

Problem, see #8054