Closed: dustymc closed this 2 weeks ago
I think this probably makes sense. It would be good to see an updated version of what the problems currently are.
> updated version
https://github.com/ArctosDB/arctos/issues/7808 will make it much easier to get what caught my eye, but I think it's probably not possible to know what the problems are until they become problematic. It's hard to imagine that random agents doing random stuff won't find a way to get weird, but IDK if that ==> "problem."
OK, dug around a bit, yea mess.
https://arctos.database.museum/search.cfm?id_issuedby=%3DWoodland%20Park%20Zoo&oidtype=collector%20number - collector number "refers to a person's field catalog" that's just wrong (but possibly/hopefully in some way which doesn't much change functionality).
https://arctos.database.museum/search.cfm?id_issuedby=%3DBeverly+J.+Witte&oidtype=field+number - don't think VP uses 'unique number assigned to a collecting event'
Lots of https://github.com/ArctosDB/arctos/issues/7836-involved messes (people are not institutions....)
Here's some data; maybe it'll lead somewhere. It seems to be enough to show that the intersection of type and agent is at least sometimes arbitrary, which seems a bit sub-optimal to me:
From that I noticed 500+ malformed (==broken) GenBank links that are attributed to the correct agent (so controllable/detectable)
arctosprod@arctos>> select count(*) from coll_obj_other_id_num where issued_by_agent_id=21349032 and substr(display_value,0,37) != 'http://www.ncbi.nlm.nih.gov/nuccore/' ;
count
-------
571
Some of them don't seem to be that at all, BUT NCBI seems to have a pretty good 404 handler - I guessed that SRR19593543 (https://arctos.database.museum/search.cfm?id_issuedby=%3DNCBI%20Nucleotide%20(GenBank)&oidnum=%3DSRR19593543) should be http://www.ncbi.nlm.nih.gov/nuccore/SRR19593543 (it should not) and got magicked to https://www.ncbi.nlm.nih.gov/sra/SRR19593543 (which should have been issued by https://arctos.database.museum/agent/21349034, not https://arctos.database.museum/agent/21349032).
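The SRR19593543 case above can be sketched as a lookup from URL fragment to the agent that should have issued it. This is only an illustration, not Arctos code: the function names and the `EXPECTED_ISSUER` table are hypothetical, and only the two agent IDs mentioned in this comment are included.

```python
from typing import Optional, Tuple

# URL fragment -> (label, agent_id). The agent IDs are the two named in this
# comment; the labels and the table itself are illustrative assumptions.
EXPECTED_ISSUER = [
    ("ncbi.nlm.nih.gov/nuccore", ("GenBank", 21349032)),
    ("ncbi.nlm.nih.gov/sra", ("SRA", 21349034)),
]


def expected_issuer(display_value: str) -> Optional[Tuple[str, int]]:
    """Return the (label, agent_id) implied by the URL, or None if no rule applies."""
    for fragment, agent in EXPECTED_ISSUER:
        if fragment in display_value:
            return agent
    return None


def issuer_mismatch(display_value: str, issued_by_agent_id: Optional[int]) -> bool:
    """True when the URL implies one issuer but the record names a different one."""
    agent = expected_issuer(display_value)
    return agent is not None and agent[1] != issued_by_agent_id
```

Under these assumptions, `issuer_mismatch("https://www.ncbi.nlm.nih.gov/sra/SRR19593543", 21349032)` flags the record, while a bare `SRR19593543` with no URL matches no rule and is left alone.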
So yea, clearly a problem, at least partially involving our inability to complete the migration. I'll bump priority.
> collector number "refers to a person's field catalog"
That's how we use it here at the UWYMV, and since it refers to a person's individual catalog I would think you would want an Agent to be the one associated with that collector number. The list you provided seemed to have a lot of examples of how I thought the new system was supposed to be used?
> lot of examples of how I thought the new system was supposed to be used?
Yup, most of the list seems to be just fine, and yay us for that. There are no filters on my query, that's just every agent who's issued anything and the types they've issued, a first-step exploratory view.
Oh phew. I thought things were changing again.
Thanks-- I took a look at the google spreadsheet. It's helpful but also seems to include a lot of legit data too. Here's one way to view the data: Organizations vs People
In Orgs, these types (identifier and institutional catalog number) are redundant so merge/ cleanup seems simple.
For people, many have both collector number and preparator number. That's a curatorial distinction that I foresee we are not going to be able to resolve here in the present day. Maybe in the future. Ideally we'd just call it catalog number or personal catalog or something else, since obviously it's usually the same series of numbers; it's just traditional to carry a little more distinction between someone who just prepped the specimen and someone who did field work (thus, do you bother looking for a field journal or not?). We do have better ways to distinguish, of course (agent roles), but the identifier_type? Not the best place for that! But sure, I predict a fight to retain this ancient methodology.
So maybe we limit people to just two types moving forward, and propose a clean-up that removes the low-frequency ones like processing number, field number, etc. I suggest the CT committee review tomorrow.
> lot of legit data
Yes, https://github.com/ArctosDB/arctos/issues/7837#issuecomment-2148081522
> (identifier and institutional catalog number) are redundant
Yes, these are used completely interchangeably/arbitrarily, see also https://github.com/ArctosDB/arctos/issues/7836, cleaning up any bit of this mess brings clarity to the rest.
> people, many have both collector number and preparator number.
Yep, no problem.
> field number
Yea, I have no idea what to do about that. 99.9999% of the usage outside fish collections is just wrong, but I'd not want to try to write code to catch that either. Call it small fry and ignore for now...
> Organizations vs People
Interesting, and that would catch much of the most-obvious "this can't possibly be right...." usage. BUT....
See eg https://github.com/ArctosDB/arctos/issues/7649, there are a ton of clearly-not-people agents entered as people, our not-great data kinda always contaminates something else...
> fish collections ... small fry
I see what you did there....
https://github.com/ArctosDB/arctos/issues/7649#issuecomment-2195040148 + https://github.com/ArctosDB/arctos/issues/7836 ==> https://arctos.database.museum/guid/MSB:Mamm:145728
A low-quality person-agent acting as an institution doesn't seem optimal.
I'm making this a next task: I can't see any possible drawbacks, and there are clear problems in existing data that this would prevent.
This happened just this week in our collections, where a CM added themselves as both the issued-by and the assigned-by agent on a USGS record with the ID type "identifier". The USGS ID is actually an "institutional catalog number", but we've been discouraged from using that type. Had it been used instead of "identifier", it would have been clear that the USGS number was issued by the institution but assigned by a person. With "identifier" as the ID type, people are getting confused, and rightly so.
This looks like it's going to work reasonably well, and can be expanded as needed. I'll get it integrated with the triggers for next release.
There's some first-pass data in https://docs.google.com/spreadsheets/d/1zJXr-UTYc5fyNpVp80z7x0cNHkie71I-HmQO028Fll4/edit?gid=1602420022#gid=1602420022, summary below. Please let me know of any problems in the check, and of course if I can help clean anything up. (No cleanup is necessary, data which doesn't follow appropriate patterns just won't be able to save after this goes through.)
count | guid_prefix | check_status
-------+-------------+--------------
2 | BSUNH:Mamm | collector number and preparator number may only be issued by person agents
37 | CHAS:Bird | collector number and preparator number may only be issued by person agents
1 | CHAS:Mamm | identifiers which contain guid/MCZ:Mamm may only be issued by MCZ:Mamm (https://arctos.database.museum/agent/21355896)
1 | CHAS:Teach | collector number and preparator number may only be issued by person agents
6 | CRCM:Mamm | collector number and preparator number may only be issued by person agents
30 | DMNS:Bird | collector number and preparator number may only be issued by person agents
32 | DMNS:Inv | collector number and preparator number may only be issued by person agents
2 | DMNS:Mamm | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
4 | DMNS:Mamm | collector number and preparator number may only be issued by person agents
1 | DMNS:Mamm | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
11 | DMNS:Mamm | identifiers which contain ncbi.nlm.nih.gov/bioproject may only be issued by NCBI BioProject (https://arctos.database.museum/agent/21349072)
1 | DMNS:Mamm | identifiers which contain ncbi.nlm.nih.gov/bioproject may only be issued by NCBI BioProject (https://arctos.database.museum/agent/21349072) | NCBI BioSample may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/biosample/{code}
10 | DMNS:Para | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
12 | HSUVM:Mamm | collector number and preparator number may only be issued by person agents
207 | KSB:Mamm | collector number and preparator number may only be issued by person agents
1 | KSB:Teach | collector number and preparator number may only be issued by person agents
1 | MLZ:Bird | collector number and preparator number may only be issued by person agents
180 | MMNH:Bird | collector number and preparator number may only be issued by person agents
57 | MMNH:Edu | collector number and preparator number may only be issued by person agents
6 | MMNH:Mamm | collector number and preparator number may only be issued by person agents
4 | MSB:Bird | Local identifiers may not have issued_by_agent_id
174 | MSB:Host | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
1 | MSB:Host | identifiers which contain guid/MCZ:Mamm may only be issued by MCZ:Mamm (https://arctos.database.museum/agent/21355896)
130 | MSB:Mamm | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
2 | MSB:Mamm | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
1 | MSB:Mamm | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code} | identifiers which contain ncbi.nlm.nih.gov/biosample may only be issued by NCBI BioSample (https://arctos.database.museum/agent/21348953)
1 | MSB:Mamm | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code} | identifiers which contain ncbi.nlm.nih.gov/sra or ncbi.nlm.nih.gov/Traces/sra may only be issued by NCBI SRA (https://arctos.database.museum/agent/21349034)
1 | MSB:Mamm | identifiers which contain ncbi.nlm.nih.gov/nuccore may only be issued by GenBank (https://arctos.database.museum/agent/21349032)
216 | MSB:Para | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
1 | MSB:Para | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code} | identifiers which contain ncbi.nlm.nih.gov/sra or ncbi.nlm.nih.gov/Traces/sra may only be issued by NCBI SRA (https://arctos.database.museum/agent/21349034)
24 | MSB:Para | identifiers which contain guid/MCZ:Orn may only be issued by MCZ:Orn (https://arctos.database.museum/agent/21355897)
70 | MVZ:Bird | collector number and preparator number may only be issued by person agents
2 | MVZ:Egg | collector number and preparator number may only be issued by person agents
3 | MVZ:Fish | collector number and preparator number may only be issued by person agents
253 | MVZ:Herp | collector number and preparator number may only be issued by person agents
805 | MVZ:Mamm | collector number and preparator number may only be issued by person agents
1 | MVZ:Mamm | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
4 | NMMNH:Bird | collector number and preparator number may only be issued by person agents
2 | NMU:Para | Local identifiers may not have issued_by_agent_id
1 | OGL:Genomic | MCZ:Mala may only issue identifiers of the pattern http[s]://mczbase.mcz.harvard.edu/guid/MCZ:Mala:{code}
1 | UAMb:Herb | NCBI BioSample may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/biosample/{code}
322 | UAM:Ento | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
1 | UAM:Herb | Local identifiers may not have issued_by_agent_id
1 | UAM:Mamm | GenBank may only issue identifiers of the pattern http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}
41 | UAM:Mamm | Local identifiers may not have issued_by_agent_id
1 | UAMObs:Ento | BoLD may only issue identifiers of the pattern http[s]://[www].boldsystems.org/index.php/Public_RecordView?processid={code} or http[s]://[www].boldsystems.org/connectivity/specimenlookup.php?processid={code}
133 | UCM:Bird | collector number and preparator number may only be issued by person agents
3193 | UCM:Mamm | collector number and preparator number may only be issued by person agents
3 | UCM:Mamm | identifiers which contain guid/MCZ:Mamm may only be issued by MCZ:Mamm (https://arctos.database.museum/agent/21355896)
4 | UMNH:Herp | Local identifiers may not have issued_by_agent_id
51 | UMZM:Bird | collector number and preparator number may only be issued by person agents
6 | UMZM:Egg | collector number and preparator number may only be issued by person agents
164 | UMZM:Mamm | collector number and preparator number may only be issued by person agents
1 | UTEP:Ento | collector number and preparator number may only be issued by person agents
1 | UTEP:Herb | identifiers which contain ncbi.nlm.nih.gov/nuccore may only be issued by GenBank (https://arctos.database.museum/agent/21349032)
1 | UTEP:Herp | identifiers which contain ncbi.nlm.nih.gov/nuccore may only be issued by GenBank (https://arctos.database.museum/agent/21349032)
1 | UTEP:Inv | MCZ:Mala may only issue identifiers of the pattern http[s]://mczbase.mcz.harvard.edu/guid/MCZ:Mala:{code}
14 | UWBM:Mamm | collector number and preparator number may only be issued by person agents
1 | UWYMV:Fish | collector number and preparator number may only be issued by person agents
2 | UWYMV:Herp | collector number and preparator number may only be issued by person agents
3 | UWYMV:Mamm | collector number and preparator number may only be issued by person agents
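The kinds of rules behind the check_status messages above can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the actual trigger logic: `check_identifier`, the `agent_type` argument, treating `NK` as the only local type, and the regex reading of `{code}` are all hypothetical, and only the GenBank and SRA rules are included.

```python
import re
from typing import List, Optional

GENBANK_AGENT_ID = 21349032  # from the GenBank rows above
SRA_AGENT_ID = 21349034      # from the NCBI SRA rows above

PERSON_ONLY_TYPES = {"collector number", "preparator number"}
LOCAL_TYPES = {"NK"}  # assumption: which types count as "local" is not spelled out here

# agent_id -> (name, pattern as shown in the messages, regex approximating it)
ISSUER_PATTERNS = {
    GENBANK_AGENT_ID: (
        "GenBank",
        "http[s]://[www].ncbi.nlm.nih.gov/nuccore/{code}",
        re.compile(r"^https?://(www\.)?ncbi\.nlm\.nih\.gov/nuccore/[^/\s]+$"),
    ),
}

# (fragments, required issuer name, required issuer agent_id)
CONTAINS_RULES = [
    (("ncbi.nlm.nih.gov/nuccore",), "GenBank", GENBANK_AGENT_ID),
    (("ncbi.nlm.nih.gov/sra", "ncbi.nlm.nih.gov/Traces/sra"), "NCBI SRA", SRA_AGENT_ID),
]


def check_identifier(id_type: str, display_value: str,
                     issued_by_agent_id: Optional[int] = None,
                     agent_type: Optional[str] = None) -> List[str]:
    """Return the list of rule violations for one identifier (empty list = OK)."""
    problems = []
    if id_type in LOCAL_TYPES and issued_by_agent_id is not None:
        problems.append("Local identifiers may not have issued_by_agent_id")
    if (id_type in PERSON_ONLY_TYPES and issued_by_agent_id is not None
            and agent_type != "person"):
        problems.append("collector number and preparator number "
                        "may only be issued by person agents")
    if issued_by_agent_id in ISSUER_PATTERNS:
        name, shown, regex = ISSUER_PATTERNS[issued_by_agent_id]
        if not regex.match(display_value):
            problems.append(f"{name} may only issue identifiers of the pattern {shown}")
    for fragments, name, agent_id in CONTAINS_RULES:
        if any(f in display_value for f in fragments) and issued_by_agent_id != agent_id:
            problems.append(f"identifiers which contain {' or '.join(fragments)} "
                            f"may only be issued by {name}")
    return problems
```

Joining the returned list with `" | "` gives a status string shaped like the multi-rule rows above; for example, an SRA URL attributed to GenBank trips both the GenBank pattern rule and the SRA issuer rule.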
So what happens to the identifiers in the attached file? I have tried several times to get rid of the issued by agent assigned to NK numbers, but I cannot delete the agent. I get the error below. It seems this solution prevents anyone from making any corrections to these data?
ERROR_ID | 4DB4125E-11C1-4BF3-8AC12707F81F0790
-- | --
ERROR_TYPE | SQL
ERROR_MESSAGE | ERROR: Local identifiers may not have issued_by_agent_id Where: PL/pgSQL function trigger_fct_coll_obj_data_check() line 20 at RAISE
ERROR_DETAIL |
ERROR_SQL | UPDATE coll_obj_other_id_num SET other_id_type='NK', display_value='282832', id_references='self', issued_by_agent_id=21298561, remarks=null WHERE COLL_OBJ_OTHER_ID_NUM_ID=17754022

With these restrictions in place it is no longer possible to make any additions or edits to any identifiers in these records without triggering the error. The only solution CMs have is to delete the entire identifier and re-enter, losing metadata. I suggest that going forward any restriction that would prevent management at the collection level be presented at the AWG and notification be sent to affected collections to allow cleanup prior to implementation. @mkoo
It looks like the agent pick doesn't allow selecting NULL. I can add that, and there is always an offer to help clean data. Would you like me to remove issued by from the NKs that have it?
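A rough sketch of why every edit fails until the issuer can be cleared, assuming (as the UPDATE in the error above suggests) that the form resubmits the whole row and the check sees all of it. `save_identifier` and `LOCAL_TYPES` are hypothetical stand-ins, not the real form or trigger code; the column names are the ones in the failing UPDATE.

```python
LOCAL_TYPES = {"NK"}  # assumption: NK is treated as a local identifier type


def save_identifier(row: dict, **changes) -> dict:
    """Apply changes to a copy of the row, then validate the full result."""
    updated = {**row, **changes}
    if (updated["other_id_type"] in LOCAL_TYPES
            and updated["issued_by_agent_id"] is not None):
        raise ValueError("Local identifiers may not have issued_by_agent_id")
    return updated


# Legacy row shaped like the one in the failing UPDATE above.
legacy = {
    "other_id_type": "NK",
    "display_value": "282832",
    "id_references": "self",
    "issued_by_agent_id": 21298561,
    "remarks": None,
}

# Any edit that leaves the issuer in place is rejected, even a remarks-only one.
try:
    save_identifier(legacy, remarks="checked against field notes")
    edit_rejected = False
except ValueError:
    edit_rejected = True

# Once NULL (None here) can be chosen for the issuer, the same row saves.
fixed = save_identifier(legacy, issued_by_agent_id=None)
```

In this sketch the row is only unblocked by an update that sets the issuer to NULL, which is why the agent pick needs to offer it.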
Yes, please allow selecting NULL, and yes, please remove issued by from the NKs that have it, thanks. We still need to reach out to collections with broken links to give them the opportunity to make corrections, through a targeted email.
Next release:
remove issued by from the NKs temp_nk_ib.csv UPDATE 4
Problem see #8054
Is your feature request related to a problem? Please describe.
See https://github.com/ArctosDB/arctos/issues/7808#issuecomment-2139730876
"Collection agents" (those who wear a collectionID) are issuing all sorts of identifiers. Most of this does not seem realistic to me. Probably other agents are also being recorded as issuing identifiers which they did not actually issue.
Describe what you're trying to accomplish
Describe the solution you'd like
Some mechanism to control what identifiers certain agents may issue.
Describe alternatives you've considered
Do nothing.
Additional context
I can dig around in the data if there's any interest in proceeding with this.
Priority
Not clear; preventing future tangled messes seems important to me, this has caused some problems already (https://github.com/ArctosDB/arctos/issues/7025), but both actionable (eg URLs) and local (eg, NK) identifiers can mostly function without this information so filtering is possibly of relatively little value.