Open dustymc opened 3 months ago
There is a bug (squashed for next release) on that pathway, but I can't make it skip the duplicate notification. There may also be some complications when the duplicate has no name components, just a preferred name - allowing really low-quality data definitely has some not-great influence on future actions. Anyway, things should be slightly better in the near future, details on whatever led to any sort of problem are always appreciated, thanks!
need something a little more automated
It was a different system, this isn't a "no," but: way back when we did that, it turned out to mostly be a very useful way to make problems immortal. Somehow I think we'd need a 'careful person saving only the good stuff' filter in there, I'm not sure how we might do that.
a 'careful person saving only the good stuff' filter in there, I'm not sure how we might do that.
A download of attributes from the bad duplicate that could be uploaded to the good would be better than a ton of copy-pasta.
@AJLinn any idea why this happened?
Those were created 2 minutes apart from one another. If I remember right, I might have clicked out of the agent creation pop-up to look at something else relating to the agent name, and the pop-up window closed but must have saved without hitting create agent. So I went back in and created the agent again.
https://docs.google.com/spreadsheets/d/1sosC-w8xHpyXD0g_x1n2-ub37jEe_CK39cj-mtic9dk/edit#gid=384759468 is a spreadsheet of agents who share first and last name. The temp_agent_share_firstlast_mc tab (multiple character names only) is probably sufficiently overwhelming to decide if we'd like to do anything about any of this or not. There are definitely a few agents I picked up in a quickish skim which suggest https://github.com/ArctosDB/arctos/issues/7649#issuecomment-2102998471 (eg we are failing both operators and contributors with our lack of training-or-something).
One of these has been marked as a duplicate, but not by the creator. Disallowing shared email addresses would address some of this:
https://arctos.database.museum/agent/21352394 https://arctos.database.museum/agent/21352392 https://arctos.database.museum/agent/21352393
Maybe John just really dislikes his last name, but it still led to a duplicate.
https://arctos.database.museum/agent/21352170 https://arctos.database.museum/agent/21352628
??????????? maybe something about the search UI ????????????????
https://arctos.database.museum/agent/21352600 https://arctos.database.museum/agent/21317611
These have the same name, same email address, created by the same person, and both have operator accounts! Even if we do nothing in the name of proper attribution, we should find some way to avoid this situation as a matter of security. And another case where the not-actually-emails are causing tangible problems.
https://arctos.database.museum/agent/21339858 https://arctos.database.museum/agent/21339906
2 minutes apart
I think that was probably related to https://github.com/ArctosDB/arctos/issues/7738. I just clicked as fast as I can and got the expected warning.
@Jegelewicz
I think some of these may be your students?
Yes, Elena is working with us at UAM:Herb. She's doing a super job doing detective work on Russian collectors, of which we have hundreds of names with little or no metadata. But some dups may slip through - I'll check.
@dustymc
here are a lot of agents being created with various could-be-important data stuffed into remarks.
Where else should these comments go? If all we know is that a name was a collector in a place at a time there are no additional agent relationships that can be made, but having some info in the remarks can help point another user in the right direction.
Remarks are expected to contain "I'm not sure how to spell 'pumpkin'" and "agent known to like tatertots" and EVERYTHING else - except the stuff which has typed fields, such as places and dates.
"There, then, doing that" in remarks is useful - to the maybe 1% of people who read remarks and are able to successfully figure out what it means.....
I messed with https://arctos.database.museum/agent/21352177 (and maybe made some bad assumptions, please review). Now that record contains....
Same information (assuming I didn't muck it up or make bad assumptions), organized so that it's MUCH more useful for all sorts of things, including finding those maybe-inevitable duplicates (eg "find ALA agents reported as doing stuff in 1984" just became possible).
Thanks @dustymc - that makes sense, and of course I agree that entering data into dedicated fields is always better than loose text. But... it's also a leap to say that the correspondence address of a person who collected in Russia is "..., ..., Russia". Maybe a new field 'active in' would be good. Or... better, an auto-generated list of countries and dates of records that the agent is associated with in Arctos.
Back to the larger, old issue: should we make agents at all? I reread the handbook and it's very clear that it's often better not to create agents at all. But... as @DerekSikes and others point out, verbatim agents don't play well with reports, labels. A single agent model (real or verbatim) is just easier to deal with and we're trying to use 'real' agents. That said, not making agents would speed up our transcription process hugely - I reckon ~50% of my own time spent on bulkloading data entered by assistants is spent on reconciling agents - if I pushed everything into verbatim agents it would be a doddle. I recently got Elena to start researching Russian agents, and she's doing great, but we simply don't have much info on the majority of names.
Perhaps a community-wide event is needed, as suggested above?
Maybe a crazy idea but how about: all agents in catalog records are verbatim; all agents in the agent table are real; if they match 100% then a relationship exists, if they do not, then it doesn't (but nothing bad happens, that is, no regulations against having v-agents in catalog record with no real agent matching).
All reports from catalog records would use the verbatim agent field of the catalog record.
correspondence address
Yea, there's an issue somewhere, I lost the argument that we need some less-addressey-address-thing, feel free to start it again, I'll agree with you!
auto-generated list of countries and dates of records that the agent is associated with in Arctos
That very nearly always has negative value.
https://github.com/ArctosDB/arctos/issues/7796 - pulling live data in whatever form - is of course fine, anyone can go check it in context.
If that's what you're doing then remarks is probably the best mechanism, method would be very useful, and my "interpretation" likely vastly overplays the available hand.
not making agents would speed up our transcription process hugely
My position (which led to heavy verbatimization, which then somehow led here) has not changed: Make agents if they DO STUFF for you, don't if they don't. If you know "John Doe" then you're losing nothing by using verbatim, it can easily carry all you have. If you know it's that John Doe then you need an agent-object to carry that information.
match 100%
Multiple people named John Doe have existed.
verbatim agents don't play well with
Not sure I buy that, there were no actionable requests to remap or such.
reports
I'm always happy to help with them, they can use whatever you want them to use.
A single agent model (real or verbatim) is just easier to deal with
That is at least the point that made sense to me, and clearly some agents do have information that verbatim can't carry (shipping addresses, ORCIDs, etc.) so here we are. I'd still use verbatim if I had "verbatim-level" data, but I'm not going to push anyone in that direction very enthusiastically either (pending guidance from The Community here, of course).
~50% of my own time spent on bulkloading data entered by assistants is spent on reconciling agents
And I reckon that's probably not very productive, because you're probably dealing with out-of-context strings. Much of the idea of verbatim was to delay that investigation until AFTER entry, when you have the context to notice (and can request tools to help you notice) the two John Does spend a lot of time in the same places, or have a huge temporal gap, or WHATEVER thing that's not generally available from the string "John Doe." Don't think there's anything hindering that right now, but I'm also not sure what level of resources I could devote to helping.
community-wide event
I'm begging for guidance here, yes please. If we want to set some quality standards then I can probably help with tools, if we don't then I've got plenty of other things to do!
verbatim agents don't play well with
Not sure I buy that, there were no actionable requests to remap or such.
Is there an existing SQL function (that you made) to concatenate agents and verbatim agents? ... for reports and labels?
Concatenation may be sufficient for many uses, but necessarily has to lose data about the order of collectors. If a specimen had three collectors: A, B and C (in that order) and its record has agents A and C, and verbatim agent B, then there is no way (other than remarks) to indicate the correct order of collectors - any concatenation will give A, C, and B.
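A minimal sketch of the ordering problem described above, for anyone who wants to see it concretely. The `Collector` class and its `position` field are hypothetical, made up for illustration; nothing here reflects the actual Arctos schema or any existing report function.

```python
from dataclasses import dataclass


@dataclass
class Collector:
    name: str
    verbatim: bool  # True for a verbatim agent, False for a "real" agent
    position: int   # hypothetical shared ordering field, not an Arctos column


def concatenated(collectors):
    """Join agents first, then verbatim agents (the order across groups is lost)."""
    agents = [c.name for c in collectors if not c.verbatim]
    verbatims = [c.name for c in collectors if c.verbatim]
    return ", ".join(agents + verbatims)


def ordered(collectors):
    """Join all collectors by the shared position field, regardless of type."""
    return ", ".join(c.name for c in sorted(collectors, key=lambda c: c.position))


specimen = [
    Collector("A", verbatim=False, position=1),
    Collector("B", verbatim=True, position=2),
    Collector("C", verbatim=False, position=3),
]

print(concatenated(specimen))  # A, C, B
print(ordered(specimen))       # A, B, C
```

The point is only that the original order can be recovered if both kinds of collector share one ordering value; whether such a value could exist in practice is a separate question.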
I think we'll just push on with creating true agents, trying hard not to create duplicates or assign the wrong agent. It is time-consuming, but should create better overall information.
SQL function
There's one in https://arctos.database.museum/Reports/reporter.cfm?action=edit&report_id=85, lots of possibilities....
order of collectors
"Bugs Bunny and Elmer Fudd" is a perfectly cromulent verbatim collector....
agents A and C, and verbatim agent B,
If you know A and C then you can probably figure out B (even if it's just that they were some ephemeral being who probably doesn't have field notes), but sure, there are innumerable fringe cases where strings start having trouble carrying the load.
push on with creating true agents
Nobody seems to be suggesting otherwise here, seems reasonable.
should create better overall information.
I don't think that's the trend, but there are definitely defensible reasons to do that so rock on!
??
@wellerjes see the comment above. Can you help us figure out why this happens?
What I think happened - volunteer could not find "Davidson Brothers Marble Co." because it was not an AKA of the original "Davidson Brothers Marble Company". I was reviewing her work and updated the "Co." to Company, then added the AKA without realizing that there was already an agent named that. I'll remind our volunteers working on agents to try different searches before creating an agent.
I think this is what's happening with duplicate agents--if someone doesn't search with a % or searches "first name+last name" when the agent is only entered as "first name+middle name+last name" (with no AKAs) then they're not finding the correct agent. I've done this before. If the agent's name appears differently throughout the records (J. Weller vs. Jessica Weller vs. J. L. Weller could all be me) it's not always obvious to someone that the agent is the same person, which is why they might ignore the big red box that says "this might be a duplicate"
I'm just not sure how to change this behavior. If people are only going to search one thing and give up, this will keep happening.
Can we somehow make the search less strict and find near matches?
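For what it's worth, here is a rough sketch of what "near matches" could look like, using Python's standard `difflib`. The `near_matches` function and the `cutoff` value are assumptions for illustration only, not how the Arctos agent search works.

```python
import difflib


def near_matches(query, names, cutoff=0.8):
    """Return names whose similarity ratio to the query is at least `cutoff`."""
    # Compare case-insensitively, but return the names as originally spelled.
    by_folded = {name.casefold(): name for name in names}
    close = difflib.get_close_matches(
        query.casefold(), list(by_folded), n=5, cutoff=cutoff
    )
    return [by_folded[c] for c in close]


existing = ["Davidson Brothers Marble Company", "Judy Miller"]
print(near_matches("Davidson Brothers Marble Co.", existing))
# ['Davidson Brothers Marble Company']
```

Lowering `cutoff` returns more candidates and raising it returns fewer, so this kind of search trades missed agents for longer result lists.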
somehow make the search less strict and find near matches
There's a whole thread of me saying that would lead here and everyone insisting that they were getting too many matches somewhere....
There's a whole thread of me saying that would lead here and everyone insisting that they were getting too many matches somewhere....
Also fair because when there are too many, adds just get made. I don't think we can stop humans from being human, we can just keep asking everyone to try harder.
when there are too many, adds just get made
Yea, there's a whole 'nuther thread of me saying that low-quality data inspires low-quality data....
https://arctos.database.museum/agent.cfm?srch=Davidson%20Brothers%20Marble%20Co.&include_verbatim=false&include_bad_dup=true - "This is the search you're looking for." does not have the problem described, at least as I understand it. I don't know if that's a UI problem (something I might address) or a documentation/training problem (something The Community might address), or something else entirely.
There is some relevant documentation regarding "J. Weller vs. Jessica Weller vs. J. L. Weller":
A generic search, such as only a last name is preferred. This form is searching Agent Preferred Names, so a search for John Smith will not return the agent John H. Smith, but a search for Smith will return both.
https://handbook.arctosdb.org/how_to/How-to-Search-Agents.html
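A small sketch of the behavior that handbook passage describes, assuming (and this is only an assumption) that the form does a case-insensitive substring match on preferred names. The `token_search` alternative is hypothetical and shown only for comparison.

```python
def substring_search(term, names):
    """Match names that contain the whole search term as one contiguous string."""
    folded = term.casefold()
    return [n for n in names if folded in n.casefold()]


def token_search(term, names):
    """Match names that contain every word of the search term, in any position."""
    wanted = term.casefold().split()
    hits = []
    for name in names:
        # Treat periods as separators so "H." becomes the token "h".
        tokens = name.casefold().replace(".", " ").split()
        if all(word in tokens for word in wanted):
            hits.append(name)
    return hits


names = ["John Smith", "John H. Smith"]
print(substring_search("John Smith", names))  # ['John Smith']
print(substring_search("Smith", names))       # ['John Smith', 'John H. Smith']
print(token_search("John Smith", names))      # ['John Smith', 'John H. Smith']
```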
or a documentation/training problem (something The Community might address)
I think we have addressed it - the question is does anybody read or use documentation?
https://handbook.arctosdb.org/how_to/How-to-Create-Agents.html#before-creating-a-new-agent
does anybody read or use documentation?
That's the part we haven't addressed, training. Arctos is very hippy-commune-ish about how roles are handed out, maybe we've outgrown that. I'm not sure what exactly the alternative might be, but lots of things require some sort of training/testing/whatever and there must be thousands of models we could explore.
but a search for Smith will return both
A search for Smith will give you "CAUTION: Return limit exceeded, some data may be excluded. Please perform a more specific search to ensure accurate results."
I ran into this the other day while searching for my volunteer Judy Miller. Searching under Agent name for "Miller", I just about added her again, but was stopped when the agent creator found the agent I was looking for.
Yea, that's the other juggle-ball: I've got limited resources, I often don't have the capacity to send everything even when you might not get overwhelmed by it. Some of that's potentially fixable - eg do I really need to be including [all of whatever I'm currently including] in the 'anything' search, is there a better sort that might get "us" (unverified us - I'm already sorting by that) closer to the top, etc., etc.?
Duplicates:
 agent_id | agent_type | preferred_agent_name | creator                  | created_date
----------+------------+----------------------+--------------------------+----------------------------
 21346039 | person     | David C. Evans       | Joseph Hopkins           | 2022-10-03 08:42:01.420229
 21258378 | person     | David C. Evans       | unknown                  | 2013-12-16 21:49:31
 21334283 | person     | J. O. Sullivan       | Teresa J. Mayfield-Meyer | 2021-09-07 16:29:24.897153
     7604 | person     | J. O'Sullivan        | unknown                  | 2013-12-16 21:49:31
  1017329 | person     | LaRue                | unknown                  | 2013-12-16 21:49:31
 21253481 | person     | La Rue               | unknown                  | 2013-12-16 21:49:31
  1011480 | person     | L. VanHorn           | unknown                  | 2013-12-16 21:49:31
  1010287 | person     | L. Van Horn          | unknown                  | 2013-12-16 21:49:31
 21256873 | person     | Mary O'Donnel        | unknown                  | 2013-12-16 21:49:31
 21253957 | person     | Mary O’Donnel        | unknown                  | 2013-12-16 21:49:31
 21257004 | person     | Röner                | unknown                  | 2013-12-16 21:49:31
 21258795 | person     | Rößner               | unknown                  | 2013-12-16 21:49:31
 21352433 | person     | Tom Rickman          | Derek S. Sikes           | 2024-04-25 12:36:54.346303
 21352432 | person     | Tom Rickman          | Jozef A. Slowik          | 2024-04-25 12:09:25.013787
 21352431 | person     | Tom Rickman          | Jozef A. Slowik          | 2024-04-25 12:06:48.200331
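Several of the pairs above differ only in spacing, punctuation, or apostrophe style. Here is a rough sketch of one way such pairs could be flagged, by collapsing each preferred name to a crude comparison key. The `name_key` rule is an assumption made up for this example, not the script that produced the list, and it deliberately does not catch pairs like Röner / Rößner, which differ in actual letters.

```python
import unicodedata
from collections import defaultdict


def name_key(name):
    """Collapse a name to lowercase letters and digits only (hypothetical rule)."""
    # NFKD splits accented letters into a base letter plus a combining mark.
    decomposed = unicodedata.normalize("NFKD", name.casefold())
    # Dropping non-alphanumerics removes spaces, periods, apostrophes, and marks.
    return "".join(ch for ch in decomposed if ch.isalnum())


def group_candidates(agents):
    """Group agent_ids whose names share a key; keep only groups of two or more."""
    groups = defaultdict(list)
    for agent_id, name in agents:
        groups[name_key(name)].append(agent_id)
    return {key: ids for key, ids in groups.items() if len(ids) > 1}


sample = [
    (21334283, "J. O. Sullivan"),
    (7604, "J. O'Sullivan"),
    (1017329, "LaRue"),
    (21253481, "La Rue"),
    (21257004, "Röner"),
    (21258795, "Rößner"),
]

print(group_candidates(sample))
# {'josullivan': [21334283, 7604], 'larue': [1017329, 21253481]}
```

A shared key only means "worth a human look"; as the discussion of J. O'Sullivan shows, it cannot decide whether two records are the same person.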
Information only in remarks:
agent_id | preferred_agent_name | creator | created_date | attribute_value
----------+----------------------+---------------------+----------------------------+----------------------------------------------------------------------------
21352952 | Izak Veals | Paige Wilson Deibel | 2024-06-26 13:28:50.081707 | student at University of Washington, employee of Burke Museum
21352951 | Christina Stuhl | Paige Wilson Deibel | 2024-06-26 13:27:37.568765 | student at University of Washington, volunteer in Burke Museum Paleobotany
21352950 | Ray Cagnetta | Paige Wilson Deibel | 2024-06-26 13:25:58.536078 | employee of Burke Museum, museology student at University of Washington
21352949 | Ana Gutierrez | Paige Wilson Deibel | 2024-06-26 13:23:06.542398 | volunteer for Burke Museum Paleobotany
21352948 | Amanda Godfrey | Paige Wilson Deibel | 2024-06-26 13:22:10.115826 | paleobotany volunteer at Burke Museum
21352947 | Elena Stiles | Paige Wilson Deibel | 2024-06-26 12:44:56.034333 | paleobotanist, PhD student at University of Washington
21352916 | Lulu Gaustad | Angela Linn | 2024-06-21 16:41:38.110722 | UAM Ethnology and History
21352914 | Margen Burke Riley | Angela Linn | 2024-06-21 12:38:46.148758 | UAM Ethnology and History
21352897 | David M. Evans | Michelle S. Koo | 2024-06-17 20:56:55.810041 | associated with University of Wyoming in 1970s
21352897 | David M. Evans | Michelle S. Koo | 2024-06-17 20:56:55.810041 | UWYMV collector active in the 1970s
21352864 | Judith Price | Mariel L. Campbell | 2024-06-07 16:36:57.565727 | CMN
Duplicates
As for
and don't forget
It is hard for me to say if these are one person, two people, or three without more information from the collections.
It does feel like J. O'Sullivan is just a mis-transcription of John O. Sullivan but I have no definitive proof. The collecting locations differ for J. O. Sullivan and John O. Sullivan so again, I think I would need more information to decide if they are the same person. Maybe @mkoo can figure it out with whatever they have in the MVZ:Arch collection?
Re: Tom Rickman - here's what Slowik emailed me back on Apr 25: "Additionally, when I try to enter the collector, Tom Rickman, I tried to create the person as before but it errors out no matter what. "
and "Tom is alive. It's really quirky. If I try to add any additional info on him then it just errors out. If I don't enter any of the fields I get the option to force create and then it errors out. "
and I replied: "Ok, I made an agent record for Tom Rickman. I also was presented with some arctos weirdness and asked to force create which I did and it worked. Arctos agents is being re-tooled so there's all sorts of buggy behavior I hope they iron out fast!"
And then: "Well I got the boxes to turn green but the errors still exist anywhere I put Tom's name. Ideas?
2024-4-25T10:45:30: FAIL: agent_1_name [ Tom Rickman ] is invalid; record_event_determiner [ Tom Rickman ] matches 0 agents; locality_attribute_1_determiner [ Tom Rickman ] matches 0 agents: {"message":"agent_1_name [ Tom Rickman ] is invalid; record_event_determiner [ Tom Rickman ] matches 0 agents; locality_attribute_1_determiner [ Tom Rickman ] matches 0 agents","status":"fail"}
So not user error. Just users trying to get Arctos to behave!
So not user error. Just users trying to get Arctos to behave!
That error exists because there is more than one Tom Rickman and Arctos doesn't know which one to choose.
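A sketch of that "exactly one match" idea, assuming a loader that resolves an agent by exact name. `resolve_agent` is a hypothetical function, not the Arctos bulkloader, and it does not reproduce the "matches 0 agents" wording from the log above.

```python
def resolve_agent(name, agents):
    """Return the agent_id for `name` only when exactly one agent carries it."""
    matches = [agent_id for agent_id, agent_name in agents if agent_name == name]
    if len(matches) != 1:
        # Zero matches and multiple matches are both unusable for a loader.
        raise LookupError(f"[ {name} ] matches {len(matches)} agents")
    return matches[0]


agents = [
    (21352431, "Tom Rickman"),
    (21352432, "Tom Rickman"),
    (21352433, "Tom Rickman"),
    (21352897, "David M. Evans"),
]

print(resolve_agent("David M. Evans", agents))  # 21352897

try:
    resolve_agent("Tom Rickman", agents)
except LookupError as err:
    print(err)  # [ Tom Rickman ] matches 3 agents
```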
That might explain the later error but not the former before the agent was made (during the process of trying to make the first one)
Information only in remarks:
Can someone explain to me what this comment is about in terms of the issue of "agent guardrails" - I just added some additional information in both of those records but even prior to that they both had relationships with other established agents.
explain
Bad timing, I was on the wrong server, I fired off the wrong script, my script is broken (please let me know if so)..... who knows, if the good stuff isn't solely in remarks then yay everybody.
And my primary purpose here is still https://github.com/ArctosDB/arctos/issues/7649#issue-2232072678, I'm just gathering some examples, seeing what might be possible, what The Community would like (and if we can figure out how to do that), etc. - if I'm questioning something that you think is OK, PLEASE let me know that too.
Why? https://github.com/ArctosDB/arctos/issues/7894, immediately. There's some data in there that I think possibly shouldn't be loaded (but HOW?), maybe it's fine, maybe my standards are weird, maybe I'm not being paranoid enough, who knows, none of those are decisions that any of us want to make alone, HELP! (And it's all complicated by a bunch of us simultaneously experiencing personal issues, we're not ignoring you @javanveldhuizen!)
Here's fresh data with something in remarks, no relationships, no events, created in the last month.
agent_id | preferred_agent_name | creator | created_date | attribute_value
----------+----------------------+---------------------+----------------------------+----------------------------------------------------------------------------
21352952 | Izak Veals | Paige Wilson Deibel | 2024-06-26 13:28:50.081707 | student at University of Washington, employee of Burke Museum
21352951 | Christina Stuhl | Paige Wilson Deibel | 2024-06-26 13:27:37.568765 | student at University of Washington, volunteer in Burke Museum Paleobotany
21352950 | Ray Cagnetta | Paige Wilson Deibel | 2024-06-26 13:25:58.536078 | employee of Burke Museum, museology student at University of Washington
21352949 | Ana Gutierrez | Paige Wilson Deibel | 2024-06-26 13:23:06.542398 | volunteer for Burke Museum Paleobotany
21352948 | Amanda Godfrey | Paige Wilson Deibel | 2024-06-26 13:22:10.115826 | paleobotany volunteer at Burke Museum
21352947 | Elena Stiles | Paige Wilson Deibel | 2024-06-26 12:44:56.034333 | paleobotanist, PhD student at University of Washington
21352897 | David M. Evans | Michelle S. Koo | 2024-06-17 20:56:55.810041 | associated with University of Wyoming in 1970s
21352897 | David M. Evans | Michelle S. Koo | 2024-06-17 20:56:55.810041 | UWYMV collector active in the 1970s
21352864 | Judith Price | Mariel L. Campbell | 2024-06-07 16:36:57.565727 | CMN
(9 rows)
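To make the selection criteria explicit, here is a minimal sketch of the filter described above (something in remarks, no relationships, no events, created recently). The `Agent` class, its fields, and the sample records are all hypothetical; this is not the script that produced the table.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Agent:
    agent_id: int
    name: str
    created: datetime
    remarks: list = field(default_factory=list)
    relationships: list = field(default_factory=list)
    events: list = field(default_factory=list)


def remarks_only(agents, now, days=30):
    """Recently created agents whose only descriptive data is in remarks."""
    cutoff = now - timedelta(days=days)
    return [
        a for a in agents
        if a.remarks and not a.relationships and not a.events and a.created >= cutoff
    ]


# Invented records, not real Arctos agents.
sample = [
    Agent(1, "Example Agent A", datetime(2024, 6, 20), remarks=["volunteer"]),
    Agent(2, "Example Agent B", datetime(2024, 6, 21), remarks=["student"],
          relationships=["employee of Example Museum"]),
    Agent(3, "Example Agent C", datetime(2013, 12, 16), remarks=["collector"]),
]

hits = remarks_only(sample, now=datetime(2024, 6, 27))
print([a.agent_id for a in hits])  # [1]
```

If a script like this runs before relationships are saved, or against a different server, it would list agents that do have other data, which may be what happened with some rows here.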
@dustymc script still not working?
See https://github.com/ArctosDB/arctos/issues/7649#issuecomment-2195180711
Izak Veals definitely has stuff other than remarks.
working
I'm trying to figure out what that means! I added relationships, data below. I think I was trying to avoid derived data, the take-home (if The Community wants to consider this in any way) is that I probably can't exclude low-value relationships. I can't see much way to separate "they actually hang around here, we know this person" and "a vaguely similar name is scribbled on something that once passed through here for some reason." (So https://github.com/ArctosDB/arctos/issues/7649#issuecomment-2163522019 still looks worth investigation.)
 agent_id | preferred_agent_name | creator            | created_date               | attribute_value
----------+----------------------+--------------------+----------------------------+------------------------------------------------
 21352897 | David M. Evans       | Michelle S. Koo    | 2024-06-17 20:56:55.810041 | UWYMV collector active in the 1970s
 21352897 | David M. Evans       | Michelle S. Koo    | 2024-06-17 20:56:55.810041 | associated with University of Wyoming in 1970s
 21352864 | Judith Price         | Mariel L. Campbell | 2024-06-07 16:36:57.565727 | CMN
(3 rows)
Do we need rules or guidance around agents?
I've noticed some not-great agent data being created, I don't know if @ArctosDB/agents-committee would care to attempt to establish any guardrails or if this is fine or ?? Please advise, or close if nobody cares.
Possible Actions
manage_agents access
Examples and ponderings and such follow
Here are agents with nonunique preferred name:
and recent person-agents - many of which are clearly not persons - without a first or last name: