ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum

Agents - disambiguation of duplicate agents and workflow of data migration - workflow needs help #6114

Closed Jegelewicz closed 8 months ago

Jegelewicz commented 1 year ago

An incoming collection provided me with a list of people/organization names in use in their current data. There were 921 agents. After some eyeballing and review, I combined some duplicates and narrowed down the list to 794. A lot of parsing and adding periods later, I had a file that included all of the preferred, first, middle, last and for some, akas, which I ran through the Agent prebulkloader. The results are there if you are brave enough to look.

I can have the incoming collection go through this list as is, or I can try to help a bit. This can be a completely overwhelming task: for a new collection because, well, it is a lot, and for me because I do not know the collection, so it is difficult for me to make assumptions about whether their Benjamin M. Fitzpatrick is the same person as Ben M. Fitzpatrick or Ben Fitzpatrick (and never mind that those two Arctos agents might be the same person).

image

Multiply this decision tree by about 100. Of the names I ran through, 534 are NOT in Arctos and have no close matches, which means they will be verbatim agents unless the incoming collection can provide one piece of identifying information (which they will also have to go find). Another 110 have an exact match in Arctos, but are we SURE that they are the same person? How hard should we look into that?

The results of this process when downloaded from the tool cannot be processed easily as there are line breaks in the status column - ideally, each piece of "advice" would be separated by something I could use to parse them in Excel - something that isn't used in any of the names or "advice" text. @dustymc can we do something different there? I find that reviewing them in the tool is potentially easier - but very time consuming and not easy to do in a modular way. The only way to find all of the [fatal]|nocase preferred name match: quickly is to download the results and use the FIND feature in Excel or whatever tool you choose to review csv. Otherwise, you see this buried somewhere in the status.

Here is the status column for Chapman in the Prebulkloader

[advisory]|Do not create unnecessary variations of unknown: preferred name is one word|{}
[advisory]|nodots-nospaces match on agent name: Frank L. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=1015714}
[advisory]|nodots-nospaces match on agent name: Frank M. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21250708}
[advisory]|nodots-nospaces match on agent name: Anne Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21251652}
[advisory]|nodots-nospaces match on agent name: Sherry P. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21251907}
[advisory]|nodots-nospaces match on agent name: C. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21265068}
[advisory]|nodots-nospaces match on agent name: Sydney Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21280429}
[advisory]|nodots-nospaces match on agent name: Henry Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21280982}
[advisory]|nodots-nospaces match on agent name: Tyler R. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21282652}
[advisory]|nodots-nospaces match on agent name: E. M. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21283913}
[advisory]|nodots-nospaces match on agent name: C. Grant|{https://arctos.database.museum/agents.cfm?agent_id=21285910}
[advisory]|nodots-nospaces match on agent name: K. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21294072}
[advisory]|nodots-nospaces match on agent name: Frank C. Bellrose|{https://arctos.database.museum/agents.cfm?agent_id=21297193}
[advisory]|nodots-nospaces match on agent name: R. W. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21298328}
[advisory]|nodots-nospaces match on agent name: B. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21302400}
[advisory]|nodots-nospaces match on agent name: Arthur O. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21306686}
[advisory]|nodots-nospaces match on agent name: Chapman Grant|{https://arctos.database.museum/agents.cfm?agent_id=21321193}
[advisory]|nodots-nospaces match on agent name: Dwight Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21321216}
[advisory]|nodots-nospaces match on agent name: John Chapman Frye|{https://arctos.database.museum/agents.cfm?agent_id=21325592}
[advisory]|nodots-nospaces match on agent name: Brian R. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21326631}
[advisory]|nodots-nospaces match on agent name: Quinton T. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21327742}
[advisory]|nodots-nospaces match on agent name: Olivia S. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21332306}
[advisory]|nodots-nospaces match on agent name: Bethany G. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21332456}
[advisory]|nodots-nospaces match on agent name: Destiny R. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21332477}
[advisory]|nodots-nospaces match on agent name: Erik W. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21341127}
[advisory]|nodots-nospaces match on agent name: H. B. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21347045}
[advisory]|nodots-nospaces match on agent name: Mrs. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21347070}
[advisory]|nodots-nospaces match on agent name: Richard C. Chapman|{https://arctos.database.museum/agents.cfm?agent_id=5304}
[advisory]|nodots-nospaces match on agent name: Chapman|{https://arctos.database.museum/agents.cfm?agent_id=605}
[fatal]|nocase preferred name match: Chapman|{https://arctos.database.museum/agents.cfm?agent_id=605},At least one address, status, or relationship is required

Note that it is only at the very bottom of this extensive list of possible matches that I find [fatal]|nocase preferred name match: Chapman|{https://arctos.database.museum/agents.cfm?agent_id=605}
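As a rough illustration of the parsing problem described above, here is a minimal Python sketch that splits a downloaded status value into separate pieces of advice and moves any fatal entries to the top. It assumes every entry has the `[severity]|message|{url}` shape seen in the Chapman sample; `Advice` and `parse_status` are hypothetical names for this example, not anything Arctos provides, and trailing free text after the last `}` is simply ignored.

```python
import re
from typing import NamedTuple


class Advice(NamedTuple):
    severity: str  # e.g. "advisory" or "fatal"
    message: str
    url: str       # empty string when the braces are empty


# Assumed entry shape, taken from the sample status: [severity]|message|{url}
ENTRY = re.compile(r"\[(\w+)\]\|(.*?)\|\{(.*?)\}", re.DOTALL)


def parse_status(status: str) -> list[Advice]:
    """Split one status value into entries, with fatal entries listed first."""
    entries = [
        # Collapse embedded line breaks and repeated spaces inside a message.
        Advice(severity, " ".join(message.split()), url)
        for severity, message, url in ENTRY.findall(status)
    ]
    # Stable sort: fatal entries move to the top, original order kept otherwise.
    entries.sort(key=lambda advice: advice.severity != "fatal")
    return entries


sample = (
    "[advisory]|nodots-nospaces match on agent name: Arthur O. \n"
    "Chapman|{https://arctos.database.museum/agents.cfm?agent_id=21306686} "
    "[advisory]|nodots-nospaces match on agent name: Chapman|"
    "{https://arctos.database.museum/agents.cfm?agent_id=605} "
    "[fatal]|nocase preferred name match: Chapman|"
    "{https://arctos.database.museum/agents.cfm?agent_id=605}"
)

for advice in parse_status(sample):
    print(advice.severity, "|", advice.message, "|", advice.url)
```

Writing each parsed entry to its own row (or its own column) would make the result filterable in Excel without relying on the line breaks in the status column.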

Placing the burden of cleaning up agents on incoming collections seems a bit unfair. As long as we continue to have agents like Chapman, this is going to be a difficult process. Any chance we can take the next step in removing low quality agents and verbatimize ALL of those that don't have any identifying information? Any other ideas for making this better/easier?

Help!

campmlc commented 1 year ago

Good grief. I support doing whatever necessary to make this process functional, starting with changes to the way status column links are displayed.

dustymc commented 1 year ago

Three(??) items here, I think.

  1. The big revelation for me has been that this is the worst possible time to create Agents - they're out of the context of the data they can make sense in! How could you know how many Ben [M.] Fitzpatricks are involved (at least without going off and digging through some excel-or-something-probably)?? Pre-creating Agents should be minimized as much as possible, whatever that means. Any agents who are to be created should be accompanied by decent data.

  2. Arctos cleanup has not progressed to the point where really cool things look easy, and seems to have stalled. Ben should be marked for merge. One person probably wasn't murdering turtles in 1950 (from Arctos) and then getting a doctorate in 2004 (from ORCID) so they're probably not the same (and the DMNS kingfisher-killer is probably yet another person @acdoll ) so this might not be completely trivial, but we should have some way to either attach more data to Ben or to purge him to verbatim agent land. Not a clue how we do that, but this is going to be difficult until we do. One component is that we are still creating 'just above the bar' Agents - way too many people who do not seem to understand the goals have create permissions. Seems like a job for @ArctosDB/agents-committee to me....

  3. Data format: There's an issue somewhere, I'm doing horrible things to the CSV by request, but hopefully the above means that doesn't really matter....

Jegelewicz commented 1 year ago

worst possible time to create Agents

That may be true - but agents are required for Accessions which are required for catalog records. It seems like twice the work to process only Accession agents then do some more when you need agents for identifications and other determinations. People are doing everything in Arctos and we really only have a consistent path for "verbatimizing" collectors and preparators.

DerekSikes commented 1 year ago

My inclination is to allow duplicate agents and disambiguate them using a code but also tag them as 'imported from UAIT collection on 2023-04-11' so if there are 12 different James Vincents each has some info separating them & each collection can find their OWN James Vincent and if anyone cares to do the research and discovers that 3 of the 12 are the same they can be merged later, or not, but work proceeds. Just my 2 cents.

mkoo commented 1 year ago

I agree with @DerekSikes. We need progress, so allow the dups and we can figure it out later (if at all; it might take a while). And more importantly, we can develop tools to deal with disambiguation, but we still need the data in Arctos to do it.

Initial collection creation and migration is not necessarily the best time to handle deep agent cleaning. (ok, please ignore how that sounds)

dustymc commented 1 year ago

Yes, plenty of details to work out.

Yes, there are efficiencies in having all 96 forms of some name in one file and dealing with them one time. I suspect having the context of the data available is a (much) bigger efficiency, but ??? I'm pretty confident that 'H. P. H.' isn't useful/resolvable at this point, beyond that who knows.....

One sorta-obvious improvement would be to have the checker do more with the relationships and such. I think those are completely ignored (other than needing to exist at some point), they should DO STUFF. (And maybe that's a good point to decide if this can be supported by the component loader environment or needs something more specialized.)

allow duplicate agents

That's basically what verbatim agent does (and maybe that idea needs extended in some way).

12 different James Vincents

If the agent data is garbage, then it's garbage for everyone and that has hard functional implications. Maybe there is some "second-class agent" structure-or-something, but if so it's something way beyond mixing low-quality junk in with the stuff that we've invested so much time in cleaning. So yes, Arctos should absolutely support 12 James Vincents - as long as they're all disambiguated by the data they carry and a user selecting one won't have any trouble figuring out which one is correct.

Jegelewicz commented 1 year ago

I agree with @dustymc. I don't want to go backward; I'd like to proceed with cleaning up Agent messes so that this ISN'T so difficult. We still have A TON of very low-quality agents and they are a large part of what makes this difficult. Perhaps we can at least start with a little tweaking of the responses from the tool, or maybe we just need a coarse first pass. To start, I'd like to put the list of incoming names into three categories:

  1. Has an exact match in Arctos
  2. Has NO match in Arctos
  3. Has some potential matches in Arctos

Then I can take those three bunches and review them appropriately

Are the exact matches the incoming collection's agent?

Are the no matches worth creating an agent for?

Potential alternates: these require review, name by name; there really isn't a better path right now. A report that summarized activity dates and collection types would be useful. Do you find a match?
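The coarse first pass described above could be sketched roughly like this in Python. It is only an illustration: the matching rules here (case-insensitive equality for "exact", a dots-and-spaces-stripped comparison or a shared final name token for "potential") are assumptions made for this example and are not the prebulkloader's actual logic; `bucket_names`, `normalize`, and `last_token` are hypothetical names.

```python
def normalize(name: str) -> str:
    """Lowercase and drop dots and whitespace (an assumed reading of 'nodots-nospaces')."""
    return "".join(name.lower().replace(".", "").split())


def last_token(name: str) -> str:
    """Return the final word of a name, lowercased, or '' for an empty name."""
    tokens = name.replace(".", "").split()
    return tokens[-1].lower() if tokens else ""


def bucket_names(incoming: list[str], existing: list[str]) -> dict[str, list[str]]:
    """Sort incoming names into 'exact', 'potential', and 'none' buckets."""
    exact_keys = {name.casefold() for name in existing}
    loose_keys = {normalize(name) for name in existing}
    final_tokens = {last_token(name) for name in existing}

    buckets: dict[str, list[str]] = {"exact": [], "potential": [], "none": []}
    for name in incoming:
        if name.casefold() in exact_keys:
            buckets["exact"].append(name)
        elif normalize(name) in loose_keys or last_token(name) in final_tokens:
            buckets["potential"].append(name)
        else:
            buckets["none"].append(name)
    return buckets


# Example names are taken from this thread; the 'existing' list is illustrative only.
existing_agents = ["Ben M. Fitzpatrick", "Ben Fitzpatrick", "Chapman"]
incoming_names = ["Benjamin M. Fitzpatrick", "Chapman", "H. P. H."]
print(bucket_names(incoming_names, existing_agents))
```

Even with buckets like these, an "exact" hit only says the strings agree, so the review questions for each bucket still apply.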

We need to ensure that using verbatim agent is possible for determiners of all kinds and have people feel comfortable using that (not sure we are there right now). We also need to be able to use verbatim agents in transactions....

campmlc commented 1 year ago

This sounds like a good approach. Can we get the tool to break down the feedback into these categories?

dustymc commented 1 year ago

proceed with cleaning up Agent messes so that this ISN'T so difficult. We still have A TON of very low-quality agents and they are a large part of what makes this difficult.

Yup, and being stuck in the middle (why do we always end up here?!) is going to make creating clean difficult which makes cleanup more difficult which... positive feedback loops suck.

Can we get the tool to break down the feedback into these categories?

I suppose, but I don't think it can be meaningful (yet, I hope). You can get one exact match (because your "A." matched the existing "A." and if you use that you'll get eventually sucked into some horrid cleanup), and no name matches (because there's a typo in Arctos) and everything else you can imagine, and lots of things that nobody could see coming. Hopefully that'll all change once there's more cleanup, but as long as we've got (2) floating around, this is going to be weird.

If we ever get cleaned up, then "ORCID matches nobody cares how you spell it" and "Not that John Doe because birth dates don't match" and such become possible, and maybe that environment would support some more automation.
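To make the two rules quoted above concrete, here is a minimal Python sketch under stated assumptions: each agent record carries an optional ORCID and an optional birth year, a shared ORCID settles identity regardless of spelling, and conflicting birth years rule a match out. The `Agent` class, its field names, and `compare_agents` are hypothetical and do not reflect the Arctos data model; the ORCID value is a made-up placeholder.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    name: str
    orcid: Optional[str] = None      # hypothetical field
    birth_year: Optional[int] = None  # hypothetical field


def compare_agents(a: Agent, b: Agent) -> str:
    """Return 'same', 'different', or 'unknown' based only on carried data."""
    # "ORCID matches, nobody cares how you spell it"
    if a.orcid and b.orcid and a.orcid == b.orcid:
        return "same"
    # "Not that John Doe because birth dates don't match"
    if (
        a.birth_year is not None
        and b.birth_year is not None
        and a.birth_year != b.birth_year
    ):
        return "different"
    # Names alone are not treated as evidence either way.
    return "unknown"


placeholder_orcid = "0000-0000-0000-0000"  # made-up value for illustration
print(compare_agents(
    Agent("Ben Fitzpatrick", orcid=placeholder_orcid),
    Agent("Benjamin M. Fitzpatrick", orcid=placeholder_orcid),
))
print(compare_agents(
    Agent("John Doe", birth_year=1900),
    Agent("John Doe", birth_year=1950),
))
print(compare_agents(Agent("Chapman"), Agent("Chapman")))
```

The last case is the point of the thread: two agents with nothing but a name cannot be resolved by any rule, automated or otherwise.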

mkoo commented 1 year ago

Returning to this and specifically @dustymc summary of factors at play here:

Three(??) items here, I think.

1. The big revelation for me has been that this is the worst possible time to create Agents - they're out of the context of the data they can make sense in! How _could_ you know how many Ben [M.] Fitzpatricks are involved (at least without going off and digging through some excel-or-something-probably)?? Pre-creating Agents should be minimized as much as possible, whatever that means. Any agents who are to be created should be accompanied by decent data.

Creating agents first is a workflow that really is required by our model, since everything else requires an agent. What about a temp/pending/in-progress sort of flag/table/queue for these newly minted agents, which at first appearance look like low-data agents because we haven't assigned records, identifications, loans, etc., and have no biographical info YET?

2. Arctos cleanup has not progressed to the point where really cool things look easy, and seems to have stalled. Ben should be marked for merge. One person probably wasn't murdering turtles in 1950 (from Arctos) and then getting a doctorate in 2004 (from ORCID) so they're probably not the same (and the DMNS kingfisher-killer is probably yet another person @acdoll ) so this might not be completely trivial, but we should have some way to either attach more data to Ben or to purge him to verbatim agent land. Not a clue how we do that, but this is going to be difficult until we do. One component is that we are still creating 'just above the bar' Agents - way too many people who do not seem to understand the goals have create permissions. Seems like a job for @ArctosDB/agents-committee to me....

I'm not sure that works: a committee to work through potential merges? Seems like a job for a tool to suggest merges that can be reviewed. We need another separate tool, right? I actually think the safer route is to err on the side of creating dup agents from new incoming because we don't know if they are the same or not, especially for common name combos. Then let some agent clean-up tool do its thing, but we should understand that it's an iterative and progressive ongoing process as more data is added to Arctos.

See above about a pending table/ flag/ whatever

3. Data format: There's an issue somewhere, I'm doing horrible things to the CSV by request, but hopefully the above means that doesn't really matter....

You lost me on point 3. Not sure what this CSV looks like (will edit if I find that issue), but is it impeding adding agents to use?

Also will tag Erica Krimmel (maybe start project for Switzer?)

Jegelewicz commented 1 year ago

I was going to tag Erica but it seems she is no longer part of the Arctos Github organization?

dustymc commented 1 year ago

This seems pretty trivial at this point.

  1. Stop creating barely-over-the-bar agents, because we're not going to progress as long as we allow that. (Most recent: https://github.com/ArctosDB/arctos/issues/6679) If there is something that requires an Agent when nothing independent is known and so inspires these, fix that under highest priority.
  2. Create or add to agents when there's identifying information (probably near impossible to avoid), use verbatim when there's not.

We need another separate tool, right?

Not that I can see, we just need to purge the low-information agents and stop making more of them. There's a functionally-identical and (much) easier to use path to that.

committee to work through potential merges

Agents should carry information that doesn't require a committee. If they don't have that then they don't need to be agents.

safer route is to err on the side of creating dup agents

Not if we care about calling people what they'd prefer to be called or providing proper attribution for their work. We need to confidently support multiple unambiguous agents of the same name to do that. That is unavoidably incompatible with creating ambiguous agents. (Or we could take over the world and require unique names. Dibs on 0453c2b6-cf35-4155-8e5c-a5404062f5e3!)

And to be clear, I'm not trying to say we should or shouldn't do anything in particular, I'm just trying to spell out what's necessary if we want to do the things I think I've heard from the collections and the larger community. We can do about anything else, but how we model the data will unavoidably control the possible functionality.

DerekSikes commented 8 months ago

When creating a new agent there is a remarks field that one can type remarks in. This field does not seem to display on any of the new agents pages.

ewommack commented 8 months ago

That may be the curatorial remarks field. There are both curatorial remarks and public remark fields. One allows data to be entered that the public cannot see, so we can keep information more private, or keep notes on Agents. Look for the little public versus private icon when adding an attribute.

DerekSikes commented 8 months ago

It's this field (center of image). I made an agent, typed a bunch of text into that field to help others know if this was the agent they were looking for, and then saved. When I checked the agent pages in other views, that remarks field was gone (so folks won't benefit from all the disambiguation I typed). agents remarks field

ewommack commented 8 months ago

Hmmm no I think that is still there. Ah found it, click the link for See Full Agent Attributes. It should be there with all of the other nitty gritty for the agent in the Table form for the attributes.

DerekSikes commented 8 months ago

I see no link for Full Agent Attributes. I see links for Full Agent Activity Report or Summary Report. See below Screenshot 2024-02-26 at 5 37 05 PM

dustymc commented 8 months ago

may be

No....

Bug patched.

DerekSikes commented 8 months ago

When I go to select the agent that I had added remarks to the remarks are not visible in the agent picker tool (but a remarks field is present, it's just empty???) Screenshot 2024-02-27 at 7 23 35 AM

dustymc commented 8 months ago

had added remarks to

If you did this during creation before the bug above was fixed: it wasn't added, please edit to add.

If something else: I need details.

DerekSikes commented 8 months ago

How do I get to the page to add the remark? That field/page seems to only be accessible when creating a new agent. There is no editing access?

dustymc commented 8 months ago

Screenshot 2024-02-27 at 08 44 41

https://arctos.database.museum/agent.cfm?agent_name=Johnson

Screenshot 2024-02-27 at 08 45 12

or from the picker

Screenshot 2024-02-27 at 08 45 49 Screenshot 2024-02-27 at 08 45 58

DerekSikes commented 8 months ago

That brings me to this page. There are remarks fields for each attribute but I don't see a remarks field for the agent (unassociated with a particular attribute). Screenshot 2024-02-27 at 7 52 23 AM

dustymc commented 8 months ago

Screenshot 2024-02-27 at 08 57 36

DerekSikes commented 8 months ago

Ah, the remarks IS an attribute!

Ok, I think that problem is solved.

Here's a new one. There's an error on the bottom of this page: https://arctos.database.museum/agent/21351197?deets=true

And another - the agents picker doesn't list the 'alive' date attribute, wouldn't that be a super useful way to disambiguate?

dustymc commented 8 months ago

error

I'll get another patch out, probably tonight.

agents picker doesn't list

New Issue please (shouldn't be a problem, I just need the request).

DerekSikes commented 8 months ago

ok!