Closed dustymc closed 1 year ago
First pass: Attached are 1883 agents who have either one-word or initials preferred names, and who are not found outside of table collector.
Proposal:
temp_agent_clean_first.csv.zip
I'll proceed (using fresh data) whenever the conversation draws down (originally: if there are no objections by 2022-04-27).
Please retain 21263988 | Sanbornes
Please retain
If we proceed with this, that would be a matter of data. Maybe we'll be able to see through the clutter enough to build better rules at some point, but for now just about anything would escape the filters I'm working with. Address=South Pacific, alive=1972, WHATEVER. We'd like to have a bar, but at least initially it'll be a very low bar!
Some remark suggests they should be involved in an accession - that would stop this, but hopefully only temporarily.
Agent remarks suggest a name that might lead somewhere and the activity suggests one person, why not just use that and put the uncertainty in the remarks? Maybe we also need some sort of Best Practices document (or the existing cleaned up or added to) - "when given X, we suggest doing Y...."
Unrelated to agents, some other remark makes me suspect this wasn't collected after 1973, and I'm absolutely positive it wasn't collected tomorrow - event dates could be tightened up a LOT (but not as much as they could have been yesterday...).
We need to make a pass through this because this one
21313587 | ᑭᒐᐃ | first name=ᑭᒐᐃ|aka=Kigai; Remark: Ethnology and History verbatim agent; carver
probably needs to be kept as is
OBJECTION! Please don't delete anything yet ... but I should be able to get my agents clarified by the time you proceed. That said, as I go thru the list (i'm ever so glad I put my collection in the agent remarks field!) most of my single name individuals fall into one of two categories:
I have argued in the past for both of these types of single named agents to not be deleted or flagged as somehow "less valid" (i.e., moved to verbatim collector) than a record with more than one name. I will fight all night long to defend the single name Indigenous creator record. I will also defend the use of the name that is printed on the label as the preferred name, but will encourage our staff (including myself) to do a better job of finding the full corporate name, if it exists online.
[I'll now get down from my soapbox...]
@AJLinn brings up a few good points
The format of the agent name isn't in any way the problem, it's just a convenient place to start. This should eventually involve ALL agent names; they're still just strings, even (maybe especially!) if there are 17 "words" involved.
I should be able to get my agents clarified
Please let me know if there's any way I can help - pull data out, put it in, WHATEVER. If this comes down to one-by-one it may never get finished. (But it got started and we're thinking about this stuff and that's something!)
in the first decade of the 1900s
Great, add that (or the publications or whatever) and the agent easily clears this bar.
manufacturer
Ditto. (And bigger picture, it seems we're going to be forced past our unique preferred name restriction at some point, which would be a lot more approachable if we could tell the Nike in Oregon from the Nike from Greece.)
somehow "less valid"
See above, these are just a convenient place to start. I can drop this and grab a couple thousand random or something if the format is a distraction.
also defend the use of the name that is printed on the label as the preferred name
That is embedded in the "forced past unique restriction" mentioned above. Doing that and avoiding the absolute most disrespectful thing we could do - not properly attributing work to the creators - is the core of this; right now, if both Nikes show up and (reasonably) demand we use their name, we just can't. If we somehow allow two Nikes, we can't tell them apart (except maybe by digging through remarks, which isn't realistic) which leads to us attributing god-stuff to the shoe-folks. We need more data to move past our restrictions.
full corporate name
Please note that more names won't stop this (or that's how I hope it plays out, anyway). This is fundamentally a request for some sort of actual data beyond strings/names. The ideal form of that is something which leads to a lot more data - an ORCID/WikiData/LoC/whatever address - but the bar isn't that high (yet?? Probably never...) and a vague address (Canada) or status date (alive in 1905) will (we so hope) meet the foreseeable needs.
I was going to refer to documentation - much of the requested information exists, but not in such a way that machines (or humans, unless they're willing to dig) can find it, but the current documentation is not clear. @Jegelewicz the remarks section of https://handbook.arctosdb.org/best_practices/Agents.html#general-recommendations-for-creating-meaningful-agents should look more like https://github.com/ArctosDB/documentation-wiki/blob/ee9493ba951cb64639eb0e97fb51b5e909871c01/_documentation/agent.markdown - "Use remarks as a last resort" is the critical (and now missing) idea.
From the CSV:
Remark: UAM ethnology & history; sports uniform manufacturer in mid-20th century; moved from Pasadena to San Marcos, CA in 1971.
I copied some of that to appropriate places:
And now we have TWO non-name-based data points! There might be another 500 Spanjians out there, maybe even making Sportswear, and as long as they're not operating in San Marcos in 1971 they can't confuse anyone!
Now I'm gonna go file an issue about the values I had to use...
the remarks section of https://handbook.arctosdb.org/best_practices/Agents.html#general-recommendations-for-creating-meaningful-agents should look more like https://github.com/ArctosDB/documentation-wiki/blob/ee9493ba951cb64639eb0e97fb51b5e909871c01/_documentation/agent.markdown - "Use remarks as a last resort" is the critical (and now missing) idea.
moved remarks stuff to Don't
21313587 | ᑭᒐᐃ | first name=ᑭᒐᐃ|aka=Kigai; Remark: Ethnology and History verbatim agent; carver probably needs to be kept as is
ᑭᒐᐃ is acting as a creator, I think it's safe to assume they were at the creation event which carries places and dates. I don't want to get into some tail wagging the dog situation so I'm (extremely) hesitant to just make those assertions, but I could round them up for human review (and help load anything which passes that).
The other viewpoint is that ᑭᒐᐃ is functionally nothing but a string stored in a complicated way at the moment, changing that to a string stored in a less-complicated structure doesn't change any meaning or function that I can identify. At some point hopefully someone will "elevate" some/many/most "simple string agents" to agent objects (because they want to do something that requires the complexity, not "just because" - I hope), and I'm happy to build tools to facilitate, I just need a use case. (I don't think we're missing any functionality now, but I can probably save some clicking.)
Note also that this approach would unavoidably allow what we're really trying to get rid of. If for some reason someone wants to scrounge up data for T. K. (who seems to be no more than a footnote in an obscure publication), then doing so would put them in the "safe pile" along with any other more-than-strings agent. I'm not sure if that's a feature or a bug, but it's probably unavoidable under this viewpoint.
@dusty is it possible to get a csv or SQL for UCM records using values from temp_agent_clean_first.csv.zip? That way I can more easily take a pass at reviewing and adding more agent info when possible.
I did this
select string_agg(guid, ',') from (
  -- one GUID per UCM catalog record whose collector is in the cleanup list
  select concat(guid_prefix, ':', cat_num) as guid
  from cataloged_item
  inner join collection on cataloged_item.collection_id = collection.collection_id
  inner join collector on cataloged_item.collection_object_id = collector.collection_object_id
  inner join temp_agent_clean_first on temp_agent_clean_first.agent_id = collector.agent_id
  where guid_prefix like 'UCM:%'
) x
but the result is a bit awkward to pass around so https://arctos.database.museum/archive/ucm_issue_4554 - let me know if you need something else.
"Use remarks as a last resort" is the critical (and now missing) idea.
I actually really disagree with this idea, unless we instead add a free text field called biographical profile or biographical summary. This is essential, useful data that helps distinguish one John Smith from another, it shows up in our agent summary, and is critical for understanding the context of our collections.
Compare our agent record for Robert Bloom to that of the UAF Archives (which is a short one also):
It's easier and more useful than creating a PDF of a biographical profile and attaching it as a media file to the agent record... more clicks and downloads.
We already allowed for markdown formatting for paragraphs of text, so the agent summary page looks better when there's more there.
just create a relationship to your organization (associate of) instead of or along with remarks and that should cover it
I'm not sure this is an appropriate way to "claim" that agent. I'd prefer to add some born/alive/died/dead data, some geographic information in an address field, or additional biographical info if it's able to be located. Sometimes it's an oral history recording or maybe a historical photo in an online digital archive. Would that help fulfill some data points you're looking for @dustymc ?
essential, useful data that helps distinguish one John Smith from another
For anyone who reads it: sure. A date buried in there is also completely inaccessible to things like https://github.com/ArctosDB/arctos/issues/4551 (and probably most users). The current documentation says "Don’t use remarks when more formal data are possible." which I believe is correct - we do have an appropriate "more formal" field for places (address) and dates (status) so that doesn't belong (or only belong, I don't care what's replicated in remarks to be more readable or etc.) in remarks. We don't have a place for biographical profile so that does belong in remarks. Unless....
add a free text field called biographical profile
New issue, no objection from me (as long as it can be defined in such a way that it's not "remarks when someone felt like using that field").
create a relationship to your organization (associate of)
If they're working for you: Yes, absolutely.
If they tossed a dead rat (or motorcycle or whatever) at you at some point: Nope, over-using relationships will just result in those data not getting cleaned up when we get access to tools (or brains).
born/alive/died/dead data....geographic information in an address field...historical photo i....online digital archive
Any of that will get the agent over the (tentative) current bar. I'd of course like to have all of it and in great detail, but at this point any sort of structured data feels like a great leap forward.
The conversation seems to have drawn down, OK to proceed per https://github.com/ArctosDB/arctos/issues/4554#issuecomment-1098531767?
If by proceed you mean nuking all the one-name agents, I'm still working on my mega-list to add "alive" info and "shipping" address so there are three points of data. Can you give me time to fix them? I can prioritize for the next couple of days.
No hurry, I just don't want to lose whatever momentum we've got going.
Let me know if I can help with anything.
Looks like I have 50 agents to update. Unfortunately, I don't think there are any automated wizard things we can do other than looking at each agent's activity report and assessing each one individually. We'll see how long it takes!
See https://github.com/ArctosDB/arctos/issues/4568 - we discussed rebuilding the activity page (somewhere...), let us know what would be useful to surface there.
I did it! In this Google Sheet, all the lines highlighted in blue were fixed in some way, mostly adding an alive date and adding a shipping address. Lines left yellow should be moved to verbatim collector as there was not enough information available to justify keeping the agent. Some of the names were corrected, marked as bad duplicates, or had full names found thru an Ancestry.com search, so their agent record no longer exists as on this sheet. Column D indicates what action I took on each agent. Thanks for your patience. Please let me know if you find a problem or need further clarification.
That's awesome @AJLinn, and I think good evidence that these low-information string-only Agents are in fact leading to problems, both by being confounded with other agents, and by "good" agents being lost in amongst the truly low-information.
Here's another run of the initial query, except I also excluded agents created in the last year.
I'll add this to the AWG Agenda for increased visibility before shuffling anything, and - assuming this is a direction we are comfortable going in - we seem to need two more pieces of best practice documentation:
AWG discussed at length, please see notes in the agenda under the Agents Committee section: Create flags when dates don't match (death date is before a collecting date example C. H. Townsend)
Agent Committee looking into starting to clean up agents
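The date check from the notes (flag an agent when a death date is before a collecting date) could look something like the following. This is only a minimal sketch: the function and parameter names are hypothetical, and the dates in the usage lines are made up for illustration, not taken from any agent record.

```python
from datetime import date
from typing import Iterable, List, Optional


def collecting_dates_after_death(
    death_date: Optional[date], collecting_dates: Iterable[date]
) -> List[date]:
    """Return the collecting dates that fall after the agent's death date.

    An empty list means there is nothing to flag. Agents with no known
    death date are never flagged by this check.
    """
    if death_date is None:
        return []
    return sorted(d for d in collecting_dates if d > death_date)


# Illustrative (invented) dates: one collecting event before death, one after.
flags = collecting_dates_after_death(
    date(1950, 1, 1), [date(1949, 6, 1), date(1962, 7, 4)]
)
```

Here `flags` holds only the 1962 date, which is the one a human would want to review.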
thanks for the bullet points @lin-fred To expound on a few points:
We're not saying we want low-data agents (!)
If I'm reading this right you are, whether that's the intention or not!
we will deal with them in stages
I'm just proposing a better separation between those stages, which would let you make the decisions (most of them, anyway) in the context of everything in Arctos rather than the typical maybe-one-collection spreadsheet.
priority should be to get data into Arctos
Agreed - but that doesn't mean we need to make a mess of what should be formal data. I'm proposing to lower that bar, or delay much of the cleanup if you want to look at it that way.
too many barriers to adding in agents (and thus records)
Perhaps this is the point of divergence: Those things are not (entirely) related, you don't need to do anything with maybe-eventually-Agent Collectors to load records, and all evidence suggests you shouldn't bother trying - making that call in the context of other stuff (rather than eg in some spreadsheet) is less work for better data.
not enthusiastic about doing more with verbatim agents unless we are missing some critical coolness about them
I think/hope this is the same as above, the coolness is there and has been for a while, if all you have are strings (eg names) then these are entirely functionally identical and you're just making work (for you and everyone who will need to eventually sort through your low-data messes) with no benefit at all for anyone by forcing them to Agents.
may require a shift
Yes, but it's a simplification.
and general community agreement
I can see no losses, this looks all good/no bad, I'm not sure what would be required beyond the initial agreement to move string-only Agents to a string-only format.
This impacts new collection data migration work (large bulkloading of names)
What I'm proposing does indeed impact that, by mostly eliminating it. 90+% of any new collection is "collectors" (the node, not the role) - those could be removed, and the effort focused on the remaining ~10% (donors and identifiers and such). Still looks all good+no bad from here....
differently than cataloging new records
Perhaps, but I might be able to figure a UI solution out. ("This isn't an agent, {create} or {use attributes}." doesn't seem entirely unapproachable.)
which has most of the agents cleaned-up too.
I'm no longer advocating for "clean" (whatever that might mean), I'm advocating for "carries more information than strings can" as the bar.
I think (hope?!) I'm not being clear on something, so here's the proposal again:
That's really it. Do less work when there's no reason to do more, give up nothing in the process. I'm not sure why that's controversial?
If "can be handled by the Attribute" needs elaboration, it's Agents which are referenced only by tables agent_name and collector. Agents which act as identifiers/donors/anything except collector and agents with any status, address, or relationship information would not be affected in any way, other than being surrounded by a lot less clutter (and I can't see any way that won't lead to better data, scroll up for lots of examples even from this tiny initials-only corner of our mess).
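The rule above can be sketched as a simple predicate. This is a minimal illustration, not Arctos code: the `Agent` record, its field names, and the `referencing_tables` set are assumptions made up for the example; only the table names `agent_name` and `collector` come from the discussion.

```python
from dataclasses import dataclass, field

# References from these tables do NOT lift an agent above "string-only".
# The two table names come from the thread; everything else is hypothetical.
STRING_ONLY_TABLES = {"agent_name", "collector"}


@dataclass
class Agent:
    """Hypothetical, simplified view of an agent, for illustration only."""
    preferred_name: str
    referencing_tables: set = field(default_factory=set)  # tables that use this agent_id
    statuses: list = field(default_factory=list)          # e.g. "alive 1905"
    addresses: list = field(default_factory=list)         # e.g. "Canada"
    relationships: list = field(default_factory=list)     # e.g. "associate of ..."


def is_string_only(agent: Agent) -> bool:
    """True if the agent carries nothing beyond names and collector usage."""
    used_elsewhere = bool(agent.referencing_tables - STRING_ONLY_TABLES)
    has_data = bool(agent.statuses or agent.addresses or agent.relationships)
    return not used_elsewhere and not has_data


# A names-only collector versus one with a vague address (values are illustrative).
names_only = Agent("T. K.", {"agent_name", "collector"})
with_address = Agent("Spanjian", {"agent_name", "collector"}, addresses=["San Marcos, CA"])
```

Under these assumptions `is_string_only(names_only)` is true and `is_string_only(with_address)` is false: any single status, address, relationship, or use outside `collector` keeps the agent where it is.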
It just seems like moving all these agents to verbatim will make cleanup more difficult. Say I have specimens collected by "firstname lastname". I don't know who this person is at the moment. Another collection also has specimens collected by firstname lastname. Their collector is in verbatim because they also don't know anything about them. Their verbatim agent actually collected very similar things to mine, at similar times and in similar places, but I don't know that because it is in verbatim and there is no agent activity report to clue me in. Maybe that link between my collector and their collector could have helped me figure something out about this person. So, I also add firstname lastname as a verbatim agent. Later, someone else comes along and adds some publication by firstname lastname, adding this person as an agent in the process. Great! More is known about this person. But since all our collecting records for firstname lastname are all in verbatim, no one will ever know. Especially if someone somewhere along the way misspelled this person's name as firstmame lastname and nobody realized because the verbatim field has no code table.
@Nicole-Ridgwell-NMMNHS I don't think you're wrong about any of that, except that it demonstrably results in a whole bunch of variations of 'firstname lastname' that never get reconciled, and when the next person comes along they just throw up their hands and create one more because why not - and the pile gets a little more impenetrable.
I don't think rounding up all 500 firstname lastname variations in verbatim is any more work than in agents. Maybe it's even less work, because some of those agents tend to get misattributed to all sorts of unlikely things where the verbatim are more isolated, IDK.
I'm actually not sure why I'm not suggesting that now, I'll open an Issue.....
Convenient example, here's what's cooking now. They're largely from the same project, I don't think anybody involved is careless in any way, they're investing a lot more time than most collections would, etc., etc. - I think this is about as good as it gets, and it still results in a lot of duplicates because (a) it's mostly just strings, that's always a bit of a guessing game, and (b) there are a LOT of existing strings to sort through.
The proposal that this has turned into would just isolate those string-only data; everything left in Agents would have some other bit of information available, and the strings wouldn't have any possibility of having been confounded with each other or anything else through an erroneous merge or etc.
 getpreferredagentname | agent_relationship | getpreferredagentname
-----------------------+--------------------+------------------------
 Renn Tumlison         | bad duplicate of   | C. Renn Tumlison
 D. R. Herter          | bad duplicate of   | Dale R. Herter
 J. L. Sands           | bad duplicate of   | James L. Sands
 PREP STAFF            | bad duplicate of   | Prep. Staff
 R. E. Mumford         | bad duplicate of   | Russel E. Mumford
 Allison J. Schultz    | bad duplicate of   | Allison J. Shultz
 E. J. Larrison        | bad duplicate of   | Earl J. Larrison
 G. McLin              | bad duplicate of   | Glen McLin
 C. W. Richmond        | bad duplicate of   | Charles W. Richmond
 J. MacCracken         | bad duplicate of   | J. G. MacCracken
 J. T. Weir            | bad duplicate of   | Jason T. Weir
 M. A. Etnier          | bad duplicate of   | Michael A. Etnier
 C. Hrycko             | bad duplicate of   | Christopher Hrycko
 O. A. Willett         | bad duplicate of   | Ora A. Willett
 Frank Pitelka         | bad duplicate of   | Frank A. Pitelka
 C. G. Rinker          | bad duplicate of   | George C. Rinker
 George Rinker         | bad duplicate of   | George C. Rinker
 W. Wileman            | bad duplicate of   | W. C. Wileman
 Syd Anderson          | bad duplicate of   | Sydney Anderson
 D. E. Metter          | bad duplicate of   | Dean E. Metter
 G. Rinker             | bad duplicate of   | Gary Rinker
 Tovar                 | bad duplicate of   | unknown
 Craig Hilburn         | bad duplicate of   | David Craig Hilburn
 B. T. Ostenson        | bad duplicate of   | Burton T. Ostenson
 J. L. Reid            | bad duplicate of   | Julia L. Reid
 P. Clifton            | bad duplicate of   | Percy L. Clifton
 R. A. Campbell        | bad duplicate of   | Ronald A. Campbell
 V. Shafer             | bad duplicate of   | V. W. Shafer
 Z. Fry                | bad duplicate of   | Zerol Fry
 C. S. Thaeler         | bad duplicate of   | Charles S. Thaeler Jr.
 J. B. Bowles          | bad duplicate of   | John B. Bowles
 J. L. Hayward         | bad duplicate of   | Jim L. Hayward
 K. Estlund            | bad duplicate of   | Kevin Estlund
 J. Gurgel             | bad duplicate of   | Jo Gurgel
 N. Marr               | bad duplicate of   | N. Verne Marr
 S. Farag              | bad duplicate of   | Saleem Farag
 S. L. Lindsay         | bad duplicate of   | Steve L. Lindsay
 Dale Guthrie          | bad duplicate of   | Russell Dale Guthrie
 N. E. Dochuchaev      | bad duplicate of   | Nikolai E. Dokuchaev
(39 rows)
I am the destroyer of agents. Look upon me and despair.
I am trying to summarize all of the concerns/questions surrounding this issue and get it all together as one comment before this week's issues meeting. I am working out of our last Agent committee google doc, at the very bottom:
Here are my notes now that need some input:
Main issue: stop low-information agents, do more with verbatim agents #4554
Related:
get all agent "names" in one place on the catalog record #4869
Code Table Request - verbatim agent #4871
Feature Request - try to match agents to verbatim whateverwecallems #4872
Agent cleanup - agents of type "other agent" #4853
Please add any more related issues here
Notes:
"Only use as collectors" is the proposal, but https://github.com/ArctosDB/arctos/issues/4871 does provide a mechanism by which a user/collection could choose to extend beyond that.
The agent_id appearing in loan/accession, addresses, relationships, or any of the other ~30 possible places would exclude the agent from any possible "verbatimizing."
The catalog record search does include verbatim; that's been out for some time.
Agents has 'activity' functionality - also not new.
https://github.com/ArctosDB/arctos/issues/4872 would add something to that - it will somehow try to tell you that 'B. Richards' (verbatim collector) maybe should be merged into https://arctos.database.museum/agent/21271606, but without being annoying when that's already been done. "Some tools exist, anything the data can support is possible" is the intention, we'll probably have to experiment a bit to know exactly what the data can support.
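To make the string-matching end of that idea concrete, here is one possible check: treat a verbatim string as a possible abbreviation of an agent name when the surnames agree and each given-name token either matches exactly or is an initial of a given name, in order. This is a rough sketch under the assumption of simple "given names then surname" strings; the function name is hypothetical, suffixes such as "Jr." are not handled, and nothing here describes how #4872 is or will be implemented.

```python
def could_be_same_person(short: str, full: str) -> bool:
    """Rough check: could `short` be an abbreviated form of `full`?

    Compares case-insensitively, requires identical last tokens (surname),
    and lets each given-name token in `short` match a given-name token in
    `full`, in order, either exactly or as an initial ("D." matches "Dale").
    """
    s = [t.rstrip(".").lower() for t in short.split()]
    f = [t.rstrip(".").lower() for t in full.split()]
    if not s or not f or s[-1] != f[-1]:
        return False
    given_short, given_full = s[:-1], f[:-1]
    i = 0
    for token in given_short:
        # Walk forward through the full name's given names looking for a match.
        while i < len(given_full):
            candidate = given_full[i]
            i += 1
            if candidate == token or (len(token) == 1 and candidate.startswith(token)):
                break
        else:
            # Ran out of given names without matching this token.
            return False
    return True
```

With names from the duplicate list earlier in this thread, "D. R. Herter" matches "Dale R. Herter", but "G. Rinker" matches both "George C. Rinker" and "Gary Rinker", while "C. G. Rinker" and "Syd Anderson" miss their recorded targets entirely. That ambiguity is why output like this could only ever be a suggestion for human review, not an automatic merge.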
-- Is not connected to a loan/accession
I would also add publication and project or determiner of some thing (identification, attribute)
Agents has 'activity' functionality
I believe @lin-fred is looking for this functionality for VERBATIM AGENTS in order to help determine if they are the same as some existing agent.
add publication and project or determiner of some thing (identification, attribute)
No real objection from me, but any list will be long (and probably incomplete, we change stuff all the time). https://arctos.database.museum/tblbrowse.cfm?tbl=agent (minus collector) is a close approximation.
this functionality for VERBATIM AGENTS
Click one of them - that's "activity." (Other formats/tools/whatever are always possible, but that is the comprehensive summary.)
Click one of them - that's "activity."
What do I click?
from agent search. (But I could make those clicky, might be kinda cool....)
I still don't get it! If I search Dalquest in Agents - I don't see W. W. Dalquest
Is this only from the main search page?
AH HA! DOH!
Make default yes?
I wouldn't do that - the number of W. W. Dalquests is completely overwhelming....
It really is/should be a tool for agent cleanup. I think intentionally selecting it makes more sense - We just need to document the option.
from agent search. (But I could make those clicky, might be kinda cool....)
I think it would help if they are also clicky in the record itself
AWG: Likes the direction (yay!!).
TODO
clicky in the record itself
"Dumb version" in next release, https://github.com/ArctosDB/arctos/issues/4872 might provide a mechanism to do more than string-match, revisit record once that's developed
@dustymc what will happen when someone in the future creates an agent that meets this level of low quality? Is there going to be a script that runs every so often that merges it into verbatim?
Can we disallow that? Any agent MUST have at least one bit of information besides names/akas?
Can we disallow that? Any agent MUST have at least one bit of information besides names/akas?
But currently, if you make a singular agent, you can only add their name and a remark on the creation page. And so as it is, no agents would be able to be created. Because we currently create the agent, and then edit to add in all the fluff.
So do we want to change that? But then how many fields do we add to the "create agent" button?
what will happen
If I get to choose, what @Jegelewicz said sounds great, just outright ban low-information agents. No confusion, no surprise scripts, no complications, no running automation, nobody showing up on my lawn with pitchforks because they didn't read the docs and all their agents vaporized, pleaseplease lets do this...
(We can figure out the UI/details.)
If we can't get that organized, then a monthly-or-whatever purge of year-or-whatever-old low-information agents could work.
If I get to choose, what @Jegelewicz said sounds great, just outright ban low-information agents. No confusion, no surprise scripts, no complications, no running automation, nobody showing up on my lawn with pitchforks because they didn't read the docs and all their agents vaporized, pleaseplease lets do this...
(We can figure out the UI/details.)
If we can't get that organized, then a monthly-or-whatever purge of year-or-whatever-old low-information agents could work.
ok cool
Minimum requirements for agents is fine as long as we show a clear method to use verbatim and a path to converting verbatim to full "agenthood" as data becomes available. I think the latter will be key to keeping pitchforks off your lawn
@dustymc can you link me the table of agent types that have the possibility of being merged into verbatim agent? You said it was the collector table and not just collectors?
Sorry if you already posted this in the thread.
so what happens with the collector_role values when they merge into verbatim agent?
Is your feature request related to a problem? Please describe.
We have a lot of low-data agents, they make everything in agent land more difficult than it needs to be.
Describe what you're trying to accomplish
Better data, less work.
Describe the solution you'd like
Describe alternatives you've considered
Much work, bad data.
Additional context
First Step: report of low information agents who don't have addresses or relationships and don't extend beyond table collector.
Priority
High, the problem gets worse with every new collection.
EDIT: the promised SQL
Just change the CHAS:Mamm of where guid_prefix='CHAS:Mamm' to an appropriate value for other collections. Values can be found on https://arctos.database.museum/home.cfm.