ArctosDB / arctos

Arctos is a museum collections management system
https://arctos.database.museum
Apache License 2.0

UCM Agent Bulkload Request #7894

javanveldhuizen closed 2 weeks ago

javanveldhuizen commented 4 weeks ago

cf_temp_pre_bulk_agent_download_version ready.csv

Please bulkload the agents in the attached file.

Note: The file should be results from the Agent Prebulkload Tool. If the file is too large for Github attachments, comment here and an email address or shared file space will be provided to you.

Jegelewicz commented 4 weeks ago

S-C Lee, C-C Chen, C-P Chen, J-T Chao

and others with a dash in preferred name. First names should not include punctuation other than a period. Are we sure these are people, and should they be:

S. C. Lee, C. C. Chen, C. P. Chen, J. T. Chao

?

R/V Soyo-Maru

Is not a person, but a research vessel? If so, this may be added as an organization.

A. C. Burrill, R. C. Burrill

This really feels like someone somewhere mis-transcribed an A for an R or the other way around?

W. F. Halliday, W. R. Halliday

Ditto for the F and R in these two

And the D and K here

D. A. Han, K. A. Han

Mr. A. E. Collins, Mrs. A. E. Collins

Add the "spouse of" relationship between these after they are added?

Will Eberle-Taylor, Nick Eberle-Taylor, Quinn Eberle-Taylor

I assume these people are related? Do we know how?

Can I be convinced that these are really not the same person?

William W. Hay, W. Hay

Or these two?

Norman E. A. Hinds, Norman E. C. Hinds

All of the "not the same as" relationships require a method and determiner.

I am not trying to be obstructionist, but it seems like there is still some cleanup that could be done before we add these agents? I stopped looking at the near matches, so there are probably others I would add to the categories above.
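
The in-house check suggested here (pairs of names that differ by a single initial or a few letters) could be roughly approximated before a file is submitted. The following is a minimal sketch, assuming only a plain list of preferred names; the function name `near_matches` and the 0.85 threshold are illustrative choices, not part of any Arctos tool, and a flagged pair is only a prompt for human review.

```python
from difflib import SequenceMatcher
from itertools import combinations


def near_matches(names, threshold=0.85):
    """Return (name_a, name_b, ratio) for distinct names that look alike.

    The threshold is an arbitrary starting point (an assumption) and
    would need tuning against real data.
    """
    pairs = []
    for a, b in combinations(sorted(set(names)), 2):
        ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
        if ratio >= threshold:
            pairs.append((a, b, round(ratio, 2)))
    return pairs


# Names taken from the comment above.
names = [
    "A. C. Burrill",
    "R. C. Burrill",
    "W. F. Halliday",
    "W. R. Halliday",
    "R/V Soyo-Maru",
]

for a, b, ratio in near_matches(names):
    print(f"review: {a!r} vs {b!r} (similarity {ratio})")
```

A string-similarity score cannot decide whether two names are the same person; it only narrows the list that someone with access to the source records needs to look at.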

javanveldhuizen commented 4 weeks ago

No worries. Thanks for catching those. Updated agent list attached: cf_temp_pre_bulk_agent_download_final version.csv

javanveldhuizen commented 3 weeks ago

@dustymc Thanks for including me in the https://github.com/ArctosDB/arctos/issues/7649 issue. Maybe we should pare our list down so that the only agents that get uploaded are ones that have full names (i.e., no initial) or have one (or more) attribute that distinguishes them (makes them unique) from other agents? So, for instance, if we have a J. Smith, the only way we can upload that person as an agent is if we had an attribute, say "child of", linked to that agent. Would that work?
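
The proposed rule (load an agent only if the name has no initials, or if at least one distinguishing attribute is attached) can be written down as a small filter. This is a sketch under stated assumptions: the dictionary keys `name` and `attributes` are hypothetical, as is the "Mary Smith" example value, and the definition of an "initial" as a single letter with an optional period is a guess at what is meant.

```python
import re

# Assumption: an "initial" is one letter, optionally followed by a period.
INITIAL = re.compile(r"^[A-Za-z]\.?$")


def has_initial(name):
    """True if any whitespace-separated part of the name is a bare initial."""
    return any(INITIAL.match(part) for part in name.split())


def loadable(agent):
    """Keep an agent with a full name, or with at least one attribute."""
    return not has_initial(agent["name"]) or bool(agent.get("attributes"))


# Hypothetical records; the keys and the "child of" value are illustrative.
agents = [
    {"name": "J. Smith", "attributes": []},
    {"name": "J. Smith", "attributes": [("child of", "Mary Smith")]},
    {"name": "Barbara Waleis", "attributes": []},
]

kept = [agent for agent in agents if loadable(agent)]
for agent in kept:
    print(agent["name"], agent["attributes"])
```

Hyphenated forms such as "S-C" are not treated as initials by this pattern, so they would need a separate decision.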

Jegelewicz commented 3 weeks ago

> So, for instance if we have a J. Smith the only way we can upload that person as an agent is if we had an attribute, say "child of", linked to that agent. Would that work?

That will help, but the ones I am struggling with include things like

Barbara Waleis which feels like it may be a mistranscription of Barbara T. Waters

Charles A. Nelson feels like a mistranscription of Charles D. Nelson (or perhaps it is the other way around, A and D can look very similar when written or maybe these ARE two different people, but I have no way to decide that)

Chin-Tsong Lewis and Chin-Tsong Lo - one of these must be a misspelling, an alternate name for the same person, or are they related people?

You may have no way to figure out if my "feelings" are justified, but if you do, it might be good to get things like this sorted before making agents.

As before, I did not peruse the entire list to look for these internal issues, but there are probably others! Do not take this as a summary of everything that I think needs review - just ideas for looking at the data you have in-house even before comparisons to Arctos agents.

javanveldhuizen commented 3 weeks ago

> Barbara Waleis which feels like it may be a mistranscription of Barbara T. Waters

I can confirm that Barbara Waleis and Barbara T. Waters are two different people. Waleis is a collector from the 1930s, while Waters is a collector from the 1980s.


The others are all agents for the invert zoo collection, which will need to be checked by @Krmartin3 when she gets back from vacation. I can say that the Chinese do use hyphenated first names. So, Arctos may need to figure that one out, but I'll let Kelly chime in when she is back.

> Charles A. Nelson feels like a mistranscription of Charles D. Nelson (or perhaps it is the other way around, A and D can look very similar when written or maybe these ARE two different people, but I have no way to decide that)

> Chin-Tsong Lewis and Chin-Tsong Lo - one of these must be a misspelling, an alternate name for the same person, or are they related people?


In the meantime, I'm going to pull all of invert zoo's agents from the sheet, as I think most of the issues are coming from that side (sorry Kelly). I'll reupload a new sheet of agents here in a bit.

javanveldhuizen commented 3 weeks ago

@Jegelewicz new list of agents attached cf_temp_pre_bulk_agent_vert paleo agents only.csv

dustymc commented 3 weeks ago

@javanveldhuizen the dates in that CSV have been mangled (probably by Excel?).

javanveldhuizen commented 3 weeks ago

@dustymc Interesting, the dates look fine on my end.

Screenshot 2024-07-02 075148

Should I use a different program to edit the CSV instead?

javanveldhuizen commented 3 weeks ago

@dustymc Ok. I edited the CSV using Notepad and changed all the dates into the desired format: yyyy-mm-dd. Let me know if that doesn't work.

cf_temp_pre_bulk_agent_vert paleo agents only.csv
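
Hand-editing dates into yyyy-mm-dd works for a small file; for a larger one the same conversion could be scripted. The sketch below assumes the mangled values are in month/day/year form, which is only a guess about what the spreadsheet program wrote, and the column name `some_date` in the usage example is hypothetical.

```python
import csv
import io
from datetime import datetime

# Formats tried in order. This list is an assumption and must match what
# is actually in the file; an ambiguous value such as 07/02/2024 is read
# month-first here, which may be wrong for some data.
KNOWN_FORMATS = ("%Y-%m-%d", "%m/%d/%Y")


def to_iso_date(value):
    """Return value as yyyy-mm-dd, or raise ValueError if nothing matches."""
    text = value.strip()
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {value!r}")


def normalize_column(csv_text, column):
    """Rewrite one date column of a CSV string, leaving empty cells alone."""
    reader = csv.DictReader(io.StringIO(csv_text))
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=reader.fieldnames, lineterminator="\n")
    writer.writeheader()
    for row in reader:
        if row[column]:
            row[column] = to_iso_date(row[column])
        writer.writerow(row)
    return out.getvalue()


print(normalize_column("name,some_date\nBill Simpson,7/1/2024\n", "some_date"))
```

Raising on an unrecognized value, rather than passing it through, keeps a silently mangled date from reaching the loader.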

dustymc commented 3 weeks ago

> look fine

Yeah, but they don't SAVE fine (i.e., unambiguously), which is why we require CSV.

https://handbook.arctosdb.org/how_to/How-to-Excel-for-Arctos.html#dates (I wrote the 'eat your data' bits but not the niceties at the top!)

Thanks, I've got those in the pre-loader.

The first thing in my view is "Humboldt Museum" - surely that's https://arctos.database.museum/agent/21336826 or https://arctos.database.museum/agent/21348575??

javanveldhuizen commented 3 weeks ago

> The first thing in my view is "Humboldt Museum" - surely that's https://arctos.database.museum/agent/21336826 or https://arctos.database.museum/agent/21348575??

It's actually neither of those. The specimens I have tied to the Humboldt Museum were donated to us by a researcher at the Humboldt-Universität zu Berlin. What's unclear is whether they were actually part of the museum at that university, which later became the Museum fuer Naturkunde der Humboldt-Universitaet Berlin, or part of a researcher's lab collection. I kept it as Humboldt Museum until I could fully untangle it. Feel free to delete it from the list if you feel that it is not an appropriate true agent.

javanveldhuizen commented 3 weeks ago

@dustymc Here is the agent sheet again with the Humboldt Museum removed: cf_temp_pre_bulk_agent_vert paleo agents only.csv

dustymc commented 3 weeks ago

> you feel

Ugh, that should not be the path, @ArctosDB/agents-committee HELP!

Lacking further guidance, that seems a somewhat defensible position to me (and a remark would be useful, if that's not already there).

I loaded data to https://docs.google.com/spreadsheets/d/1it7JgDc0Fxnccn5yD_bO6kdYFjPRrbJhqptOVAOu3G8/edit?gid=907589706#gid=907589706

Again an "interesting" situation on the first line!

Screenshot 2024-07-02 at 08 58 39

First your agent will load, then Arctos will run....

arctosprod@arctos>> select getAgentID('David Taylor');
 getagentid 
------------
   21333592

except two results will be returned - this one and the one just created - which will result in an error. Maybe that's somehow my problem, but I'm not quite sure how to address it. https://arctos.database.museum/agent/21333592 will always be unambiguous, but isn't great for humans to work with in a spreadsheet.
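
The failure mode described here can be illustrated with a toy lookup. This is not the implementation of Arctos's `getAgentID`; it is a minimal sketch of any name-to-ID resolver that insists on exactly one match, and the ID 99999999 for the newly loaded agent is hypothetical.

```python
class AmbiguousAgentName(Exception):
    """Raised when a preferred name resolves to more than one agent."""


def get_agent_id(agents, name):
    """Resolve a preferred name to exactly one agent ID, or raise."""
    ids = [agent_id for agent_id, preferred in agents.items() if preferred == name]
    if not ids:
        raise KeyError(name)
    if len(ids) > 1:
        raise AmbiguousAgentName(f"{name!r} matches {len(ids)} agents: {ids}")
    return ids[0]


# Before the load: one David Taylor, so the name is unambiguous.
agents = {21333592: "David Taylor"}
print(get_agent_id(agents, "David Taylor"))

# After loading a second, different David Taylor (hypothetical ID),
# the same string no longer identifies a single agent.
agents[99999999] = "David Taylor"
try:
    get_agent_id(agents, "David Taylor")
except AmbiguousAgentName as err:
    print("lookup failed:", err)
```

Referring to agents by ID or URL avoids the ambiguity, at the cost of being harder to read in a spreadsheet, which is the trade-off noted above.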

Beyond that, I don't know how to proceed. (I'd use verbatim agents as a first pass so we don't have to guess from strings, but I seem to have lost that argument!)

person | Sarah E. Rieboldt | attribute match: first+last variants Sarah Rieboldt person
    first name | Sarah
    middle name | E.
    last name | Rieboldt
    not the same as | Sarah Reiboldt | 2024-07-01 | Jacob Van Veldhuizen
    dlm

person | Bill Simpson | attribute match: first+last variants William Simpson person
    first name | Bill
    last name | Simpson
    not the same as | William Simpson | 2024-07-01 | Jacob Van Veldhuizen
    dlm

organization | Brigham Young University Museum of Paleontology | attribute match: aka Brigham Young University Life Science Museum organization
    aka | BYU
    Wikidata | https://www.wikidata.org/wiki/Q4836911
    not the same as | Brigham Young University Life Science Museum | 2024-07-01 | Jacob Van Veldhuizen
    dlm

look pretty suspicious (and maybe that's OK, I don't know, this should still not be my call @ArctosDB/agents-committee !!)

I didn't scroll very far, just enough to grab a couple examples.

I don't see any super-obvious duplicates or mistyped agents or such in the file. I REALLY don't want this to be my call (see above, I'd do something entirely different!), and the ~30 flagged by the checker could definitely use careful review, but loading this doesn't seem unreasonable. @Jegelewicz @mkoo thoughts??

javanveldhuizen commented 3 weeks ago

@dustymc

> arctosprod@arctos>> select getAgentID('David Taylor');
> getagentid
> 21333592

I have deleted David Taylor from my list and will make him a verbatim agent for now until that issue is fixed. I can confirm that the David Taylor already in Arctos is not the same David Taylor in my data.

> person | Sarah E. Rieboldt | attribute match: first+last variants Sarah Rieboldt person | first name | Sarah | middle name | E. | last name | Rieboldt | not the same as | Sarah Reiboldt | 2024-07-01 | Jacob Van Veldhuizen | dlm

For some reason Sarah Reiboldt keeps reappearing in this list even though I keep deleting it. Anyway, I've deleted it once again and I can confirm that the Sarah Reiboldt already in Arctos is the same Sarah Reiboldt in my data.

> person | Bill Simpson | attribute match: first+last variants William Simpson person | first name | Bill | last name | Simpson | not the same as | William Simpson | 2024-07-01 | Jacob Van Veldhuizen | dlm

The Bill Simpson I have in my data is an amateur collector in the Denver area and not the William Simpson already in Arctos. These are two separate people, as indicated by the "not the same as" attribute.

> organization | Brigham Young University Museum of Paleontology | attribute match: aka Brigham Young University Life Science Museum organization | aka | BYU | Wikidata | https://www.wikidata.org/wiki/Q4836911 | not the same as | Brigham Young University Life Science Museum | 2024-07-01 | Jacob Van Veldhuizen | dlm

The BYU Museum of Paleontology and the BYU Life Science Museum are two different organizations. Here are their websites so you can confirm:

New list here: cf_temp_pre_bulk_agent_vert paleo agents only.csv

dustymc commented 3 weeks ago

> David Taylor

You can also just create the agent manually (where everything involves IDs instead of strings).

> as indicated by the "not the same as" attribute

Sorry, I didn't look very carefully (was aiming for general considerations, not specifics!), thanks!

> New list

running....

https://docs.google.com/spreadsheets/d/1SBF83EZncUko6u1KkVzbQdhaPGDULnNVNKuSEn6Leak/edit?usp=sharing

I suppose I should just load that??? @mkoo

dustymc commented 2 weeks ago

@javanveldhuizen I found a problem on my end and am rolling a partial load back, but during that I noticed

Ward Scientific, Wards National Science

in these data. Surely those are both duplicates of https://arctos.database.museum/agent/21293521?

javanveldhuizen commented 2 weeks ago

@dustymc I deleted those agents. They need some verification. New list here: cf_temp_pre_bulk_agent_vert paleo agents only.csv

dustymc commented 2 weeks ago

Done and blamed on you @javanveldhuizen

There's one full-duplicate low-data copy of another low-data agent that maybe ought to have something done with it.

 agent_id | agent_type | preferred_agent_name |       creator        |        created_date        
----------+------------+----------------------+----------------------+----------------------------
 21354938 | person     | Scott Parker         | Jacob Van Veldhuizen | 2024-07-03 14:40:17.101114
 21257771 | person     | Scott Parker         | unknown              | 2013-12-16 21:49:31
(2 rows)
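
A pair like this one could also be caught by grouping agents on type and preferred name. The sketch below is illustrative only: the rows are copied from the output above plus the David Taylor record mentioned earlier in the thread, the tuple layout is an assumption, and it says nothing about how Arctos stores or queries agents.

```python
from collections import defaultdict


def duplicate_names(rows):
    """Group (agent_id, agent_type, preferred_name) rows; keep repeated names."""
    by_name = defaultdict(list)
    for agent_id, agent_type, name in rows:
        by_name[(agent_type, name)].append(agent_id)
    return {key: ids for key, ids in by_name.items() if len(ids) > 1}


rows = [
    (21354938, "person", "Scott Parker"),
    (21257771, "person", "Scott Parker"),
    (21333592, "person", "David Taylor"),
]

for (agent_type, name), ids in duplicate_names(rows).items():
    print(f"{agent_type} {name!r} appears {len(ids)} times: {ids}")
```

An exact-name match only shows that two records share a string; whether they are the same person still has to be decided from other data.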

and one that errored out

cf_temp_agent_download(3).csv