Princeton-CDH / mep-django

Shakespeare and Company Project - Python/Django web application
https://shakespeareandco.princeton.edu
Apache License 2.0

As a global admin, I want a one-time import of data from the personography XML file so that I can manage person information in the database. #9

Closed rlskoeser closed 6 years ago

rlskoeser commented 7 years ago

Testing notes and suggestions

Update 9/20 for revisions on import logic


development updates

rlskoeser commented 7 years ago

I have an initial version of the personography import working, and lots of detail questions about the things I discovered while writing it.

I think most of these are database decisions and would appreciate feedback from @jabauer, but @jkotin and @elspethgreen are welcome to weigh in as well.

Let me know if looking at the initial import would help answer these questions; we can set up the test site and do a test load.

questions and notes:

jkotin commented 7 years ago

Hi all -- I've made some notes below. -- J

questions and notes:

• @jabauer personography has people in three lists, with labels "expat", "in-logbooks", and "others"; any thoughts where this should go? (notes?) @jkotin has talked about wanting to mark people as library member, library member with extant cards, other - but I think we'll be able to infer that once we add the logbook & card data.

I just want to be sure that we can isolate (via a search/filter feature) members with extant cards. We need to be able to analyze the relation between members with extant cards and the membership as a whole -- in a given year, over time, by degree of fame, etc.

• variations in names: we have people with birth and married names, people with multiple first names, people with a tag ('de Beauvoir), people with a nickname documented, documented pseudonyms, initials (I can provide examples in the xml for all of these if that would be helpful). @jabauer I remember you said something about how you wanted to handle birth names, but I don't remember. I think multiple first names or initials should go in the first name field. Name information that doesn't fit can go into the notes, maybe? What about names like de Beauvoir?

• nationalities: Walter Benjamin is marked as stateless until 1935 and then German. @jabauer should dates on nationalities and "stateless" just go in the notes? (I think I recall you discussed this and intentionally kept nationality handling fairly simple)

• titles: the database model has a title field; I was guessing this was for mr/mrs/dr/m/mme etc, but I don't see anything in the xml. Am I missing anything?

Some members should have a field for titles/ranks; e.g. Comtesse. At some point in the history of the project, we thought about what to do with the many appearances of "mrs" and "mme" on the cards -- e.g. Mrs. Baker -- especially when we could not locate a first name. We ended up keeping that information in the notes. The information is important because it indicates sex, marital status, and, often, linguistic preference. I'm not sure how much of this information is relevant at this juncture.

• addresses: a number of addresses include names, e.g. the name of the hotel; should this go into the address line 1 or should we add an optional name field to the address model?

• side note on addresses: 11 addresses have that looks like a to me, I'm going to treat them as such (whatever we decide to do with names)

• Did we ever decide how we wanted to handle the Parisian arrondissements? I understand I can get it from the last two digits in the Paris postal code. @jabauer should we go ahead and create a distinct database field so we can filter on them directly, or should we just pull a substring from the Paris post codes dynamically when we index for front-end search?
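For the arrondissement question, here is a minimal sketch of the "pull it from the postal code" option. The function name `arrondissement` is hypothetical and not part of the project; it assumes five-digit Paris codes beginning with 75 whose last two digits are 01 through 20 (which also covers 75116).

```python
from typing import Optional


def arrondissement(postal_code: str) -> Optional[int]:
    """Return the arrondissement (1-20) for a Paris postal code, else None.

    Hypothetical helper: assumes five-digit codes starting with "75" whose
    last two digits give the arrondissement.
    """
    code = postal_code.strip()
    if len(code) != 5 or not code.isdigit() or not code.startswith("75"):
        return None
    number = int(code[-2:])
    # Reject values such as 75000 or 75999 that do not name an arrondissement.
    return number if 1 <= number <= 20 else None


if __name__ == "__main__":
    for sample in ("75006", "75116", "69001"):
        print(sample, arrondissement(sample))
```

The same function would work either way: called once at import time to fill a distinct database field, or called at index time if the value is only needed for front-end search.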

rlskoeser commented 7 years ago

@jkotin question for you: how do you define "expat"? Is this something we can infer based on nationality and addresses? (I looked at the xml file, but it's not obvious to me from the nationalities and addresses in the expat list.)

@jabauer and I talked through the other issues I've identified and decided how to handle them. Documenting decisions here:

A lot of addresses have cities but no country; I'll provide a list and ask for a list of corresponding countries that can be automatically set on import.

jkotin commented 7 years ago

Hi Rebecca --

I'm back in Boston and should have time, finally, to test the site. Thank you Ellie for getting things started!

Re: expat -- that's a good question. (I apologize in advance for this long explanation.) We haven't been defining the term. We used it in the title of MEP because that's how the community of Anglophone writers associated with Beach is usually discussed in both academic and popular literature. It's a problematic term: most of the members of Shakespeare and Company were not expatriates at all -- they were French. It's also a tricky term to use to describe non-French members. Russians in Paris in the 1920s, for example, are usually called exiles or emigres -- Wikipedia has a page about the "White emigres." And it's difficult to distinguish expatriates and tourists and students. When does a tourist become an expat? But the benefits of the term, I think, outweigh these drawbacks -- everyone knows what we're talking about when we say "expatriate Paris." But I would love advice about how to handle the ambiguity of the term in the database (if necessary) and on the site.

By the way, "The Lost Generation" is a similarly problematic and useful term. Technically, Stein, Joyce, and Pound (and Beach herself) are too old to count as members of the Lost Generation. Stein coined the term to refer to men and women who became adults during WWI. (Hemingway was born in 1899, for example.) But most people still use the term to identify the blossoming of modernist literature in the 1920s and 1930s. Noel Riley Fitch titled her history of Shakespeare and Company, Sylvia Beach and the Lost Generation: A History of Literary Paris in the Twenties and Thirties.

Again, sorry to go on and on! But I thought it might be useful to make these notes now and later incorporate them in a glossary on the site.

Josh


Joshua Kotin, Associate Professor Princeton University

http://press.princeton.edu/titles/11207.html

On Jul 27, 2017, at 12:49 PM, Rebecca Sutton Koeser notifications@github.com wrote:

@jkotin question for you: how do you define "expat"? Is this something we can infer based on nationality and addresses? (I looked at the xml file, but it's not obvious to me from the nationalities and addresses in the expat list.)

@jabauer and I talked through the other issues I've identified and decided how to handle them. Documenting decisions here:

• nicknames should go in the notes

• revise naming handling: name (all names in one field), sort name/authorized name: lastname, firstname (prepopulate from viaf where possible)

• address: make city required (only one item in personography needs to be fixed for this)

• address: adding care_of optional link to person

• people with pseudonyms, e.g. bryher - use viaf authorized name, put "bryher (real name in parens)"

• "stateless": country name "[no country]"; dates go in notes on the person

• revise address fields: name (optional) and street_address (instead of street address 1 & 2)

• friend of renoir, no name: use the label we have as full name: "[Friend of Renoir]" (put something in note to flag for cleanup)

A lot of addresses have cities but no country; I'll provide a list and ask for a list of corresponding countries that can be automatically set on import.
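The field decisions listed above could be sketched roughly as follows. This uses plain dataclasses rather than Django models so that it runs on its own; the class layout, the `make_sort_name` helper, and the wording of the cleanup flag are illustrative assumptions, and only the field names `name`, `sort_name`, `street_address`, and `care_of` and the "[no country]" and "[Friend of Renoir]" labels come from the list.

```python
from dataclasses import dataclass
from typing import Optional

# Placeholder country name for "stateless", per the decision above.
NO_COUNTRY = "[no country]"


@dataclass
class Person:
    name: str        # all names in one field
    sort_name: str   # "lastname, firstname" (authorized form where available)
    notes: str = ""  # nicknames, nationality dates, cleanup flags


@dataclass
class Address:
    city: str                         # required
    name: str = ""                    # optional, e.g. the name of a hotel
    street_address: str = ""          # replaces street address 1 & 2
    country: str = ""
    care_of: Optional[Person] = None  # optional link to a person


def make_sort_name(surname: str, forename: str = "") -> str:
    """Build a 'lastname, firstname' sort name; single names pass through."""
    return f"{surname}, {forename}" if forename else surname


# A person known only by a label keeps that label as the full name,
# with a note so the record can be found for cleanup later.
friend = Person(
    name="[Friend of Renoir]",
    sort_name="[Friend of Renoir]",
    notes="Name unknown; flagged for cleanup after import.",
)
```

In the real application these would presumably be Django model fields with the corresponding required/optional settings; the sketch only shows which pieces of information end up where.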


jabauer commented 7 years ago

Hi Josh -- It sounds like the best way to keep using "Expat" in the database is to create a boolean flag on the Person table to indicate whether someone was an Expat. That work would have to be done/reviewed by you or one of the research assistants. If you are uncomfortable with assigning specific people in the database as "Expat" that's fine too, but then it won't be a search term.

rlskoeser commented 7 years ago

It actually sounds to me like tagging would work well for this - then you can define however many tags you want, tag people in different ways, and describe how you are defining the terms you're using as tags. However, if we go that route we'll need to add it later.

My concern at this point is making sure we don't lose data on import - but it's not clear to me whether the categories in the xml personography are accurate. I see "expats", "in-logbooks", and "other", but the "expats" list includes a number of people with a nationality of France.

@jkotin do you know if the categories in the xml are still valid? Or are you comfortable with tagging or otherwise labeling people as expats later on?

jkotin commented 7 years ago

I think I misunderstood earlier! -- I thought I was responding to an abstract question about "expatriate" in Mapping Expatriate Paris.

Here's my answer to your actual question:

I think those categories are an attempt to differentiate among: 1/ members with extant cards; 2/ members without extant cards; and 3/ people involved in the world of Shakespeare and Company, but who were not members or who do not have a membership record.

I think Cliff (or someone) labeled these categories 1/ expats, 2/ in-logbooks, 3/ other, respectively.

Does this clarify things?

I am actually not interested in labeling who was an expat and who was not. But it is vitally important that we can filter the personography by the three categories above: 1/ members with extant cards; 2/ members without extant cards; and 3/ others.

My apologies for the confusion earlier.

Josh

On Jul 31, 2017, at 4:00 PM, Rebecca Sutton Koeser notifications@github.com wrote:

It actually sounds to me like tagging would work well for this - then you can define however many tags you want, tag people in different ways, and describe how you are defining the terms you're using as tags. However, if we go that route we'll need to add it later.

My concern at this point is making sure we don't lose data on import - but it's not clear to me whether the categories in the xml personography are accurate. I see "expats", "in-logbooks", and "other", but the "expats" list includes a number of people with a nationality of France.

@jkotin do you know if the categories in the xml are still valid? Or are you comfortable with tagging or otherwise labeling people as expats later on?


rlskoeser commented 7 years ago

@jkotin thanks for clarifying, that is exactly what I needed to know. I apologize that my earlier question about defining "expat" wasn't clearer.
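A minimal sketch of how the three list labels might be carried through the import so that people can be filtered by category later. The enum, the mapping, and both function names are hypothetical. The labels are spelled slightly differently in different places in this thread ("expat"/"expats", "other"/"others"), so the sketch accepts both spellings, and the mapping itself follows the tentative reading given above rather than anything confirmed against the XML.

```python
from enum import Enum
from typing import Iterable, List, Tuple


class Category(Enum):
    EXTANT_CARDS = "member with extant cards"
    NO_EXTANT_CARDS = "member without extant cards"
    OTHER = "other"


# Assumed mapping from personography list labels to categories.
LABEL_TO_CATEGORY = {
    "expat": Category.EXTANT_CARDS,
    "expats": Category.EXTANT_CARDS,
    "in-logbooks": Category.NO_EXTANT_CARDS,
    "other": Category.OTHER,
    "others": Category.OTHER,
}


def category_for(label: str) -> Category:
    """Translate a list label into a category, failing loudly if unknown."""
    key = label.strip().lower()
    if key not in LABEL_TO_CATEGORY:
        raise ValueError(f"unrecognized personography list label: {label!r}")
    return LABEL_TO_CATEGORY[key]


def filter_by_category(
    people: Iterable[Tuple[str, str]], category: Category
) -> List[str]:
    """Return the ids of (person_id, list_label) pairs in the given category."""
    return [pid for pid, label in people if category_for(label) is category]
```

Raising on an unknown label, rather than silently defaulting to "other", fits the goal of not losing data on import.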

rlskoeser commented 7 years ago

It occurred to me that it might help you with your testing if I gave you some specific items to check for some of the different cases that come up in the xml data. I'll provide person ids so you can find them in both the xml and the admin interface and compare. I ran the import from the develop branch.

This isn't exhaustive, but meant to give you an idea of where to start looking (and hopefully give you more confidence about what the import is doing). At this point, I would say that you should expect to do some manual clean up after we run the import in production, but we do want to catch any programmatic errors or problems that are affecting a large number of records.

i-davis commented 7 years ago

I haven't been able to go through much of this yet, but the only issue I've found so far, & I'm not really sure if it counts as one, is that Claude Cahun was listed as "Cahun Claude" in the "Name" field, a departure from the usual firstname lastname order. Don't know if that's a problem, or indicates one? I changed it manually.

clmahoney commented 7 years ago

I found the same with Camille Mayran, who was listed as "Mayran Camille". But also her name is pretty complicated in its listing: "Camille Mayran (Henriette Sophie Marianne (Saint-René Taillandier) Hepp)"

elspethgreen commented 7 years ago

The Walter Benjamin stateless record looks pretty good! It imported the "notBefore" and "notAfter" tags for the dates of those nationalities in a note. Preserving the XML syntax in that way looks a little funky and takes a second to process, but I think it's fine.
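For reference, a rough sketch of how dated nationalities can be turned into a country list plus a note, using the standard library's `xml.etree.ElementTree`. The sample XML, the `nationality` element layout, and the function name are assumptions made for illustration; they are not copied from the personography or from the import script. The note wording here is also just one option and may differ from what the import actually writes.

```python
import xml.etree.ElementTree as ET

NO_COUNTRY = "[no country]"

# Hypothetical sample, modelled on the case described in this thread.
SAMPLE = """
<person>
  <nationality notAfter="1935">stateless</nationality>
  <nationality notBefore="1935">German</nationality>
</person>
"""


def nationalities_and_notes(person):
    """Return (countries, notes) for a person element.

    "stateless" becomes the placeholder country; any notBefore/notAfter
    attributes are preserved verbatim in a note so no dates are lost.
    """
    countries, notes = [], []
    for nat in person.findall("nationality"):
        label = (nat.text or "").strip()
        countries.append(NO_COUNTRY if label.lower() == "stateless" else label)
        dates = [
            f'{attr}="{nat.get(attr)}"'
            for attr in ("notBefore", "notAfter")
            if nat.get(attr)
        ]
        if dates:
            notes.append(f"Nationality {label}: {' '.join(dates)}")
    return countries, notes


if __name__ == "__main__":
    print(nationalities_and_notes(ET.fromstring(SAMPLE.strip())))
```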

elspethgreen commented 7 years ago

And Benjamin's URL from the personography was successfully imported with a note saying "from XML import"

rlskoeser commented 7 years ago

Looked at Claude Cahun and Camille Mayran - it seems my import doesn't correctly handle surname/forename on pseudonyms. I must have missed this because the pseudonyms are inconsistently tagged (Bryher has no surname/forename tags within the pseudonym). There are probably few enough of these that you could correct them manually if necessary (to skip another round of revising and testing the import script).
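If we do revise the script, one way to handle both tagged and untagged pseudonyms is sketched below. The element names (`persName`, `surname`, `forename`) and the sample XML are assumptions for illustration, and reading the text in document order is only a guess at how "Cahun Claude" could arise; I have not confirmed that is what the current import does.

```python
import xml.etree.ElementTree as ET


def display_name(name_el):
    """Return 'forename surname' when those tags exist, else the plain text."""
    forename = name_el.findtext("forename")
    surname = name_el.findtext("surname")
    if forename or surname:
        # Order explicitly instead of relying on document order.
        parts = [p.strip() for p in (forename, surname) if p and p.strip()]
        return " ".join(parts)
    # Untagged pseudonym: collapse whitespace in whatever text is present.
    return " ".join("".join(name_el.itertext()).split())


# Hypothetical samples: one pseudonym with surname/forename tags, one without.
tagged = ET.fromstring(
    '<persName type="pseudonym">'
    "<surname>Cahun</surname> <forename>Claude</forename>"
    "</persName>"
)
untagged = ET.fromstring('<persName type="pseudonym">Bryher</persName>')

if __name__ == "__main__":
    print(display_name(tagged))
    print(display_name(untagged))
```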

The complicated listing for Camille Mayran is based on all the names in the XML. That's another case that might be easier to clean up in the database after the import is done, and after we've decided how to handle multiple names. (I forget how many people there are with multiple names like this.)

Thanks for checking the Benjamin record. My goal is to avoid losing any data in the import; you all can clean up the notes and rearrange them however you like after we run the initial import in production.

Let me know what other problems you find with the imported data - that will help us decide whether it's worth revising the script and doing another round of testing or just making a list of names to manually clean up after the initial import.

elspethgreen commented 7 years ago

There might be a problem with the Worthing records--I'm not sure. The note on "worthing2" seems to have incorrectly imported the mepID.
The text of the note in the personography is this:

Identified in the logbooks as Mrs Worthing. Assuming she was married to the person named Mr Worthing, who is listed twice in the logbooks. After 1927 subscriptions are sold to a person without distinguishing role (Worthing); we are assuming that person is Mrs Worthing.

The text of the note in the import is this: Identified in the logbooks as Mrs Worthing. Assuming she was married to the person named Mr Worthing, who is listed twice in the logbooks. After 1927 subscriptions are sold to a person without distinguishing role ( [#worthing1]Worthing); we are assuming that person is Mrs Worthing.

Weirdly, the #worthing1 tag is affixed to the wrong Worthing--to the second "Worthing" in the note instead of the first "Mr. Worthing." This is actually kind of important, because the text of the note says that we suspect that second Worthing is actually Mrs. Worthing, i.e. #worthing2.
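To help narrow this down, here is a sketch of how a note containing an inline reference might be flattened into the "[#id]text" form seen in the imported note. The `ref` element name, the `target` attribute, the sample ids, and the function name are all assumptions for illustration and not taken from the import script. In a flattening like this the marker lands in front of whichever span the reference element wraps, so comparing against the XML source should show whether the placement comes from the source markup or from the import.

```python
import xml.etree.ElementTree as ET


def flatten_note(note):
    """Flatten mixed-content note XML to text, marking refs as [target]text."""
    parts = [note.text or ""]
    for child in note:
        if child.tag == "ref" and child.get("target"):
            parts.append(f"[{child.get('target')}]")
        # Text inside the child, then the text that follows it (its tail).
        parts.append("".join(child.itertext()))
        parts.append(child.tail or "")
    # Collapse runs of whitespace left over from the XML layout.
    return " ".join("".join(parts).split())


# Hypothetical note; the id and names are placeholders.
SAMPLE_NOTE = ET.fromstring(
    '<note>Married to <ref target="#p1">Mr Example</ref>, '
    "who is listed twice.</note>"
)

if __name__ == "__main__":
    print(flatten_note(SAMPLE_NOTE))
```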

elspethgreen commented 7 years ago

The Mlle Dufour (#dufo) record looks like the mepID in its note is right

elspethgreen commented 7 years ago

Multiple addresses and multiple notes seem to be handled well!

i-davis commented 7 years ago

all nicknames are in notes -

i-davis commented 7 years ago

(I think we've now been through all the suggestions in the list you provide above, @rlskoeser! That sound right to you two, @elspethgreen & @clmahoney? I'll keep looking around for other issues!)

rlskoeser commented 7 years ago

@elspethgreen we will need to re-open this if we need to make any changes to name handling based on #23, or if you want me to address any of the other issues you all found. However, if you all are willing to do manual cleanup on those items (which I think don't affect a huge number of records?) we could move forward.