NEU-Libraries / cerberus

Digital Repository Service

Spreadsheet loaders #712

Closed sarahjeansweeney closed 8 years ago

sarahjeansweeney commented 9 years ago

Spreadsheet loader for just metadata: Archives staff are digitizing materials and depositing directly into the DRS (see https://repository.library.northeastern.edu/collections/neu:rx913r06d and https://repository.library.northeastern.edu/collections/neu:rx913v50k). After they upload the file they enter metadata into a Google form, which populates a Google spreadsheet. We will need a loader to process the spreadsheet, create metadata records, and replace the original stub records with the new metadata.

Spreadsheet loader for metadata and files: Same as above, but the files will be loaded along with the metadata.

elizoller commented 8 years ago

screen shot 2016-05-03 at 11 16 37 am

^ What I've come up with so far for a "preview" page on the MODS loader.

sarahjeansweeney commented 8 years ago

This looks great so far. I have a few questions/suggestions:

Is the XML editable? I think we talked about this yesterday but somehow my memory is fuzzy.

elizoller commented 8 years ago

Initial pass at MODS diff:

screen shot 2016-05-03 at 3 35 26 pm

sarahjeansweeney commented 8 years ago

Yes, the load start and depositor name changes are helpful. Preview looks great!

elizoller commented 8 years ago

@sarahjeansweeney for fields with authorities - how do you want to handle those? will they be another field in the spreadsheet? then we can just assign them like we assign any other field value.

elizoller commented 8 years ago

@sarahjeansweeney for the "place of publication" field in the spreadsheet, we currently have two fields under place - one for city and one for state (both have the field name placeTerm but with different attributes)

t.place(path: 'place', namespace_prefix: 'mods'){
  t.city_term(path: 'placeTerm', namespace_prefix: 'mods', attributes: { type: 'text' })
  t.state_term(path: 'placeTerm', namespace_prefix: 'mods', attributes: { type: 'code', authority: 'marccountry' })
}

I am wondering if we "made up" this distinction between city and state with different attributes? Would you like to continue with this method? If so, I would advocate for making city and state separate fields in the spreadsheet. Parsing on the comma would be unreliable: when there is no comma, how would we know whether the value is a city or a state?

elizoller commented 8 years ago

@sarahjeansweeney what does the "reformatting quality" field in the spreadsheet map to in mods?

sarahjeansweeney commented 8 years ago

re: authorities: This just came up the other day, and it was decided we would add a new column to the spreadsheet so the catalogers could explicitly state the value's authority.

re: place: As far as I can tell, the place of publication value shouldn't need to be split into city and state. It should map to the originInfo/place/placeTerm element, which doesn't have subelements. subject/hierarchicalGeographic terms do split into city, state, country, etc subelements.

sarahjeansweeney commented 8 years ago

@elizoller re: reformatting quality: reformattingQuality is a subelement of physicalDescription, e.g.:

   <mods:physicalDescription>
      <mods:extent>1 postcard : 9 x 14.2 cm.</mods:extent>
      <mods:digitalOrigin>reformatted digital</mods:digitalOrigin>
      <mods:reformattingQuality>access</mods:reformattingQuality>
   </mods:physicalDescription>

It only has three allowed values: access, preservation, replacement
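
A minimal sketch of how the loader could check this field before writing it into the MODS. The constant and method names are hypothetical (not taken from the cerberus code), and trimming and downcasing the cell value is an assumption about how forgiving the spreadsheet handling should be:

```ruby
# Hypothetical names for illustration; not from the cerberus codebase.
REFORMATTING_QUALITY_VALUES = %w[access preservation replacement].freeze

# Returns the normalized value, or nil when the cell is blank.
# Raises ArgumentError when the value is not one of the three allowed terms.
# Assumption: surrounding whitespace and letter case in the cell are ignored.
def normalize_reformatting_quality(cell)
  value = cell.to_s.strip.downcase
  return nil if value.empty?

  unless REFORMATTING_QUALITY_VALUES.include?(value)
    raise ArgumentError,
          "reformattingQuality must be one of " \
          "#{REFORMATTING_QUALITY_VALUES.join(', ')} (got #{cell.inspect})"
  end

  value
end
```

Raising on an unknown value is one option; the preview page could instead collect the message and show it next to the row.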

elizoller commented 8 years ago

Ok, thanks. On further investigation of the city_term/state_term thing, the only place in the code that seems to be using this functionality is the iptc loaders which pass iptc city and state values (iptc stores them separately). Should I merge those into a single string and store it as a single placeTerm in the mods? (And keep the mods spreadsheet handling with a single place field)

sarahjeansweeney commented 8 years ago

@elizoller yes, for the originInfo placeTerm field they should be in a single string. the spreadsheet will stay the same.
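
For illustration, the merge for the IPTC loaders could look roughly like this. The method name is hypothetical, and joining with a comma and a space is an assumption about the desired display format:

```ruby
# Hypothetical helper: joins the separately stored IPTC city and state values
# into the single string used for originInfo/place/placeTerm.
# Blank or missing parts are skipped; returns nil when both are blank.
def iptc_place_term(city, state)
  parts = [city, state].map { |part| part.to_s.strip }.reject(&:empty?)
  parts.empty? ? nil : parts.join(', ')
end
```

With this sketch, a record that only has a city still gets a placeTerm, and a record with neither value gets no placeTerm at all.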

elizoller commented 8 years ago

For table of contents, are we just doing a top-level element like:

<mods:tableOfContents>text goes here</mods:tableOfContents>

sarahjeansweeney commented 8 years ago

Yup, table of contents is pretty simple.

sarahjeansweeney commented 8 years ago

Thinking about user experience for all our new loaders, let's organize them based on what the user is doing, not by uploaded file type:

Metadata Overwrite Tool

New File Loader

sarahjeansweeney commented 8 years ago

I'm running into a system error when I try to use the spreadsheet loader on staging:

2016-05-26_1024

dgcliff commented 8 years ago

I'll bet a very large drink with a tiny umbrella in it, it's because it's Thursday. I'll restart some things.

sarahjeansweeney commented 8 years ago

FWIW it was working earlier.

elizoller commented 8 years ago

Staging should be fixed now.

sarahjeansweeney commented 8 years ago

Two spreadsheet fields aren't being processed into MODS: Supplied title ("Is this a supplied title?") and any of the name affiliation fields. I'll send the spreadsheet over slack, but the resulting record is here: http://cerberus.library.northeastern.edu/files/neu:nz806157s

elizoller commented 8 years ago

WHEN YOU DEPLOY: Run console loop to mark all previous load reports as completed = true (default value is false)

sarahjeansweeney commented 8 years ago

I tested the upload spreadsheet + file upload process this morning and there are a few issues with how the MODS is being generated.

I didn't test a full MODS spreadsheet, just what lives in the Board of Trustees spreadsheet, but I'll try that next to make sure it wasn't just how the spreadsheet was formatted. I'll also share the spreadsheet I used over slack.

sarahjeansweeney commented 8 years ago

Here are the records I created, if that helps: http://cerberus.library.northeastern.edu/files/neu:nz8062499 http://cerberus.library.northeastern.edu/files/neu:nz8062545 http://cerberus.library.northeastern.edu/files/neu:nz806252m

dgcliff commented 8 years ago

@elizoller do you mean all previous load reports?

elizoller commented 8 years ago

Yeah sorry

sarahjeansweeney commented 8 years ago

Just tried another load. The preview page displayed the metadata as expected, with relatedItems and good dates, etc:

2016-05-27_1433

But when the record loaded, it loaded with the same MODS issues described above (http://cerberus.library.northeastern.edu/files/neu:nz806286f):

2016-05-27_1436

dgcliff commented 8 years ago

Probably caching, Sarah, which will be a quick patch. I'll manually expunge it to see if it works.

sarahjeansweeney commented 8 years ago

Getting this message now:

2016-05-27_1514-1

elizoller commented 8 years ago

I think this is because there is an empty column header. I will do a better check for that.
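
A rough sketch of what such a check might look like, assuming the header row has already been read into a plain array of cell values (the method name is hypothetical):

```ruby
# Hypothetical helper: returns the 1-based positions of blank column headers
# so the loader can report them instead of failing with a system error.
def blank_header_positions(header_row)
  header_row.each_with_index
            .select { |header, _index| header.to_s.strip.empty? }
            .map { |_header, index| index + 1 }
end
```

An empty result means every column has a header; otherwise the positions can be included in the message shown to the user.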

sarahjeansweeney commented 8 years ago

Just tested again and noticed a few other things:

Personal name: The given name for a personal name creator field was inserted in the MODS valueURI attribute:

image

Became

    <mods:name type="personal" valueURI="Sarah">
        <mods:namePart type="given">Sarah</mods:namePart>
        <mods:namePart type="family">Sweeney</mods:namePart>
        <mods:namePart/>
        <mods:role>
            <mods:roleTerm authority="marcrelator" authorityURI="http://id.loc.gov/vocabulary/relators" type="text">Creator</mods:roleTerm>
        </mods:role>
        <mods:namePart type="termsOfAddress">Ms.</mods:namePart>
        <mods:namePart type="date">1923 -</mods:namePart>
    </mods:name>

Corporate name: I may not have formatted this correctly, but the role and the URIs are missing from corporate name fields:

image

Became

    <mods:name type="corporate" usage="primary">
        <mods:namePart>Northeastern University (Boston, Mass.). Board of Trustees</mods:namePart>
    </mods:name>

Subject Topic: The authority value and URI were inserted in the same attribute:

image

Became

    <mods:subject authority="lcsh | URI">
        <mods:topic>College trustees</mods:topic>
        <mods:topic>Massachusetts</mods:topic>
        <mods:topic>Boston</mods:topic>
    </mods:subject>

Subject Name: The field value for subject name was inserted into the valueURI attribute, with a \ to escape the apostrophe:

image

Became

    <mods:subject>
        <mods:name type="corporate" authority="lcsh" valueURI="Boston Young Men\'s Christian Association">
            <mods:namePart>Boston Young Men's Christian Association</mods:namePart>
        </mods:name>
    </mods:subject>

sarahjeansweeney commented 8 years ago

Let's use "topical subject heading" and "name subject heading" for consistency in the subject column headers.

elizoller commented 8 years ago

Fixes are in for the above issues with URIs and deployed to staging (e04af55d5cf6425061bd30d650c1fc6cf030e10c)

sarahjeansweeney commented 8 years ago

Here's the Grouper group for loaders: northeastern:drs:repository:loaders:spreadsheet

We'll sort out the full permissions for all the loaders later.

dgcliff commented 8 years ago

Should be finished with ca6ca2a5fb66610f6c805b537d06d6b55d45fff2