Parse metadata <publisher> fields

jrladd commented 4 years ago

Using regular expressions and lxml, the process will turn this:

<biblFull>
    <publicationStmt xmlns="http://www.tei-c.org/ns/1.0">
         <publisher>reprinted by Andrew Crook and Samuel Helsham; and are to be sold by William Weston in Christ-Church-Lane,</publisher>
         <pubPlace>[Dublin :</pubPlace>
         <date>[1685]]</date>
      </publicationStmt>
</biblFull>

Into this:

<biblFull>
    <publicationStmt>
         <publisher>reprinted by <persName type="printer">Andrew Crook</persName> and <persName type="printer">Samuel Helsham</persName>; and are to be sold by <persName type="bookseller">William Weston</persName> in <placeName>Christ-Church-Lane</placeName>,</publisher>
         <pubPlace>[Dublin :</pubPlace>
         <date>[1685]]</date>
      </publicationStmt>
</biblFull>
<listPerson>    
    <person xml:id="crookandrew" type="printer">    
        <persName>Crook, Andrew</persName>    
        <birth when="1585" />
        <death when="1622" />
    </person>    
    <person xml:id="helshamsamuel" type="printer">    
        <persName>Helsham, Samuel</persName>    
    </person>    
    <person xml:id="westonwilliam" type="bookseller">    
        <persName>Weston, William</persName>    
    </person>    
</listPerson>    
<listPlace>      
    <place xml:id="locchristchurchlane">
        <placeName>Christ Church Lane</placeName>    
    </place>
</listPlace>
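
For anyone following along, here is a minimal sketch of the kind of regex pass described above. It is not the actual script: the cue phrases, the name pattern, and the helpers `find_names`, `tag_publisher`, and `make_id` are all hypothetical, and it uses the standard library's `xml.etree.ElementTree` as a stand-in for lxml. It handles only personal names, not places.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical cue phrases mapped to roles; the real patterns are not shown in this thread.
ROLE_CUES = [
    (re.compile(r"\b(?:re)?printed by\b", re.I), "printer"),
    (re.compile(r"\bsold by\b", re.I), "bookseller"),
]
# Naive name pattern (an assumption): two or more capitalized words separated by spaces.
NAME = re.compile(r"[A-Z][a-z]+(?: [A-Z][a-z]+)+")


def find_names(imprint):
    """Return (start, end, role) for each name that follows a role cue."""
    cues = sorted((m.end(), role) for pat, role in ROLE_CUES for m in pat.finditer(imprint))
    spans = []
    for m in NAME.finditer(imprint):
        role = None
        for pos, cue_role in cues:
            if pos <= m.start():
                role = cue_role  # the closest preceding cue wins
        if role is not None:
            spans.append((m.start(), m.end(), role))
    return spans


def tag_publisher(imprint):
    """Build a <publisher> element with each name wrapped in <persName type=...>."""
    pub = ET.Element("publisher")
    last, prev = 0, None
    for start, end, role in find_names(imprint):
        gap = imprint[last:start]
        if prev is None:
            pub.text = gap
        else:
            prev.tail = gap
        prev = ET.SubElement(pub, "persName", type=role)
        prev.text = imprint[start:end]
        last = end
    if prev is None:
        pub.text = imprint[last:]
    else:
        prev.tail = imprint[last:]
    return pub


def make_id(name):
    """Derive an id in the style of the example above: surname + forename, lowercased."""
    forename, _, surname = name.rpartition(" ")
    return re.sub(r"[^a-z]", "", (surname + forename).lower())


imprint = ("reprinted by Andrew Crook and Samuel Helsham; and are to be sold by "
           "William Weston in Christ-Church-Lane,")
print(ET.tostring(tag_publisher(imprint), encoding="unicode"))
print([make_id(imprint[s:e]) for s, e, _ in find_names(imprint)])
```

The "closest preceding cue" rule is what lets "Andrew Crook and Samuel Helsham" both come out as printers from a single "reprinted by". Imprints that are worded differently would need more cues than the two assumed here.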

This is mostly done already. Once the script is finished, I'll ping @dknoxwu and others to double-check it, and then we can use it to officially modify the metadata files.

n.b. This transformation is only occurring on the standalone metadata files in the epmetadata repository. The XML headers on the full text files will remain the same.

Once the publication field is done, we can start to do some interesting analyses of these printers and publishers. We can also work on cleaning up the date field.

jrladd commented 4 years ago

This is done! I created a new imprintparse branch in epmetadata, but I don't seem to have permission to push to that repo. @dknoxwu could you add me?

Once that's up, could @dknoxwu and/or @spenteco take a look at this code and its output? I should be accounting for most of the language in the <publisher> fields, but I'd be grateful if someone spotted anything that fell through the cracks. The new code is in bin/parse_imprint.py, and it includes some commented-out lines that will generate a CSV instead of the XML output. Thanks!
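
For reviewers who would rather scan a table than XML, a CSV export could look roughly like the sketch below. This is not the commented-out code in bin/parse_imprint.py; the column names, the `rows_from_publisher` and `to_csv` helpers, and the placeholder id `A00000` are assumptions for illustration, and it uses the standard library rather than lxml.

```python
import csv
import io
import xml.etree.ElementTree as ET

# A parsed <publisher> field in the shape shown at the top of this issue.
PARSED = (
    '<publisher>reprinted by <persName type="printer">Andrew Crook</persName> and '
    '<persName type="printer">Samuel Helsham</persName>; and are to be sold by '
    '<persName type="bookseller">William Weston</persName> in '
    '<placeName>Christ-Church-Lane</placeName>,</publisher>'
)


def rows_from_publisher(file_id, publisher_xml):
    """Return one (id, name, role) row per tagged person or place, in document order."""
    pub = ET.fromstring(publisher_xml)
    rows = []
    for el in pub.iter():
        if el.tag == "persName":
            rows.append((file_id, el.text, el.get("type", "")))
        elif el.tag == "placeName":
            rows.append((file_id, el.text, "place"))
    return rows


def to_csv(rows):
    """Serialize rows to CSV text with a header line (column names are assumed)."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["id", "name", "role"])
    writer.writerows(rows)
    return buf.getvalue()


print(to_csv(rows_from_publisher("A00000", PARSED)))
```

One row per tagged entity makes it easy to sort by role or name and spot anything the patterns missed.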

jrladd commented 4 years ago

Thanks, @dknoxwu! Code now available in the imprintparse branch.

jrladd commented 4 years ago

The parsing is finished and the results are in the imprintparse branch as we discussed! I opted to create new files in a parsedmeta directory of the form A00000_parsed.xml. These files contain the full metadata from the original files in sourcemeta, but with new markup for publishers, printers, booksellers, locations, and more.
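
For anyone scripting against the new directory, the naming convention can be sketched as below. The `parsed_path` helper is hypothetical, and the assumption that source files are named like `A00000.xml` is mine, not something stated in this thread.

```python
from pathlib import Path


def parsed_path(source_file, out_dir="parsedmeta"):
    """Map a source metadata file to its parsed counterpart, e.g. A00000_parsed.xml."""
    src = Path(source_file)
    return Path(out_dir) / f"{src.stem}_parsed.xml"


print(parsed_path("sourcemeta/A00000.xml"))
```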

To reiterate, the process was designed to get most of the way toward clean publisher metadata. Now that the parse is done, we can think about how we want to proceed with the inevitable hand-cleaning. I'm hopeful that the parsing will keep human intervention to a minimum. Once we have data we are really confident in, we can merge this branch into master.

In the new year, I'll begin to work with this data and write up the process in a series of notebooks/posts on the uses of EarlyPrint metadata. Among other things, I'm hoping to do some network analyses of printers and publishers. That work will also serve as QC for the parse: as I work with the data I'll probably catch a few more issues in the parsing process.

Even though the general work of cleaning metadata is still ongoing, I'm excited we've reached this stage and am closing this issue!

earlyprint / earlyprint.github.io

Parse metadata <publisher> fields #5