Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

Investigate viability of EEBO-TCP integration #571

Closed mnaydan closed 9 months ago

mnaydan commented 10 months ago

@rlskoeser Here is a working list of titles we are interested in adding (our undergraduate intern is actively working on this spreadsheet, FYI!). The spreadsheet includes URLs to the items and has a separate column for the volume IDs, which I took from the URL bar of the EEBO-TCP site. There are some sample excerpts in the spreadsheet too; the page references are strange, however. Most of the pages are "unnumbered," so I used the numbers that change in the page URL following the ID, which seem to indicate the full section.

We're interested at this point in getting a sense of what the data structure is like and how easy or hard it might be to integrate it into the existing PPA structure (for instance, are the unnumbered pages a big blocker? Is there metadata for those somewhere else that I'm not seeing?). Are image thumbnails impossible to include, since they don't seem to be on the TCP site? How would we pull the metadata? Etc.

rlskoeser commented 9 months ago

@mnaydan I started looking into this last week and should have added some notes while it was fresh.

The XML structure doesn't look too bad; I think we'll have to write some custom parsing code, but it doesn't need to be very complicated since we mainly want pages, page numbers, and plain text. (This is based on looking at the collection that's published on GitHub; I'm assuming the rest are similar.) I don't think unnumbered pages are a blocker; we'd just want a way to handle pages with no label.
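For reference, here's a minimal parsing sketch. It assumes TEI-style `<pb>` page-break milestones with an optional `n` attribute carrying the page label; the actual element names and structure would need to be confirmed against the TCP files.

```python
from lxml import etree


def parse_pages(xml_path):
    """Yield (page_label, page_text) pairs from a TEI-style EEBO-TCP file.

    Pages without an n attribute ("unnumbered" pages) get an empty label.
    """
    tree = etree.parse(xml_path)
    label, chunks = None, []
    for el in tree.iter():
        # skip comments/processing instructions when checking the tag name
        tag = etree.QName(el).localname.lower() if isinstance(el.tag, str) else None
        if tag == "pb":
            # emit the text collected since the previous page break
            if label is not None or chunks:
                yield label or "", "".join(chunks).strip()
            label, chunks = el.get("n"), []
        if isinstance(el.tag, str) and el.text:
            chunks.append(el.text)
        # tail text follows the element in document order
        if el.tail:
            chunks.append(el.tail)
    # emit the final page
    if label is not None or chunks:
        yield label or "", "".join(chunks).strip()
```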

I think the metadata would be similar to what we did for the Gale/ECCO records, where we'd have to rely on MARC records already purchased by PUL. Fortunately they provide a mapping for us: https://textcreationpartnership.org/using-tcp-content/eebo-tcp-cataloging-records/

I am wondering if there's a way for us to use PUL's [unadvertised] bib API to get MARC records, instead of having to store a local copy that we can look up as needed (as we did for ECCO). That would depend on whether we can look records up by the ID we'll need to use and whether the API is fast enough (assuming we're allowed to use it for this). I expect PUL owns the MARC records, but we should confirm.
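If the API route did pan out, the lookup could be as simple as something along these lines. The endpoint URL and ID scheme below are placeholders, not the real PUL API; both would need to be confirmed with PUL.

```python
import requests

# Placeholder endpoint; the actual bib API URL and identifier scheme
# would need to be confirmed with PUL before relying on this.
BIB_API_URL = "https://bibdata.example.edu/bibliographic/{bib_id}"


def get_marc_record(bib_id, timeout=10):
    """Fetch a single MARC record by bib id; return the response body or None."""
    resp = requests.get(BIB_API_URL.format(bib_id=bib_id), timeout=timeout)
    if resp.ok:
        return resp.text
    return None
```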

I was wondering how we get the full text, and I found in the FAQ that there are some bulk downloads; I'm not certain whether this is what we want or not: https://textcreationpartnership.org/faq/#faq05 (There's a collection published on GitHub, but it isn't all of the texts.)

They do have a list of projects using the content; we can ask them to add PPA when/if we import content: https://textcreationpartnership.org/using-tcp-content/projects-and-publications-using-tcp-texts/

I don't see any mention of images anywhere; getting access to images for thumbnails might require negotiating with ProQuest (if it's even possible).

We might want to be in contact with the TCP folks at UMich to let them know what we're working on, and to ask them for advice on how we might get thumbnail images.

mnaydan commented 9 months ago

After looking back at our ECCO metadata documentation, I emailed Joe Marciniak at PUL to request access to the EEBO MARC records via a locally stored copy, XML format preferred (we decided it would be smart to repurpose the code we used for ECCO as much as possible, rather than going the API route).
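For repurposing the ECCO approach, loading the locally stored MARCXML and indexing it by record ID might look roughly like the sketch below. This assumes pymarc and a single MARCXML file; which field actually carries the identifier we need (a TCP ID vs. an ESTC or OCLC number, for instance) still has to be confirmed against the records PUL provides.

```python
import pymarc


def load_marc_records(marcxml_path):
    """Build a lookup of MARC records keyed by control number (field 001)."""
    records = {}
    for record in pymarc.parse_xml_to_array(marcxml_path):
        # field 001 is the control number; the real lookup key may differ
        control_fields = record.get_fields("001")
        if control_fields:
            records[control_fields[0].value()] = record
    return records
```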

I also emailed the TCP folks at tcp-info@umich.edu to ask them about the full-text and thumbnails.

rlskoeser commented 9 months ago

@mnaydan I downloaded the bulk exports and did a quick check against the IDs in the spreadsheet you shared and the ID lists in the text files they provide: all of our IDs are present; the majority of them are included in phase 1, and only three are in the phase 2 set.
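The check itself was just set membership against the ID list files, roughly along these lines (the file names and spreadsheet column header here are hypothetical):

```python
import csv


def check_ids(spreadsheet_csv, phase1_ids_txt, phase2_ids_txt, id_column="Volume ID"):
    """Report which spreadsheet IDs fall in the phase 1 / phase 2 ID lists."""
    with open(phase1_ids_txt) as infile:
        phase1 = {line.strip() for line in infile if line.strip()}
    with open(phase2_ids_txt) as infile:
        phase2 = {line.strip() for line in infile if line.strip()}

    counts = {"phase1": 0, "phase2": 0}
    missing = []
    with open(spreadsheet_csv, newline="") as infile:
        for row in csv.DictReader(infile):
            vol_id = row[id_column].strip()
            if vol_id in phase1:
                counts["phase1"] += 1
            elif vol_id in phase2:
                counts["phase2"] += 1
            else:
                missing.append(vol_id)
    return counts, missing
```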

jerielizabeth commented 9 months ago

Investigation task is done; we should be able to integrate.