Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

As an admin, I want a bulk import of metadata and full text from EEBO-TCP works so that I can add content to the site that is not available from HathiTrust or Gale/ECCO. #600

Closed: mnaydan closed this issue 3 months ago

mnaydan commented 8 months ago

Equivalent to ECCO issue #369.

73 records for import are listed in this Google Sheet; FYI, the metadata in it was gathered manually from the TCP site.

We'll need to look at the ProQuest metadata records we received from Joe Marciniak via email on 2/16/24 and the current "old" mappings (documented here), and see whether they work. My guess is no, since the metadata records we received from ProQuest are all from 2020 onward. We may need to wait until Paul gives us his version of the ProQuest MARC, or develop an improved set of mappings to map the ProQuest records onto the TCP data.

rlskoeser commented 6 months ago

@mnaydan documenting my questions so far while my notes still make sense to me:

The spreadsheet you shared has some duplicate items. In some cases it's a slightly different version of the URL (name.umdl.umich vs quod.lib.umich). It looks to me like most of these are intended to be different excerpts from the same volume, but they aren't listed as excerpts and don't have any page information. I've removed the duplicates from my working copy so I can proceed with developing the import, but wanted to flag this for you.

Do you want the notes field in the spreadsheet imported into admin / curator notes or ignored?

The only collection info I see in the spreadsheet is "OB?". My current import logic is associating items with the Original Bibliography collection on import. Is this sufficient? Do you want to do anything else with regard to collections?

Questions about excerpts:

And a question about page content: I noticed that some end pages have newline/whitespace content only; does it make sense to you to skip adding these to Solr? (It might be simpler to leave this for now; it doesn't do any harm, and I'd have to think about how to do the logic correctly. But we should think about skipping empty pages when we generate the text corpus.)
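If we do decide to skip them later, the check is simple. A minimal sketch, assuming pages are dicts with a "content" field; the function name and data shape are invented for illustration, not the actual ppa-django indexing code:

```python
# Sketch only: skip pages whose text is whitespace-only before indexing.
# The function name and page dict shape are illustrative assumptions,
# not the real ppa-django code.

def filter_indexable_pages(pages):
    """Yield only pages that have non-whitespace text content."""
    for page in pages:
        if page.get("content", "").strip():
            yield page


pages = [
    {"order": 1, "content": "Actual page text"},
    {"order": 2, "content": "\n   \n"},  # whitespace-only end page
]
print([p["order"] for p in filter_indexable_pages(pages)])  # prints [1]
```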

mnaydan commented 5 months ago

Answers to your questions, @rlskoeser:

  1. Duplicate items. I removed these from the spreadsheet. I'm not sure how they got there. Thank you for flagging.
  2. Notes field. Please import these into admin notes IF AND ONLY IF it is easy. If it will take more than 5 minutes, ignore them.
  3. Collections. Yes, associating items marked with a 'Y' in the OB column with Original Bibliography upon import is sufficient. I will ask Lottie to assign works to collections in the backend after developer import is complete.
  4. Questions about excerpts.
    • "Sequence number" was intended to be digital page range, but it appears the EEBO-TCP interface changed since we gathered those numbers, and I actually can't find them on the front end anymore.
    • You can ignore pub info from spreadsheet and pull from MARC instead.
    • "Section identifier" is what appears to be the unique identifier of a section in the URL in EEBO-TCP, in lieu of page numbers.
    • I can add sort titles in a column in the spreadsheet.
  5. Page content. If it's simpler, let's leave the whitespace in Solr. The question of what to do with empty pages is one we are addressing for the larger text corpus, anyway.
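For reference, the confirmed import behavior (notes into admin notes, OB = "Y" mapped to Original Bibliography) amounts to something like this sketch. The column names, field names, and helper are assumptions for illustration, not the actual importer code:

```python
# Illustrative sketch of the confirmed spreadsheet mapping; column and
# field names are assumptions, not the real ppa-django importer.

ORIGINAL_BIBLIOGRAPHY = "Original Bibliography"

def map_row(row):
    """Map one spreadsheet row (a dict) to import fields."""
    record = {
        "source_id": row["ID"],
        "notes": row.get("Notes", ""),  # goes into admin notes
        "collections": [],
    }
    # a 'Y' in the OB column assigns the Original Bibliography collection
    if row.get("OB", "").strip().upper() == "Y":
        record["collections"].append(ORIGINAL_BIBLIOGRAPHY)
    return record


row = {"ID": "A01225", "Notes": "check pagination", "OB": "Y"}
print(map_row(row)["collections"])  # prints ['Original Bibliography']
```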

Additional findings:

  1. Excerpts 404 on the front end. I can see only limited information to identify excerpts on the EEBO-TCP front end. Can you please tell me what information you need to pull an excerpt correctly, in lieu of the physical and digital page numbers we are used to having? The section identifier (between the volume ID and the ? in the URL, as in this example) is the closest thing I have with this new interface.
  2. Two items failed to get imported into staging, and I don't know why. They are A01225 and A12231.
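To show what I mean by "between the volume ID and the ?", here is a rough sketch that takes that description literally; the example URL and helper function are invented for illustration, since I don't know what patterns the importer actually uses:

```python
# Sketch only: extract "the part between the volume ID and the ?" from
# an EEBO-TCP-style URL. The example URL is an invented illustration;
# real URLs may differ.
from urllib.parse import urlparse

def section_identifier(url, volume_id):
    """Return the path segment(s) after the volume id segment."""
    path = urlparse(url).path  # the query string (after "?") is dropped
    segments = [seg for seg in path.split("/") if seg]
    for i, seg in enumerate(segments):
        if seg.startswith(volume_id):
            return "/".join(segments[i + 1:]) or None
    return None


url = "https://quod.lib.umich.edu/e/eebo/A01225.0001.001/1:2.3?rgn=div2"
print(section_identifier(url, "A01225"))  # prints 1:2.3
```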

Let me know if you want more thorough testing on the backend, but I thought I would prioritize getting answers to these questions. I'll work on the sort title now.

mnaydan commented 5 months ago

I added sort titles for the excerpts, as well as a field for Book/journal title, because I realized that field wasn't being imported correctly (it was being lumped into the title rather than separated out, due to the way we structured the data).

rlskoeser commented 3 months ago

@mnaydan thanks for all these answers to my questions and updated spreadsheet.

I've updated the eebo import code to pull sort title and book/journal title from the spreadsheet, and removed the comments and TODOs in my code where you've confirmed what I implemented on the first pass (admin notes, collection assignment, MARC metadata).

I also figured out why the excerpts were giving you a 404: we didn't previously support ":" as a page delimiter; I've adjusted the regex to handle that.
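For illustration, the delimiter change amounts to something like this; a simplified pattern, not the actual regex in the codebase:

```python
# Simplified illustration, not the real ppa-django regex: the old
# pattern only accepted "-" between page numbers; adding ":" to the
# character class lets section-style ranges like "1:2" match as well.
import re

PAGE_RANGE = re.compile(r"(\d+)[-:](\d+)")

print(PAGE_RANGE.match("12-34").groups())  # prints ('12', '34')
print(PAGE_RANGE.match("1:2").groups())    # prints ('1', '2')
```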

The two items that didn't get imported before are showing up in my test import; maybe I accidentally excluded them from my test set when I dropped duplicates.

I think it would be best if we look together at the section identifiers and page ranges. I also have a related question about how we should link to the eebo-tcp version of a record (I guess that is a question for #606).

Should I put the updated eebo import in staging now or do you want to look at excerpt section and page ranges together first?

mnaydan commented 3 months ago

@rlskoeser let's wait until we look at the excerpt section/page ranges together, unless you think that looking at the updated eebo import in staging would help us think through/decide on answers to your remaining questions.

mnaydan commented 3 months ago

@rlskoeser I updated the spreadsheet to include the adjustments we discussed:

I can test whenever it's in staging.

mnaydan commented 3 months ago

As far as I'm concerned, the bulk import of metadata and full text from a script was successful, and this issue can be closed. 66 works were listed in the spreadsheet for import, and all 66 were successfully imported, indexed, and logged in the backend. Minor issues I've found are being tracked in other open issues.