Closed: mnaydan closed this issue 3 months ago
@mnaydan documenting my questions so far while my notes still make sense to me:
The spreadsheet you shared has some duplicate items. In some cases it's a slightly different version of the url (name.umdl.umich vs quod.lib.umich). It looks to me like most of these are intended to be different excerpts from the same volume, but they aren't listed as excerpts and don't have any page information. I've removed the dupes from my working copy so I can proceed with developing the import, but wanted to flag for you.
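For reference, here is a minimal sketch of how duplicates like these could be flagged when the same volume appears under both hostnames. It is not the actual import code. It assumes each spreadsheet URL contains a TCP volume id shaped like `A12345.0001.001`; the id pattern, the function names, and the sample URLs in the usage note are all hypothetical.

```python
import re
from collections import defaultdict

# Assumption: a TCP volume id is one capital letter, five digits,
# then ".NNNN.NNN" (for example "A12345.0001.001").
TCP_ID = re.compile(r"[A-Z]\d{5}\.\d{4}\.\d{3}")


def tcp_id(url):
    """Return the volume id found in a URL, or None if there is none."""
    match = TCP_ID.search(url)
    return match.group(0) if match else None


def find_duplicates(urls):
    """Group URLs by volume id and return only ids listed more than once."""
    by_id = defaultdict(list)
    for url in urls:
        # Fall back to the raw URL when no id can be extracted,
        # so exact repeats are still caught.
        key = tcp_id(url) or url
        by_id[key].append(url)
    return {key: group for key, group in by_id.items() if len(group) > 1}
```

Calling `find_duplicates` on a list that contains, say, `http://name.umdl.umich.edu/A00001.0001.001` and `https://quod.lib.umich.edu/e/eebo/A00001.0001.001` would group both under the same id, so they can be reviewed as possible unlabeled excerpts rather than silently dropped.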
Do you want the notes field in the spreadsheet imported into admin / curator notes or ignored?
The only collection info I see in the spreadsheet is "OB?". My current import logic is associating items with the Original Bibliography collection on import. Is this sufficient? Do you want to do anything else with regard to collections?
Questions about excerpts:
And a question about page content: I noticed that some end pages have only newline / whitespace content. Does it make sense to you to skip adding these to Solr? (... this might be simpler to leave for now; it doesn't do any harm and I'd have to think about how to do the logic correctly, but we should think about skipping empty pages when we generate the text corpus)
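If we do decide to skip them, the check itself could be as small as the sketch below. This is only an illustration under assumptions: it treats page content as a plain string (or `None`), and `has_text` and `pages_to_index` are hypothetical names, not functions in the existing import or indexing code.

```python
def has_text(page_content):
    """True if the page has anything other than whitespace or newlines."""
    return bool(page_content and page_content.strip())


def pages_to_index(pages):
    """Yield (page number, content) pairs, skipping blank pages.

    Numbering starts at 1 and counts the skipped pages too, so the
    remaining pages keep their original positions in the volume.
    """
    for number, content in enumerate(pages, start=1):
        if has_text(content):
            yield number, content
```

Keeping the original page numbers for the pages that remain is the part that would need care, since page ranges for excerpts refer to positions in the full volume.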
Answers to your questions @rlskoeser :
Additional findings:
Let me know if you want more thorough testing on the backend; but I thought I would prioritize getting answers to these questions, and I'll work on the sort title now.
I added sort titles for the excerpts, as well as a field for Book/journal title because I realized that field wasn't being imported correctly (it was being lumped into the title rather than separated out because of the way we structured the data).
@mnaydan thanks for all these answers to my questions and updated spreadsheet.
I've updated the eebo import code to pull sort title and book/journal title from the spreadsheet, and removed the comments and TODOs in my code where you've confirmed what I implemented on the first pass (admin notes, collection assignment, MARC metadata).
I also figured out why the excerpts were giving you a 404: we didn't previously support ":" as a page delimiter; I've adjusted the regex to handle that.
The two items that didn't get imported before are showing up in my test import; maybe I accidentally excluded them from my test set when I dropped duplicates.
I think it would be best if we look together at the section identifiers and page ranges. I also have a related question about how we should link to the eebo-tcp version of a record (I guess that is a question for #606).
Should I put the updated eebo import in staging now or do you want to look at excerpt section and page ranges together first?
@rlskoeser let's wait until we look at the excerpt section/page ranges together, unless you think that looking at the updated eebo import in staging would help us think through/decide on answers to your remaining questions.
@rlskoeser I updated the spreadsheet to include the adjustments we discussed:
I can test whenever it's in staging.
As far as I'm concerned the bulk import of metadata and full text from a script was successful, and this issue can be closed. 66 works were listed in the spreadsheet for import, and 66 works were successfully imported and indexed, and logged in the backend. Minor issues I've found are being tracked on other open issues.
Equivalent to ECCO issue #369.
73 records for import are listed in this Google Sheet. FYI: the metadata in it was gathered manually from the TCP site.
We'll need to look at the ProQuest metadata records we received from Joe Marciniak via email on 2/16/24 and the current "old" mappings (documented here), and see whether the mappings work. My guess is no, since the metadata records we received from ProQuest are all 2020 onward. We may need to wait until Paul gives us either his version of the ProQuest MARC or an improved set of mappings to map the ProQuest records onto the TCP data.