Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

As a developer, I want to one-time bulk fix HathiTrust excerpt page ranges from a spreadsheet so that we can pull correct page content when we reindex. #625

Closed by mnaydan 5 months ago

mnaydan commented 5 months ago

acceptance criteria

rlskoeser commented 5 months ago

@mnaydan I'm writing the script to require a CSV that includes source_id, pages_orig, and new_pages_digital. I also prefer we do filtering on the CSV before using it with the script (it should only include rows for records we want updated). Does that sound ok to you?


For testing, I downloaded the first tab from Google Sheets, filtered out all the rows where the digital page range was marked as "correct", and renamed the "new digital range" column to new_pages_digital.

In case it's useful, I used grep to filter out correct rows:

grep --invert ,correct, excerpt_updates.csv > excerpt_update_changes.csv
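
The same filtering could be done in Python if more control is needed. A minimal sketch, assuming a status column literally containing the word "correct" for rows to skip (the real spreadsheet's column name may differ):

```python
import csv
import io

def filter_changed_rows(csv_text):
    """Keep only CSV rows whose digital page range is not marked 'correct'.

    Assumes a column named 'correct?' holding the word 'correct' for rows
    that should be skipped; adjust the column name to match the spreadsheet.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row.get("correct?", "").strip().lower() != "correct"
    ]
```

Unlike the grep one-liner, this only checks the intended column, so a source id or page range that happens to contain the string ",correct," would not be dropped by accident.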

I didn't notice that one of the rows was marked "SUPPRESS", but it turned out to be useful for testing error handling (that record failed to save).

mnaydan commented 5 months ago

@rlskoeser that sounds good generally, but here's the needle in the haystack... there was at least one case where the original page range changed (roman numerals) due to a typo I discovered in the original input. Will the original page range need to match the database field exactly? Would fixing the typo in the database resolve the issue? Quick error handling for any matches NOT found would be useful.

rlskoeser commented 5 months ago

@mnaydan the script reports on matches that are not found - I had one where the original page range was slightly different in the spreadsheet than in my copy of the database (which is probably outdated). Updating the incorrect page range in the database would resolve that problem. I thought looking for an exact match would be best.
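
The exact-match behavior described here can be sketched roughly as follows. This is illustrative only, with hypothetical dict-based records standing in for the project's actual Django model lookups:

```python
def match_excerpts(db_rows, csv_rows):
    """Match CSV rows to database rows on (source_id, pages_orig) exactly.

    Returns (matched, not_found). Unmatched CSV rows are reported rather
    than fuzzily matched, so a typo on either side surfaces as "not found"
    instead of silently updating the wrong record. Field names are
    illustrative, not the project's actual schema.
    """
    index = {(r["source_id"], r["pages_orig"]): r for r in db_rows}
    matched, not_found = [], []
    for row in csv_rows:
        key = (row["source_id"], row["pages_orig"])
        if key in index:
            matched.append((index[key], row))
        else:
            not_found.append(row)
    return matched, not_found
```

Requiring an exact match trades convenience for safety: any discrepancy must be resolved by hand (as with the typo above), but the script can never update a record it wasn't meant to.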

mnaydan commented 5 months ago

@rlskoeser okay perfect, I'll update that page range in the database, and if the script reports no unmatched records then we are good.

rlskoeser commented 5 months ago

I ran the script in staging with this CSV file as input (generated from the Google Sheets version as noted above): excerpt_update_changes.csv

Here is how I ran the script and the summary output:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'
No record found for source id uc1.c2641998 and pages_orig 32-33, 66

Updated 119 records. 0 unchanged, 1 not found, 1 error.
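
The "Can't parse chunk" error above suggests the page range is parsed comma-separated chunk by chunk, and a non-numeric value like SUPPRESS fails at that step. A minimal sketch of such a parser (hypothetical, not the command's actual code):

```python
def parse_page_range(pages):
    """Parse a page range like '32-33, 66' into a list of page numbers.

    Raises ValueError on chunks it cannot interpret (e.g. 'SUPPRESS'),
    mirroring the "Can't parse chunk" error in the script output above.
    Illustrative sketch only.
    """
    numbers = []
    for chunk in pages.split(","):
        chunk = chunk.strip()
        if "-" in chunk:
            start, _, end = chunk.partition("-")
            if not (start.strip().isdigit() and end.strip().isdigit()):
                raise ValueError(f"Can't parse chunk '{chunk}'")
            numbers.extend(range(int(start), int(end) + 1))
        elif chunk.isdigit():
            numbers.append(int(chunk))
        else:
            raise ValueError(f"Can't parse chunk '{chunk}'")
    return numbers
```

Raising on unparseable input (rather than skipping silently) is what let the stray SUPPRESS row surface as a reported error instead of a bad update.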

I ran this in staging without refreshing from production because I wanted it to have the recent-ish rsync changes. (I did not run rsync immediately before).

If it's helpful for testing, you could update the original pages for the not found record in the staging database and I can run this script again. I could also run rsync. At some point before we release, we may want to test the full set of steps we will be doing in production (perhaps after we fix excerpt ids): replicate production to staging, rsync, update excerpts.

If I run the script again with the same input, it recognizes that it doesn't need to make changes:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'
No record found for source id uc1.c2641998 and pages_orig 32-33, 66

Updated 0 records. 119 unchanged, 1 not found, 1 error.
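
This idempotent behavior (a second run reports everything unchanged) typically comes from comparing the new value against the stored one before saving. A minimal sketch, with hypothetical dict-based records in place of the real Django model instances:

```python
def apply_updates(matched_pairs):
    """Apply new digital page ranges, skipping records already up to date.

    Each pair is (db_record, csv_row). Returns (updated, unchanged) counts;
    running twice on the same input yields 0 updated the second time,
    matching the script output above. Hypothetical field names.
    """
    updated = unchanged = 0
    for record, row in matched_pairs:
        if record["pages_digital"] == row["new_pages_digital"]:
            unchanged += 1
        else:
            record["pages_digital"] = row["new_pages_digital"]
            updated += 1
    return updated, unchanged
```

Skipping no-op saves also avoids triggering unnecessary reindexing for records that are already correct.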

mnaydan commented 5 months ago

@rlskoeser this is really helpful, thanks! Let me fix the errors in production and staging and then we can re-run... give me a moment.

mnaydan commented 5 months ago

@rlskoeser how is it handling null original page number?

rlskoeser commented 5 months ago

@mnaydan I don't know! Probably not correctly, I didn't know that was a possible case. LMK what it should do - at a minimum I can make sure to filter out full works when we look for matches...

mnaydan commented 5 months ago

@rlskoeser OK, good to know. It's not a full work; it just doesn't have any physical page numbers printed on the pages! I will just fix that one manually on the backend, and then you shouldn't have to change anything with the script. I think it's the only blank.

Edit: I went in to change it and it looks like your script fixed it already in QA! There are no other excerpts associated with that work, just the one.

rlskoeser commented 5 months ago

@mnaydan worth checking what it did when you test the script; if it's the only excerpt from that volume it may have done the right thing.

mnaydan commented 5 months ago

@rlskoeser great minds! Ok, can we run it again? I am expecting the uc1.c2641998 error to be handled now, and 2 additional updated records.

rlskoeser commented 5 months ago

Regenerated a test CSV from the Google Sheets (forgot to exclude the suppressed one) and ran again; here is the output:

$ ./manage.py adjust_excerpts /tmp/excerpt_update_changes_v2.csv
Error saving mdp.39015036664038 (SUPPRESS): Can't parse chunk 'SUPPRESS'

Updated 3 records. 117 unchanged, 0 not found, 1 error.

mnaydan commented 5 months ago

@rlskoeser yay! This is exactly what I expected. Do you want to close and track testing the full set of steps elsewhere?

mnaydan commented 5 months ago

@rlskoeser wait, I just saw your full set of acceptance criteria. I'm clearly getting bleary-eyed after a long week... let me test all those steps.

mnaydan commented 5 months ago

I spot-checked a few records, and everything looks great! I tested a single page, a changed range, an unchanged/correct range, discontinuous page numbers, ark:/ IDs, and the one blank original page range record... they all appear in the database and are indexed as I would expect in all cases.