Princeton-CDH / ppa-django

Princeton Prosody Archive v3.x - Python/Django web application
http://prosody.princeton.edu
Apache License 2.0

Thumbnails and snippets occasionally mismatched #574

Closed: mnaydan closed this issue 6 months ago

mnaydan commented 8 months ago

Describe the bug: When running a keyword search, the returned thumbnail image does not always match the returned snippet.

To reproduce:

  1. Go to https://prosody.princeton.edu/archive/mdp.39015021952943/?query=%22Charlotte%20Smith%22
  2. Scroll down to "Search within Volume" result
  3. See the error: the thumbnail shows a piece of poetry from p. 4, the snippet with highlighted keywords is from p. 2, and the page label says p. 20.

Expected behavior: The thumbnail image, snippet, and page label should all match.
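A minimal sketch of that expectation as a check, in case it helps when testing known mismatches. The `SearchHit` type and its field names are hypothetical and are not taken from the ppa-django codebase; the page values for the reported case come from the reproduction steps above.

```python
from typing import NamedTuple


class SearchHit(NamedTuple):
    """Hypothetical stand-in for one "Search within Volume" result."""

    thumbnail_page: str  # page the thumbnail image was taken from
    snippet_page: str    # page the highlighted snippet text came from
    page_label: str      # page label displayed next to the result


def is_consistent(hit: SearchHit) -> bool:
    """Return True only when all three parts point at the same page."""
    return len(set(hit)) == 1


# Values reported in the steps above: thumbnail p. 4, snippet p. 2, label p. 20.
reported = SearchHit(thumbnail_page="4", snippet_page="2", page_label="20")

# An illustrative result where everything agrees (page number is made up).
matching = SearchHit(thumbnail_page="12", snippet_page="12", page_label="12")

print(is_consistent(reported))
print(is_consistent(matching))
```

The check only compares the three values with each other; it cannot tell which of them, if any, is the correct page.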

Additional context: We suspect this bug may be related to the HathiTrust rescanning problem, which changes the digital page sequencing.
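To illustrate the suspicion, here is a toy sketch of how a change in page sequencing could produce this kind of mismatch. It is not code from ppa-django: the page data, function names, and the idea that text is looked up from an older index while images are fetched from the current scan are all assumptions made for the example.

```python
def build_text_index(pages):
    """Map sequence number (1-based) to page text, as indexed at one point in time."""
    return {seq: page["text"] for seq, page in enumerate(pages, start=1)}


def find_hit(index, keyword):
    """Return the first sequence number whose indexed text contains the keyword."""
    return next(seq for seq, text in index.items() if keyword in text)


def image_for(pages, seq):
    """Fetch the image at a sequence number from the current scan."""
    return pages[seq - 1]["image"]


# Hypothetical volume as originally scanned and indexed.
original_scan = [
    {"text": "title page", "image": "title.jpg"},
    {"text": "sonnet by Charlotte Smith", "image": "sonnet.jpg"},
]
stale_index = build_text_index(original_scan)

# Hypothetical rescan that adds a page at the front, shifting every sequence number.
rescanned = [{"text": "", "image": "cover.jpg"}] + original_scan

# Stale index + current images: the snippet and the thumbnail disagree.
stale_seq = find_hit(stale_index, "Charlotte Smith")
stale_image = image_for(rescanned, stale_seq)

# After re-syncing the data and reindexing, both come from the same sequence.
fresh_index = build_text_index(rescanned)
fresh_seq = find_hit(fresh_index, "Charlotte Smith")
fresh_image = image_for(rescanned, fresh_seq)

print(stale_seq, stale_image)
print(fresh_seq, fresh_image)
```

If this is roughly what is happening, refreshing the local copy of the volume and reindexing its pages should bring the image, text, and label back into agreement.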

mnaydan commented 6 months ago

We suspect that updating rsync (#453) might actually fix this problem. We can check against known mismatches like the one described above.

rlskoeser commented 6 months ago

I wanted to test our theory, so I ran the hathi_rsync script on this id and then reindexed pages, and it corrected the problem for the search linked in the example.

(I did this in production without thinking about it because I was doing some other indexing fixes... 🤦‍♀️ Glad it didn't cause any problems, and I'll stop now!)

rlskoeser commented 6 months ago

@mnaydan maybe this should be in review - do you want to test the record linked above on the production site? Do you have any other records with known mismatches that you want to test with? I'd be glad to run the rsync + reindex on a subset of records. Or, maybe we could just test those records in tandem with testing the hathi rsync script - it looks to me like that script might be ready to test on the staging site, once we synchronize the data between production and staging so we can compare.

mnaydan commented 6 months ago

@rlskoeser I tested the above URL and it does appear fixed (there is still something wonky with the page label, but that seems to be on HT's end -- the physical page says 2, but HT's reader says the physical page is 6). The only other case I know of off the top of my head is https://test-prosody.cdh.princeton.edu/archive/mdp.39015060429332-p150/?query=his, which is pulling the right image (p. 96) but the wrong page label (p. 93) and the wrong text (p. 92). If you wanna try rerunning the rsync & reindex on that ID, I can test.

mnaydan commented 6 months ago

The page image, label, and text for the Longfellow link all match now as expected!