emory-libraries / blacklight-catalog

Deleted Alma records from Dec. 2021 still in Blacklight index #1224

Closed by eporter23 2 years ago

eporter23 commented 2 years ago

We've been responding to user feedback about records with no holdings appearing in Blacklight. When this happens, it also causes search results not to load availability (#1213). Sofia has also been working to delete or suppress records with no holdings. She shared a report with me that contains deleted records from the past year (65,000 records).

We see a pattern in early December 2021 where many deleted records were not removed from SOLR. Possibly related work at this time was happening in #1060. While it is not feasible to spot check all of these in December (~1600 records), samples for each day confirm that deleted records from the following time range in particular are still in SOLR: 12/1/21-12/9/21

What is our best option for reindexing these?

Here is a filtered list of IDs for the month of December 2021 specifically since that seems to be a particular problem area.

abelemlih commented 2 years ago

@eporter23 @lovinscari All documents in the spreadsheet, whether listed as "exists" or not classified, were reindexed individually and should now be deleted from Solr.

abelemlih commented 2 years ago

@lovinscari @eporter23 I had a conversation with @lisahamlett, and we agreed that deleted records appearing in Solr could be the result of incremental indexing missing recently deleted items. For example, suppose an item is deleted and that change is published to OAI while an incremental indexing run is in progress for a given time window. If the run does not pick up that change, it will never be picked up, because the next run covers only the following time window and will never request that data from OAI.
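
To make the suspected gap concrete, here is a minimal Python sketch of the scenario described above. It is a toy model, not the catalog's actual indexing code: the `Change` record, the `visible_at` lag, and the `overlap` parameter are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    record_id: str
    stamp: int       # timestamp the feed assigns to the change
    visible_at: int  # hypothetical: when the change can actually be harvested
    deleted: bool


def harvest(feed, start, end, now):
    """Return changes stamped in [start, end) that are visible at time `now`."""
    return [c for c in feed if start <= c.stamp < end and c.visible_at <= now]


def run_incremental(feed, run_times, overlap=0):
    """Simulate incremental runs; each run starts where the last one ended."""
    index = {"A", "B"}  # both records start out in the index
    cursor = 0
    for now in run_times:
        window_start = max(0, cursor - overlap)
        for change in harvest(feed, window_start, now, now):
            if change.deleted:
                index.discard(change.record_id)
            else:
                index.add(change.record_id)
        cursor = now  # the next run never looks before this point
    return index


FEED = [
    Change("A", stamp=5, visible_at=5, deleted=True),
    # Stamped inside the first window, but only harvestable after that run.
    Change("B", stamp=9, visible_at=11, deleted=True),
]

print(run_incremental(FEED, run_times=[10, 20]))             # B is never removed
print(run_incremental(FEED, run_times=[10, 20], overlap=5))  # B is caught on the second run
```

In this model, record B stays in the index forever unless a later run re-reads part of the earlier window.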

One possible solution for this issue in the long term is running a separate indexing cron job every day at midnight that reindexes all items from the previous day. One issue with this solution is the duplicate reindexing of items that were accounted for during incremental indexing, but in most cases we are not reindexing a large number of items during incremental indexing, so it should not be an issue. Another solution is using two Solr instances, which @rotated8 mentioned is a solution we are looking to implement in the long term, but that will involve input from the devops team.
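
As a rough sketch of the nightly catch-up idea, the Python below computes the previous day's window and builds an OAI-PMH `ListRecords` request for it. The `from` and `until` arguments are standard OAI-PMH, but the base URL, the `marc21` metadata prefix, and the use of UTC midnight are assumptions, not details confirmed in this thread.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import urlencode

OAI_TIME_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def previous_day_window(now):
    """Return (from, until) strings covering the whole previous UTC day."""
    midnight = now.astimezone(timezone.utc).replace(
        hour=0, minute=0, second=0, microsecond=0
    )
    start = midnight - timedelta(days=1)
    return start.strftime(OAI_TIME_FORMAT), midnight.strftime(OAI_TIME_FORMAT)


def list_records_url(base_url, now, metadata_prefix="marc21"):
    """Build a ListRecords URL for the previous day (prefix is an assumption)."""
    start, end = previous_day_window(now)
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": start,
        "until": end,
    }
    return f"{base_url}?{urlencode(params)}"


if __name__ == "__main__":
    # Hypothetical endpoint; substitute the real OAI base URL.
    run_time = datetime(2022, 2, 8, 0, 5, tzinfo=timezone.utc)
    print(list_records_url("https://example.org/oai", run_time))
```

Because OAI-PMH treats both `from` and `until` as inclusive, consecutive nightly windows overlap at midnight. That only causes a harmless duplicate reindex, consistent with the trade-off noted above.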

Let me know what solution you prefer we move forward with, and please reach out to me if you have any questions about the feedback above.

lovinscari commented 2 years ago

@eporter23 - Let's review this during our 1-2-1 meeting on 2-8-22

lovinscari commented 2 years ago

Reviewed with @eporter23, and we will add a new ticket for the solution.