UW-Madison-DSI / ask-xDD

Retrieval-Augmented Generation (RAG) on 17M full text journal articles.
https://xdd.wisc.edu/
MIT License

Resolve missing docid #116

Open JasonLo opened 5 months ago

JasonLo commented 5 months ago

During our routine weekly data ingestion, we found 3659 docids that can be retrieved successfully through the xDD API endpoint but return 404 (document not found) when we look them up directly in Elasticsearch.
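For reference, the check that surfaces this kind of mismatch can be sketched roughly as below. This is only an illustration, not the code in the notebook: `find_missing_docids`, `in_api`, and `in_es` are hypothetical names, and the two lookups are passed in as plain callables so that no particular API route or Elasticsearch client call is assumed.

```python
from typing import Callable, Iterable, List


def find_missing_docids(
    docids: Iterable[str],
    in_api: Callable[[str], bool],
    in_es: Callable[[str], bool],
) -> List[str]:
    """Return the docids that the API serves but Elasticsearch cannot find.

    `in_api` and `in_es` are hypothetical lookup functions: each takes a
    docid and returns True if that backend has the document.
    """
    return [docid for docid in docids if in_api(docid) and not in_es(docid)]


# Toy stand-ins for the two backends (made-up ids, not real xDD docids).
api_docs = {"a1", "b2", "c3"}
es_docs = {"a1"}

missing = find_missing_docids(
    sorted(api_docs),
    in_api=lambda d: d in api_docs,
    in_es=lambda d: d in es_docs,
)
print(missing)
```

In practice the two callables would wrap the real API request and the real Elasticsearch lookup, treating a 404 as "not found".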

@iross, can you take a look at this notebook on COSMOS1? /hdd/clo36/repo/ask-xDD/notebooks/housekeeping/docids_404.ipynb

iross commented 5 months ago

My gut was a bit wrong here... On Monday I said I suspected this issue was caused by desyncs from duplication clean-ups, but it's really just that the ES7 instance has fallen behind the older instance.

The docids encode when they were added to xDD, so looking at /hdd/clo36/repo/ask-xDD/tmp/docids_404.txt made it clear that they're all recent (except the first one, which remains a small mystery). Looking here, it's clear that nothing new has been added to the newer ES instance since mid-February. At that time, I'd been working on transitioning everything over so that ES7 and the Kubernetes-backed MongoDB were the source of truth, but I paused that transition to stay stable through the ASKEM hackathon and never picked it back up :( .

Next week I'm hoping to finish that cutover and make the ES7 instance the default everywhere, because maintaining two separate instances is a recipe for endless issues like this.

(EDIT: Whoops, just noticed that it was an unsorted list. What I said holds true for most docids in that list, but ~85 appear to be old enough that I would have expected them to exist in ES7, so some digging is still required.)
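A rough sketch of how the list could be sorted and split by age, assuming the docids are 24-character hex MongoDB ObjectId strings (the first 4 bytes of an ObjectId are a Unix timestamp in seconds). The function names, the sample ids, and the cutoff date below are all made up for illustration; the actual date ES7 stopped receiving documents would need to be filled in.

```python
from datetime import datetime, timezone
from typing import Iterable, List, Tuple


def docid_added_at(docid: str) -> datetime:
    """Decode the creation time from an ObjectId-style docid.

    Assumes a 24-character hex string whose first 8 hex digits are a
    big-endian Unix timestamp in seconds.
    """
    if len(docid) != 24:
        raise ValueError(f"expected 24 hex characters, got {docid!r}")
    seconds = int(docid[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


def split_by_cutoff(
    docids: Iterable[str], cutoff: datetime
) -> Tuple[List[str], List[str]]:
    """Sort docids by creation time and split them around `cutoff`.

    Returns (older, newer): `older` ids predate the cutoff and would be
    expected to exist in ES7; `newer` ids are explained by the stalled sync.
    """
    ordered = sorted(docids, key=docid_added_at)
    older = [d for d in ordered if docid_added_at(d) < cutoff]
    newer = [d for d in ordered if docid_added_at(d) >= cutoff]
    return older, newer


def make_docid(seconds: int) -> str:
    """Build a fake docid for the demo: timestamp prefix plus zero padding."""
    return format(seconds, "08x") + "0" * 16


# Hypothetical cutoff and sample ids, deliberately unsorted.
cutoff = datetime(2000, 1, 1, tzinfo=timezone.utc)
sample = [make_docid(1_000_000_000), make_docid(86_400), make_docid(900_000_000)]

older, newer = split_by_cutoff(sample, cutoff)
print(len(older), "older than cutoff;", len(newer), "at or after cutoff")
```

Running the real docids_404.txt through something like this would isolate the ~85 older ids for a closer look.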