Closed: mlissner closed this issue 2 days ago
Looks like we do have an index with this name in our code:
We don't have it in Elastic though:
We do have recap_sweep, but I think that's a different index.
@ERosendo, do you know if we missed a step here?
@mlissner I don't think we missed a step. The index should be recreated by the cronjob:
Oh, there is some funkiness going on here!
I just noticed that the cronjobs aren't finishing. They should have a status of "Completed", but instead they either have "Error" on odd days or "Running" on even days:
Looking in the logs of one that's running (from ten days ago), it says:
INFO Re-indexing task scheduled ID: UkMvqJ2UQQqjcVU9Dxq7_w:768120192
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 8126.203466 seconds.
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 1108154.855345 seconds.
The one from nine days ago has the stacktrace about the index, and the one from eight days ago says:
INFO Re-indexing task scheduled ID: nmLgdM-aSDCkZNmyVoOodg:1550099198
INFO Task progress: 7000/3822206 documents. Estimated time to finish: 32773.302462 seconds.
INFO Task progress: 7000/3822206 documents. Estimated time to finish: 17895189.307496 seconds.
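In each of those runs, the two progress lines show the same document count while the estimate jumps by a couple of orders of magnitude, which suggests the reindex stalled while the wall clock kept running. Here's a minimal sketch of the kind of naive ETA that behaves this way; this is an assumption about how the estimate might be computed, not necessarily what the command actually does:

```python
def estimated_remaining_seconds(processed: int, total: int, elapsed: float) -> float:
    """Naive ETA: remaining documents divided by the average rate so far."""
    rate = processed / elapsed  # docs/second, averaged over the whole elapsed time
    return (total - processed) / rate

# ~60 s in, 22000 docs indexed: ETA ≈ 8,122 s (close to the 8126.2 in the log)
print(estimated_remaining_seconds(22000, 3000097, 60))
# ~2.3 hours later, still 22000 docs indexed: ETA balloons to ≈ 1.1M s
print(estimated_remaining_seconds(22000, 3000097, 8190))
```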
Ideas?
I debugged the cl_send_recap_alerts command and identified two issues:
Excessive Wait Times: We use a while loop and the sleep method to pause execution during the index migration process. We also added a helper method to estimate the remaining time for reindexing (compute_estimated_remaining_time). While this approach is generally effective, it can lead to unintended consequences when the estimated time drastically increases. The first time we ran the command, the initial estimate was 1108154.85 seconds (approximately 12 days), causing the process to remain dormant for an extended period. To mitigate this issue, we should implement an upper bound for the estimated time. This will prevent excessively long wait times and ensure the process continues as expected.
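A minimal sketch of what that upper bound could look like; the MAX_WAIT_SECONDS value and the loop shape here are hypothetical, not the actual command code:

```python
import time

MAX_WAIT_SECONDS = 30 * 60  # hypothetical cap; the real value is a tuning decision

def wait_for_reindex(get_progress) -> None:
    """Poll the reindex task, never sleeping longer than MAX_WAIT_SECONDS at a time."""
    while True:
        processed, total, estimated_seconds = get_progress()
        if processed >= total:
            return
        # Cap the sleep so an inflated estimate can't park the command for days.
        time.sleep(min(estimated_seconds, MAX_WAIT_SECONDS))
```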
Premature Termination of the recap_document_sweep Reindex: Before we can start checking for hits to send/schedule alerts, we need to complete two separate reindexing tasks: one for the recap_sweep index and another for the recap_document_sweep index. To prevent redundant reindexing in case of failures, we use two Redis flags:

- main_re_index_completed: set once we've finished adding documents to the recap_sweep index.
- rd_re_index_completed: set once we've finished adding documents to the recap_document_sweep index.

Our index_daily_recap_documents method checks these flags and returns early if they're set. However, the current checks aren't specific enough, so the main_re_index_completed flag prematurely aborts the recap_document_sweep index creation.
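A minimal sketch of a more index-specific check; the function signature and the in-memory flag store are placeholders (the real command talks to Redis), and only the flag names come from the description above:

```python
# Placeholder for the Redis connection the real command uses.
completed_flags: set[str] = set()

SWEEP_FLAGS = {
    "recap_sweep": "main_re_index_completed",
    "recap_document_sweep": "rd_re_index_completed",
}

def index_daily_recap_documents(target_index: str) -> None:
    """Reindex one sweep index, skipping only if *its own* flag is set."""
    flag = SWEEP_FLAGS[target_index]
    # Check the flag for this specific index rather than any completed flag,
    # so finishing recap_sweep can't short-circuit the recap_document_sweep run.
    if flag in completed_flags:
        return
    # ... reindex `target_index` here ...
    completed_flags.add(flag)
```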
Just saw this in k8s logs:
(I used ChatGPT to reformat that since k9s is making it hard, so it may have hallucinations, but I think it looks OK.)
Looks like it wants an index called recap_document_sweep, which we lack. Need to investigate.