Closed: mlissner closed this issue 2 days ago
Looks like we do have an index with this name in our code:
We don't have it in Elastic though:
We do have recap_sweep, but I think that's a different index.
@ERosendo, do you know if we missed a step here?
@mlissner I don't think we missed a step. The index should be recreated by the cronjob:
Oh, there is some funkiness going on here!
I just noticed that the cronjobs aren't finishing. They should have a status of "Completed", but instead they either have "Error" on odd days or "Running" on even days:
Looking in the logs of one that's running (from ten days ago), it says:
INFO Re-indexing task scheduled ID: UkMvqJ2UQQqjcVU9Dxq7_w:768120192
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 8126.203466 seconds.
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 1108154.855345 seconds.
The one from nine days ago has the stacktrace about the index, and the one from eight days ago says:
INFO Re-indexing task scheduled ID: nmLgdM-aSDCkZNmyVoOodg:1550099198
INFO Task progress: 7000/3822206 documents. Estimated time to finish: 32773.302462 seconds.
INFO Task progress: 7000/3822206 documents. Estimated time to finish: 17895189.307496 seconds.
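In each of those runs, the two progress lines show the same document count while the estimate jumps by a couple of orders of magnitude, which suggests the reindex stalled while the wall clock kept running. Here's a minimal sketch of the kind of naive ETA that behaves this way; this is an assumption about how the estimate might be computed, not necessarily what the command actually does:

```python
def estimated_remaining_seconds(processed: int, total: int, elapsed: float) -> float:
    """Naive ETA: remaining documents divided by the average rate so far."""
    rate = processed / elapsed  # docs/second, averaged over the whole elapsed time
    return (total - processed) / rate

# ~60 s in, 22000 docs indexed: ETA ≈ 8,122 s (close to the 8126.2 in the log)
print(estimated_remaining_seconds(22000, 3000097, 60))
# ~2.3 hours later, still 22000 docs indexed: ETA balloons to ≈ 1.1M s
print(estimated_remaining_seconds(22000, 3000097, 8190))
```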
Ideas?
I debugged the cl_send_recap_alerts command and identified two issues:
Excessive Wait Times: We use a while loop and the sleep method to pause execution during the index migration process. We also added a helper method to estimate the remaining time for reindexing (compute_estimated_remaining_time). While this approach is generally effective, it can lead to unintended consequences when the estimated time drastically increases. The first time we ran the command, the initial estimate was 1108154.85 seconds (approximately 12 days), causing the process to remain dormant for an extended period. To mitigate this issue, we should implement an upper bound for the estimated time. This will prevent excessively long wait times and ensure the process continues as expected.
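A minimal sketch of what that upper bound could look like; the MAX_WAIT_SECONDS value and the loop shape here are hypothetical, not the actual command code:

```python
import time

MAX_WAIT_SECONDS = 30 * 60  # hypothetical cap; the real value is a tuning decision

def wait_for_reindex(get_progress) -> None:
    """Poll the reindex task, never sleeping longer than MAX_WAIT_SECONDS at a time."""
    while True:
        processed, total, estimated_seconds = get_progress()
        if processed >= total:
            return
        # Cap the sleep so an inflated estimate can't park the command for days.
        time.sleep(min(estimated_seconds, MAX_WAIT_SECONDS))
```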
Premature Termination of the recap_document_sweep Reindex: Before we can start checking for hits to send/schedule alerts, we need to complete two separate reindexing tasks: one for the recap_sweep index and another for the recap_document_sweep index. To prevent redundant reindexing in case of failures, we use two Redis flags:

- main_re_index_completed: set once we've finished adding documents to the recap_sweep index.
- rd_re_index_completed: set once we've finished adding documents to the recap_document_sweep index.

Our index_daily_recap_documents method checks these flags and returns early if they're set. However, the current checks aren't specific enough, so the main_re_index_completed flag prematurely aborts the recap_document_sweep index creation.
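A minimal sketch of a more index-specific check; the function signature and the in-memory flag store are placeholders (the real command talks to Redis), and only the flag names come from the description above:

```python
# Placeholder for the Redis connection the real command uses.
completed_flags: set[str] = set()

SWEEP_FLAGS = {
    "recap_sweep": "main_re_index_completed",
    "recap_document_sweep": "rd_re_index_completed",
}

def index_daily_recap_documents(target_index: str) -> None:
    """Reindex one sweep index, skipping only if *its own* flag is set."""
    flag = SWEEP_FLAGS[target_index]
    # Check the flag for this specific index rather than any completed flag,
    # so finishing recap_sweep can't short-circuit the recap_document_sweep run.
    if flag in completed_flags:
        return
    # ... reindex `target_index` here ...
    completed_flags.add(flag)
```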
Just saw this in k8s logs:
(I used ChatGPT to reformat that since k9s is making it hard, so it may have hallucinations, but I think it looks OK.)
Looks like it wants an index called recap_document_sweep, which we lack. Need to investigate.