freelawproject / courtlistener

A fully-searchable and accessible archive of court data including growing repositories of opinions, oral arguments, judges, judicial financial records, and federal filings.
https://www.courtlistener.com

RECAP Sweep Cronjob failing with missing index? #4646

Open mlissner opened 5 hours ago

mlissner commented 5 hours ago

Just saw this in k8s logs:

Traceback (most recent call last):
  File "/opt/courtlistener/manage.py", line 15, in <module>
    main()
  File "/opt/courtlistener/manage.py", line 11, in main
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python3.13/site-packages/django/core/management/__init__.py", line 442, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python3.13/site-packages/django/core/management/__init__.py", line 436, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python3.13/site-packages/django/core/management/base.py", line 413, in run_from_argv
    self.execute(*args, **cmd_options)
  File "/usr/local/lib/python3.13/site-packages/django/core/management/base.py", line 459, in execute
    output = self.handle(*args, **options)
  File "/opt/courtlistener/cl/alerts/management/commands/cl_send_recap_alerts.py", line 730, in handle
    query_and_send_alerts(r, Alert.REAL_TIME, query_date)
  File "/opt/courtlistener/cl/alerts/management/commands/cl_send_recap_alerts.py", line 577, in query_and_send_alerts
    results, parent_results, child_results = query_alerts(search_params)
  File "/opt/courtlistener/cl/alerts/management/commands/cl_send_recap_alerts.py", line 445, in query_alerts
    return do_es_sweep_alert_query(search_query, child_search_query, search_params)
  File "/opt/courtlistener/cl/lib/elasticsearch_utils.py", line 3235, in do_es_sweep_alert_query
    responses = multi_search.execute()
  File "/usr/local/lib/python3.13/site-packages/elasticsearch_dsl/search.py", line 831, in execute
    raise ApiError("N/A", meta=responses.meta, body=r)

elasticsearch.ApiError: ApiError(200, 'N/A', 'no such index [recap_document_sweep]', recap_document_sweep, index_or_alias)     

(I used ChatGPT to reformat that traceback since k9s makes it hard to copy, so it may contain hallucinations, but I think it looks OK.)

Looks like it wants an index called recap_document_sweep, which we lack. Need to investigate.
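Not part of the original report, but a quick way to confirm which of the expected sweep indices the cluster actually has, assuming the standard elasticsearch-py client (the helper name and the list of index names are taken from the traceback and discussion, not from the codebase):

```python
def missing_sweep_indices(client, names=("recap_sweep", "recap_document_sweep")):
    """Return the expected sweep indices that the cluster lacks.

    `client` is any elasticsearch-py Elasticsearch instance; only
    client.indices.exists(index=...) is called.
    """
    return [name for name in names if not client.indices.exists(index=name)]
```

Run against the cluster (e.g. `missing_sweep_indices(Elasticsearch("http://localhost:9200"))`), this would list `recap_document_sweep` if it is indeed absent.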

mlissner commented 5 hours ago

Looks like we do have an index with this name in our code:

https://github.com/freelawproject/courtlistener/blob/40a771e862aaa0a2a6582c9fd8f335dfff9d3b8c/cl/search/documents.py#L1857-L1865

We don't have it in Elastic though:

[screenshot: list of indices in the Elasticsearch cluster]

We do have recap_sweep, but I think that's a different index.

@ERosendo, do you know if we missed a step here?

ERosendo commented 5 hours ago

@mlissner I don't think we missed a step. The index should be recreated by the cronjob:

https://github.com/freelawproject/courtlistener/blob/40a771e862aaa0a2a6582c9fd8f335dfff9d3b8c/cl/alerts/management/commands/cl_send_recap_alerts.py#L294-L297
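For reference, dropping and recreating the sweep index at the start of a run would look roughly like this with the low-level elasticsearch-py client. This is a hypothetical sketch, not the command's actual code: the real command presumably goes through the project's Document classes, and the index settings/mappings are elided here.

```python
def recreate_sweep_index(client, name="recap_document_sweep"):
    """Drop and recreate a sweep index so each run starts from a clean slate.

    Hypothetical helper: the real cronjob recreates the index through its
    elasticsearch_dsl Document classes rather than raw index calls, and
    would supply the index mappings/settings (elided here).
    """
    if client.indices.exists(index=name):
        client.indices.delete(index=name)
    client.indices.create(index=name)
```

If the cronjob is killed or errors out before reaching this step, the next run's queries against `recap_document_sweep` would fail exactly as in the traceback above.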

mlissner commented 5 hours ago

Oh, there is some funkiness going on here!

I just noticed that the cronjobs aren't finishing. They should have a status of "Completed", but instead they show "Error" on odd days and "Running" on even days:

[screenshot: cronjob statuses]

Looking at the logs of one that's still running (from ten days ago), it says:

INFO Re-indexing task scheduled ID: UkMvqJ2UQQqjcVU9Dxq7_w:768120192
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 8126.203466 seconds.
INFO Task progress: 22000/3000097 documents. Estimated time to finish: 1108154.855345 seconds.   

The one from nine days ago has the stack trace about the missing index, and the one from eight days ago says:

INFO Re-indexing task scheduled ID: nmLgdM-aSDCkZNmyVoOodg:1550099198
INFO Task progress: 7000/3822206 documents. Estimated time to finish: 32773.302462 seconds.
INFO Task progress: 7000/3822206 documents. Estimated time to finish: 17895189.307496 seconds.
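The huge jump between consecutive estimates is consistent with a rate-based ETA computed over a stalled counter: the document count stops moving while wall-clock time keeps accruing, so remaining/rate explodes. A self-contained sketch of that arithmetic (the formula is my assumption about how the command derives its estimate, not code from the repo):

```python
def eta_seconds(done: int, total: int, elapsed_s: float) -> float:
    """Naive rate-based estimate: remaining docs / observed docs-per-second."""
    rate = done / elapsed_s          # average throughput so far
    return (total - done) / rate     # blows up if `done` stops advancing

# With the counter stuck at 7000/3822206, the estimate grows in direct
# proportion to elapsed time rather than converging:
early = eta_seconds(7000, 3822206, 60.0)     # one minute in
later = eta_seconds(7000, 3822206, 3600.0)   # an hour in, same count
```

So the second, much larger figure in each log pair suggests the underlying reindex task made no progress between the two log lines.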

Ideas?