argonne-lcf / balsam

High throughput workflows and automation for HPC

bulk_update of jobs fails intermittently #383

Open cms21 opened 11 months ago

cms21 commented 11 months ago

This has been encountered by a few users. There are times when bulk updates of job statuses fail, for example when advancing jobs from STAGED_IN to PREPROCESSED or from RUN_DONE to POSTPROCESSED. I've not been able to figure out what triggers the issue, but once a site encounters it, it persists for all jobs in the site. Restarting the site and/or restarting the server does not help. The only solution I've found is to update jobs individually, but this is tedious, and the site will typically continue to have the issue for new jobs.

On the client side, logs contain this error:

2023-08-31 18:36:00.734 | 29265 | INFO | balsam.site.job_source:142] JobSource acquired 5 jobs:
2023-08-31 18:36:01.498 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (0 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/
2023-08-31 18:36:03.305 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (1 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/
2023-08-31 18:36:06.328 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (2 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/
2023-08-31 18:36:08.263 | 29265 | INFO | balsam.site.job_source:142] JobSource acquired 30 jobs:
2023-08-31 18:36:11.182 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (3 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/

It will continue retrying but never succeed. The only way to resolve it is to change the job states one at a time with job.save(). Extending the number of retries does not help.
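The fallback pattern is simple enough to sketch generically. This is not Balsam code: `bulk_update` and `save_one` below are hypothetical callables standing in for the client calls (the real workaround updates Job objects and calls `job.save()` one at a time, as described above).

```python
# Sketch of the workaround: try the bulk update first, and if the server
# keeps returning 500s, fall back to saving each job individually.
# `bulk_update` and `save_one` are placeholders, not the Balsam API.

def update_states(jobs, new_state, bulk_update, save_one):
    """Advance `jobs` (dicts with a "state" key) to `new_state`,
    falling back to per-job saves if the bulk call fails."""
    for job in jobs:
        job["state"] = new_state
    try:
        bulk_update(jobs)           # one request covering all jobs
        return "bulk"
    except RuntimeError:            # e.g. repeated 500 Internal Server Error
        for job in jobs:            # tedious but reliable fallback
            save_one(job)
        return "individual"
```

The point of the sketch is only that the per-job path exercises a different server code path than `bulk_update`, which is why it still succeeds when the bulk endpoint 500s.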

On the server side, logs contain this error:

2023-08-31 04:04:48.457 | 38 | ERROR | balsam.server.utils.timer:75] Exception handling request balsam.server.routers.jobs.bulk_update
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/memory.py", line 94, in receive
    return self.receive_nowait()
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/memory.py", line 89, in receive_nowait
    raise WouldBlock
anyio.WouldBlock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/balsam/balsam/server/utils/timer.py", line 72, in dispatch
    response = await call_next(request)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 35, in call_next
    message = await recv_stream.receive()
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/memory.py", line 102, in receive
    await receive_event.wait()
asyncio.exceptions.CancelledError

I'm not sure what the server-side error indicates, but that's all that's apparent in server-balsam.log: the same error over and over. Perhaps @masalim2 has some ideas?

masalim2 commented 11 months ago

I think this issue stems from the timer middleware, as this part of the traceback is in our code stack:

  File "/balsam/balsam/server/utils/timer.py", line 72, in dispatch
    response = await call_next(request)

This middleware was only there to log request times, which was handy for optimizing queries, but it can be removed with no other detriment. I believe this is a known issue that has been fixed in more recent versions of fastapi (and its underlying web framework, starlette).
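For context, all a request-timing middleware does is wrap the next handler, measure wall-clock time around it, and record the duration. Stripped of the Starlette specifics, the pattern is roughly this (a pure-Python sketch, not the actual code in balsam/server/utils/timer.py, which uses Starlette's BaseHTTPMiddleware and `await call_next(request)`, the layer implicated in the traceback above):

```python
import time

def timed(handler, log):
    """Wrap `handler`, appending (request, elapsed_seconds) to `log`
    for every call -- the synchronous analogue of the timer middleware."""
    def wrapper(request):
        start = time.perf_counter()
        try:
            return handler(request)          # analogous to call_next(request)
        finally:
            elapsed = time.perf_counter() - start
            log.append((request, elapsed))   # the request-time logging
    return wrapper
```

Because the wrapper only observes the call, dropping it changes nothing about request handling, which is why removing the middleware is safe.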

I would suggest trying a couple of things:

  1. Remove the timer middleware by removing the call to add_middleware in the server: https://github.com/argonne-lcf/balsam/blob/main/balsam/server/main.py#L117.
  2. Update the container software versions (discussion below)

To do the container software update, I would bump the versions in requirements/deploy.txt, since that's what the Dockerfile installs dependencies from. They are currently pinned to pretty old versions.
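As a concrete illustration of the kind of change (the version strings below are placeholders, not tested pins; check PyPI for current releases and re-test the server against them):

```text
# requirements/deploy.txt (fragment) -- illustrative only.
# Replace the old pins for the web stack with recent releases:
fastapi==<latest>
starlette==<latest>
```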