argonne-lcf / balsam

High throughput workflows and automation for HPC

bulk_update of jobs fails intermittently #383

Open cms21 opened 11 months ago

cms21 commented 11 months ago

This has been encountered by a few users. There are times when bulk updates of job statuses fail, for example when advancing jobs from STAGED_IN to PREPROCESSED or from RUN_DONE to POSTPROCESSED. I've not been able to figure out what triggers the issue, but once a site encounters it, it persists for all jobs in the site. Restarting the site and/or restarting the server does not help. The only solution I've found is to update jobs individually, but this is tedious, and the site will typically continue to have the issue for new jobs.

On the client side, logs contain this error:

2023-08-31 18:36:00.734 | 29265 | INFO | balsam.site.job_source:142] JobSource acquired 5 jobs:
2023-08-31 18:36:01.498 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (0 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/
2023-08-31 18:36:03.305 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (1 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/
2023-08-31 18:36:06.328 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (2 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/
2023-08-31 18:36:08.263 | 29265 | INFO | balsam.site.job_source:142] JobSource acquired 30 jobs:
2023-08-31 18:36:11.182 | 29262 | WARNING | balsam.client.requests_client:109] Attempt retry (3 of 10) of connection: 500 Server Error: Internal Server Error for url: https://balsam-dev.alcf.anl.gov/jobs/

It will continue retrying but never succeed. The only way to resolve it is to change the job states one at a time with job.save(). Extending the number of retries does not help.
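The fallback pattern is simple enough to sketch generically. This is not Balsam code: `bulk_update` and `save_one` below are hypothetical callables standing in for the client calls (the real workaround updates Job objects and calls `job.save()` one at a time, as described above).

```python
# Sketch of the workaround: try the bulk update first, and if the server
# keeps returning 500s, fall back to saving each job individually.
# `bulk_update` and `save_one` are placeholders, not the Balsam API.

def update_states(jobs, new_state, bulk_update, save_one):
    """Advance `jobs` (dicts with a "state" key) to `new_state`,
    falling back to per-job saves if the bulk call fails."""
    for job in jobs:
        job["state"] = new_state
    try:
        bulk_update(jobs)           # one request covering all jobs
        return "bulk"
    except RuntimeError:            # e.g. repeated 500 Internal Server Error
        for job in jobs:            # tedious but reliable fallback
            save_one(job)
        return "individual"
```

The point of the sketch is only that the per-job path exercises a different server code path than `bulk_update`, which is why it still succeeds when the bulk endpoint 500s.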

On the server side, logs contain this error:

2023-08-31 04:04:48.457 | 38 | ERROR | balsam.server.utils.timer:75] Exception handling request balsam.server.routers.jobs.bulk_update
Traceback (most recent call last):
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/memory.py", line 94, in receive
    return self.receive_nowait()
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/memory.py", line 89, in receive_nowait
    raise WouldBlock
anyio.WouldBlock

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/balsam/balsam/server/utils/timer.py", line 72, in dispatch
    response = await call_next(request)
  File "/usr/local/lib/python3.10/site-packages/starlette/middleware/base.py", line 35, in call_next
    message = await recv_stream.receive()
  File "/usr/local/lib/python3.10/site-packages/anyio/streams/memory.py", line 102, in receive
    await receive_event.wait()
asyncio.exceptions.CancelledError

I'm not sure what the server-side error indicates, but that's all that's apparent in server-balsam.log: the same error over and over. Perhaps @masalim2 has some ideas?

masalim2 commented 11 months ago

I think this issue stems from the timer middleware, as this part of the traceback is in our code stack:

  File "/balsam/balsam/server/utils/timer.py", line 72, in dispatch
    response = await call_next(request)

This middleware was only there to log request times, which was handy for optimizing queries, but it can be removed with no other detriment. I believe this is a known issue that has been fixed in more recent versions of fastapi (and its underlying web framework, starlette).
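For context, all a request-timing middleware does is wrap the next handler, measure wall-clock time around it, and record the duration. Stripped of the Starlette specifics, the pattern is roughly this (a pure-Python sketch, not the actual code in balsam/server/utils/timer.py, which uses Starlette's BaseHTTPMiddleware and `await call_next(request)`, the layer implicated in the traceback above):

```python
import time

def timed(handler, log):
    """Wrap `handler`, appending (request, elapsed_seconds) to `log`
    for every call -- the synchronous analogue of the timer middleware."""
    def wrapper(request):
        start = time.perf_counter()
        try:
            return handler(request)          # analogous to call_next(request)
        finally:
            elapsed = time.perf_counter() - start
            log.append((request, elapsed))   # the request-time logging
    return wrapper
```

Because the wrapper only observes the call, dropping it changes nothing about request handling, which is why removing the middleware is safe.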

I would suggest trying a couple of things:

  1. Remove the timer middleware by removing the call to add_middleware in the server: https://github.com/argonne-lcf/balsam/blob/main/balsam/server/main.py#L117.
  2. Update the container software versions (discussion below)

To do the container software update, I would bump the versions in requirements/deploy.txt, since that's what the Dockerfile installs dependencies from. They are currently pinned to pretty old versions.
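As a concrete illustration of the kind of change (the version strings below are placeholders, not tested pins; check PyPI for current releases and re-test the server against them):

```text
# requirements/deploy.txt (fragment) -- illustrative only.
# Replace the old pins for the web stack with recent releases:
fastapi==<latest>
starlette==<latest>
```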