Open cms21 opened 11 months ago
I think this issue stems from the timer middleware, as this part of the traceback is in our code stack:
File "/balsam/balsam/server/utils/timer.py", line 72, in dispatch
response = await call_next(request)
This middleware was only there to log request times, which was handy for optimizing queries, but it can be removed with no other detriment. I believe this is a known issue that has been solved in more recent versions of fastapi (and of starlette, the framework it is built on).
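If the request timing is still wanted, one possible alternative to removing the timer is a pure ASGI middleware. The sketch below is not the code in `balsam/server/utils/timer.py`; it assumes the existing timer is built on a `dispatch`/`call_next` style middleware (as the traceback suggests) and shows a hypothetical `TimerMiddleware` class, with a hypothetical logger name, that avoids `call_next` entirely.

```python
import logging
import time

# Hypothetical logger name, chosen for this sketch only.
logger = logging.getLogger("request_timer")


class TimerMiddleware:
    """Pure ASGI middleware that logs how long each HTTP request takes."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        # Pass non-HTTP traffic (lifespan, websocket) straight through.
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return

        start = time.perf_counter()
        try:
            await self.app(scope, receive, send)
        finally:
            # Log even if the wrapped app raised, so slow failures are visible.
            elapsed = time.perf_counter() - start
            logger.info(
                "%s %s took %.3f s",
                scope.get("method"),
                scope.get("path"),
                elapsed,
            )
```

A class with this shape can be registered with `app.add_middleware(TimerMiddleware)` in place of the current timer, although whether that resolves the error here is untested.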
I would suggest trying a couple of things:

- Remove (or comment out) the timer middleware's `add_middleware` call in the server: https://github.com/argonne-lcf/balsam/blob/main/balsam/server/main.py#L117
- Update the software in the container. I would update the versions in `requirements/deploy.txt`, because that is what the Dockerfile installs dependencies from. They are currently pinned to pretty old versions.
  1. Run `pip install -r requirements/deploy.txt` in a fresh, empty virtual environment, then `pip freeze > old_environment.txt`.
  2. Run `pip install --upgrade fastapi starlette`.
  3. Run `pip freeze > new_environment.txt`.
  4. Update `requirements/deploy.txt` to match the corresponding versions in your upgraded development environment. This will probably mean updating the versions of `fastapi`, `starlette`, and `anyio` (the concurrency library used by starlette). It could also be good to update `uvicorn[standard]` if it isn't already.
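To see which pins actually moved between the two `pip freeze` snapshots, a small comparison script can help. This is a minimal sketch, assuming both files contain plain `name==version` lines; the helper names `parse_freeze`, `changed_pins`, and `report` are made up for illustration.

```python
from pathlib import Path


def parse_freeze(text):
    """Map package name -> pinned version from `pip freeze` output."""
    pins = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and entries without `==` (e.g. editable installs).
        if not line or line.startswith("#") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.lower()] = version
    return pins


def changed_pins(old_text, new_text):
    """Return {name: (old_version, new_version)} for new or changed packages."""
    old = parse_freeze(old_text)
    new = parse_freeze(new_text)
    return {
        name: (old.get(name), new[name])
        for name in sorted(new)
        if old.get(name) != new[name]
    }


def report(old_path, new_path):
    """Print one line per package whose version differs between the two files."""
    changes = changed_pins(Path(old_path).read_text(), Path(new_path).read_text())
    for name, (before, after) in changes.items():
        print(f"{name}: {before} -> {after}")
```

Calling `report("old_environment.txt", "new_environment.txt")` lists the packages whose pins in `requirements/deploy.txt` are candidates for updating.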
This has been encountered by a few users. At times, bulk updates of job statuses fail, for example when advancing jobs from STAGED_IN to PREPROCESSED or from RUN_DONE to POSTPROCESSED. I've not been able to figure out what triggers it, but once a site encounters the issue, it persists for all jobs in the site. Restarting the site and/or the server does not help. The only workaround I've found is to update jobs individually, but this is tedious, and the site typically continues to have the issue for new jobs.
On the client side, logs contain this error:
It will continue retrying but never succeed. The only way to resolve it is to change the job states one at a time with `job.save()`. Extending the number of retries does not help.

On the server side, logs contain this error:
I'm not sure what the server-side error indicates, but that's all that's apparent in `server-balsam.log`: this same error over and over. Perhaps @masalim2 has some ideas?
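For anyone else hitting this, the one-at-a-time workaround described above can be scripted. This is only a sketch: the helper name `save_individually` is made up, and it assumes each job object exposes a writable `state` attribute alongside the `job.save()` method mentioned above. How the affected jobs are queried is left out.

```python
def save_individually(jobs, new_state):
    """Set `new_state` on each job and save it on its own.

    Returns a list of (job, exception) pairs for jobs that failed to save,
    so one bad job does not stop the rest from being updated.
    """
    failed = []
    for job in jobs:
        # Assumption: job objects have a writable `state` attribute.
        job.state = new_state
        try:
            job.save()  # one request per job instead of a bulk update
        except Exception as exc:
            failed.append((job, exc))
    return failed
```

For example, `save_individually(stuck_jobs, "PREPROCESSED")` would advance a collection of jobs stuck in STAGED_IN, where `stuck_jobs` is whatever iterable of job objects you have already fetched.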