Open ihsaan-ullah opened 6 months ago
TODO P1 development:
The "Submitted" count depends on the queue of the competition (including for default queue).
Problem: sometimes it's stuck on "Submitted", even after service restart, so it may be considered congested indefinitely.
Suggestions:
Apparently, when one queue is congested, then all queues get stuck. @ihsaan-ullah Can we confirm the existence of this bug?
There is actually a bug, not simply overload of submissions.
When the bug happens, the workers stopped being able to communicate their status with the main server because of a 500 error.
@ObadaS Do you have some logs to complete this diagnosis?
On going: fix server time (5 minutes late)
To check:
timedatectl status
Set time:
sudo ntpdate ntp.u-psud.fr
[2024-05-28 07:52:15,862: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Preparing', 'status_details': None, 'secret': ''}
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response =
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Failed', 'status_details': 'Failure updating submission data.', 'secret': ''}
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response =
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Destroying submission temp dir: /codabench/tmpuz0or1fz
[2024-05-28 07:52:16,122: ERROR/ForkPoolWorker-155] Task compute_worker_run[...] raised unexpected: SubmissionException('Failure updating submission data.')
Traceback (most recent call last):
File "/compute_worker.py", line 115, in run_wrapper
run.prepare()
File "/compute_worker.py", line 751, in prepare
self._update_status(STATUS_PREPARING)
File "/compute_worker.py", line 359, in _update_status
self._update_submission(data)
File "/compute_worker.py", line 342, in _update_submission
raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
R = retval = fun(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
return self.run(*args, **kwargs)
File "/compute_worker.py", line 123, in run_wrapper
run._update_status(STATUS_FAILED, str(e))
File "/compute_worker.py", line 359, in _update_status
self._update_submission(data)
File "/compute_worker.py", line 342, in _update_submission
raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.
I removed the secret
in the logs, so them being empty is normal.
Also removed the id that was in the [] after compute_worker_run
Conjointly, users reported that their custom workers stopped receiving submissions. It worked again after we cured the production server (#1455).
It happened again, a Django container restart fixed the problem as expected. I have more logs to share now:
[2024-06-06 20:34:28 +0000] [13] [INFO] ('172.31.0.5', 55760) - "WebSocket /submission_output/" [accepted]
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
response = get_response(request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
response = self.process_exception_by_middleware(e, request)
File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
response = wrapped_callback(request, *callback_args, **callback_kwargs)
File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
return view_func(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
return self.dispatch(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
response = self.handle_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
self.raise_uncaught_exception(exc)
File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
response = handler(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
return self.update(request, *args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
self.perform_update(serializer)
File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
serializer.save()
File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
self.instance = self.update(self.instance, validated_data)
File "/app/src/apps/api/serializers/submissions.py", line 228, in update
loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
This is the first submission that is having trouble updating its Status from the Worker. After this error starts, all future updates bugs out with the same error, even tasks that were launched hours before the bug happened. While it seems to be a reddis issue, the logs on the reddis container does not show any abnormality.
While looking at the Django logs, I also found some new errors (not sure if they are related to this issue so I'll add them for now and remove them later if they end up not being relevent) This error starts well before the Django bug related to this issue, which is why I'm not sure if they are related. The log only goes back a few days so I'm not sure when exactly this error starts appearing. It does happen multiple times in the span of days though.
[2024-06-06 20:32:05 +0000] [12] [INFO] ('172.31.0.5', 52770) - "WebSocket /submission_output/" [accepted]
Error in opening handshake
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 130, in handler
path = await self.handshake(
File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 584, in handshake
self.write_http_response(http.HTTPStatus.SWITCHING_PROTOCOLS, response_headers)
File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 264, in write_http_response
self.transport.write(response.encode())
File "uvloop/handles/stream.pyx", line 674, in uvloop.loop.UVStream.write
File "uvloop/handles/handle.pyx", line 159, in uvloop.loop.UVHandle._ensure_alive
RuntimeError: unable to perform operation on <TCPTransport closed=True reading=False [...] >; the handler is closed
[2024-06-06 20:32:05 +0000] [12] [ERROR] Exception in ASGI application
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 157, in run_asgi
result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
return await self.app(scope, receive, send)
File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/asgi2.py", line 7, in __call__
await instance(receive, send)
File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 183, in __call__
return await self.inner(receive, self.send)
File "/usr/local/lib/python3.8/site-packages/channels/middleware.py", line 41, in coroutine_call
await inner_instance(receive, send)
File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 58, in __call__
await await_many_dispatch(
File "/usr/local/lib/python3.8/site-packages/channels/utils.py", line 51, in await_many_dispatch
await dispatch(result)
File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 73, in dispatch
await handler(message)
File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 240, in websocket_disconnect
await self.disconnect(message["code"])
File "/app/src/apps/competitions/consumers.py", line 65, in disconnect
await self.close()
File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 226, in close
await super().send({"type": "websocket.close"})
File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 81, in send
await self.base_send(message)
File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 236, in send
return await self.real_send(message)
File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 222, in asgi_send
await self.close(code)
File "/usr/local/lib/python3.8/site-packages/websockets/protocol.py", line 697, in close
await asyncio.wait_for(
File "/usr/local/lib/python3.8/asyncio/tasks.py", line 475, in wait_for
fut = ensure_future(fut, loop=loop)
File "/usr/local/lib/python3.8/asyncio/tasks.py", line 684, in ensure_future
raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required
The [...] replaces a hexadecimal number which I'm not sure should be kept private or not so I removed it for now
Nice investigation. I wonder why after a first failure then the next submissions fail too.
And the internal server error also only affects submissions. It seems like every other part of the website works correctly except for this when it happens.
Is it possible to get more information from Django in the log concerning the "Internal Server Error" ? I feel like the real error is being hidden, or at least that it's not clear enough with the logs we have.
Hello, I think I have finally found what is causing this issue.
I managed to get some more logs after activating the DEBUG log level on Django and waiting for the problem to appear again.
When things works well, this appears in the log every time a worker connects to Django to update a submission. (I removed the submission_id
value)
Creating tcp connection to ('redis', 6379)
server > Frame(fin=True, opcode=1, data=b'{"type": "message", "submission_id": , "data": {"kind": "status_update", "status": "Running"}}', rsv1=False, rsv2=False, rsv3=False)
When the problem happens, this changes a bit, becoming:
Creating tcp connection to ('redis', 6379)
Cancelling waiter (<Future pending cb=[<TaskWakeupMethWrapper object at 0x72656d5ec580>()]>, [None, None])
Internal Server Error: /api/submissions/
After some research, I found this issue : https://github.com/django/channels_redis/issues/279#issuecomment-973147677 This is for a package that we are actually using in the project.
Basically, the connection to Redis does not get closed properly. Eventually, no more connection can be created.
The comment says the issue appeared in aioredis 1.3.x
(we are using aioredis 1.3.1
through channels-redis 3.2.0
).
So this will happen more quickly the more submissions we have, having to restart Django to temporarily reset the connections number.
Update the channels-redis
to version 4.0.0
minimum, which has the aioredis 2.0
that fixes the issue.
However, this does not seem to be as easy. I tried to update the packages manually on my PC, but it seems like one of the required package, django-redis-cache
is not maintained anymore, but is needed when trying to upgrade our packages to use channels-redis 4.0.0
. It probably has been migrated to another name or absorbed by another package, so we will need to investigate a bit to successfully upgrade the package and fix the problem once and for all (hopefully)
We frequently notice that submissions get stuck in the queue. This happens with RabbitMQ when incoming messages rate is greater then outgoing messages.
How can we solve this issue?