codalab / codabench

Codabench is a flexible, easy-to-use and reproducible benchmarking platform. Check our paper at Patterns Cell Press https://hubs.li/Q01fwRWB0
Apache License 2.0
76 stars 28 forks source link

Queue congestion (public queue) #1446

Open ihsaan-ullah opened 6 months ago

ihsaan-ullah commented 6 months ago

We frequently notice that submissions get stuck in the queue. This happens with RabbitMQ when incoming messages rate is greater then outgoing messages.

How can we solve this issue?

Didayolo commented 6 months ago

TODO P1 development:

The "Submitted" count depends on the queue of the competition (including for default queue).

Problem: sometimes it's stuck on "Submitted", even after service restart, so it may be considered congested indefinitely.


Suggestions:


Bug report: all queues blocked

Apparently, when one queue is congested, then all queues get stuck. @ihsaan-ullah Can we confirm the existence of this bug?

Didayolo commented 5 months ago

There is actually a bug, not simply overload of submissions.

When the bug happens, the workers stopped being able to communicate their status with the main server because of a 500 error.

@ObadaS Do you have some logs to complete this diagnosis?

Didayolo commented 5 months ago

On going: fix server time (5 minutes late)

To check:

timedatectl status

Set time:

sudo ntpdate ntp.u-psud.fr
ObadaS commented 5 months ago
[2024-05-28 07:52:15,862: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Preparing', 'status_details': None, 'secret': ''}
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Failed', 'status_details': 'Failure updating submission data.', 'secret': ''}
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Destroying submission temp dir: /codabench/tmpuz0or1fz
[2024-05-28 07:52:16,122: ERROR/ForkPoolWorker-155] Task compute_worker_run[...] raised unexpected: SubmissionException('Failure updating submission data.')
Traceback (most recent call last):
  File "/compute_worker.py", line 115, in run_wrapper
    run.prepare()
  File "/compute_worker.py", line 751, in prepare
    self._update_status(STATUS_PREPARING)
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/compute_worker.py", line 123, in run_wrapper
    run._update_status(STATUS_FAILED, str(e))
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.

I removed the secret in the logs, so them being empty is normal. Also removed the id that was in the [] after compute_worker_run

Didayolo commented 5 months ago

Conjointly, users reported that their custom workers stopped receiving submissions. It worked again after we cured the production server (#1455).

ObadaS commented 5 months ago

It happened again, a Django container restart fixed the problem as expected. I have more logs to share now:

[2024-06-06 20:34:28 +0000] [13] [INFO] ('172.31.0.5', 55760) - "WebSocket /submission_output/" [accepted]
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server

This is the first submission that is having trouble updating its Status from the Worker. After this error starts, all future updates bugs out with the same error, even tasks that were launched hours before the bug happened. While it seems to be a reddis issue, the logs on the reddis container does not show any abnormality.

While looking at the Django logs, I also found some new errors (not sure if they are related to this issue so I'll add them for now and remove them later if they end up not being relevent) This error starts well before the Django bug related to this issue, which is why I'm not sure if they are related. The log only goes back a few days so I'm not sure when exactly this error starts appearing. It does happen multiple times in the span of days though.

[2024-06-06 20:32:05 +0000] [12] [INFO] ('172.31.0.5', 52770) - "WebSocket /submission_output/" [accepted]
Error in opening handshake
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 130, in handler
    path = await self.handshake(
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 584, in handshake
    self.write_http_response(http.HTTPStatus.SWITCHING_PROTOCOLS, response_headers)
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 264, in write_http_response
    self.transport.write(response.encode())
  File "uvloop/handles/stream.pyx", line 674, in uvloop.loop.UVStream.write
  File "uvloop/handles/handle.pyx", line 159, in uvloop.loop.UVHandle._ensure_alive
RuntimeError: unable to perform operation on <TCPTransport closed=True reading=False [...] >; the handler is closed
[2024-06-06 20:32:05 +0000] [12] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 157, in run_asgi
    result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/asgi2.py", line 7, in __call__
    await instance(receive, send)
  File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 183, in __call__
    return await self.inner(receive, self.send)
  File "/usr/local/lib/python3.8/site-packages/channels/middleware.py", line 41, in coroutine_call
    await inner_instance(receive, send)
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 58, in __call__
    await await_many_dispatch(
  File "/usr/local/lib/python3.8/site-packages/channels/utils.py", line 51, in await_many_dispatch
    await dispatch(result)
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 73, in dispatch
    await handler(message)
  File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 240, in websocket_disconnect
    await self.disconnect(message["code"])
  File "/app/src/apps/competitions/consumers.py", line 65, in disconnect
    await self.close()
  File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 226, in close
    await super().send({"type": "websocket.close"})
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 81, in send
    await self.base_send(message)
  File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 236, in send
    return await self.real_send(message)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 222, in asgi_send
    await self.close(code)
  File "/usr/local/lib/python3.8/site-packages/websockets/protocol.py", line 697, in close
    await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 475, in wait_for
    fut = ensure_future(fut, loop=loop)
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 684, in ensure_future
    raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required

The [...] replaces a hexadecimal number which I'm not sure should be kept private or not so I removed it for now

Didayolo commented 5 months ago

Nice investigation. I wonder why after a first failure then the next submissions fail too.

ObadaS commented 5 months ago

And the internal server error also only affects submissions. It seems like every other part of the website works correctly except for this when it happens.

Is it possible to get more information from Django in the log concerning the "Internal Server Error" ? I feel like the real error is being hidden, or at least that it's not clear enough with the logs we have.

ObadaS commented 1 week ago

Hello, I think I have finally found what is causing this issue.

I managed to get some more logs after activating the DEBUG log level on Django and waiting for the problem to appear again.

When things works well, this appears in the log every time a worker connects to Django to update a submission. (I removed the submission_id value)

Creating tcp connection to ('redis', 6379)
server > Frame(fin=True, opcode=1, data=b'{"type": "message", "submission_id": , "data": {"kind": "status_update", "status": "Running"}}', rsv1=False, rsv2=False, rsv3=False)

When the problem happens, this changes a bit, becoming:

Creating tcp connection to ('redis', 6379)
Cancelling waiter (<Future pending cb=[<TaskWakeupMethWrapper object at 0x72656d5ec580>()]>, [None, None])
Internal Server Error: /api/submissions/

After some research, I found this issue : https://github.com/django/channels_redis/issues/279#issuecomment-973147677 This is for a package that we are actually using in the project.

Basically, the connection to Redis does not get closed properly. Eventually, no more connection can be created. The comment says the issue appeared in aioredis 1.3.x (we are using aioredis 1.3.1 through channels-redis 3.2.0).

So this will happen more quickly the more submissions we have, having to restart Django to temporarily reset the connections number.

Solution

Update the channels-redis to version 4.0.0 minimum, which has the aioredis 2.0 that fixes the issue.

However, this does not seem to be as easy. I tried to update the packages manually on my PC, but it seems like one of the required package, django-redis-cache is not maintained anymore, but is needed when trying to upgrade our packages to use channels-redis 4.0.0. It probably has been migrated to another name or absorbed by another package, so we will need to investigate a bit to successfully upgrade the package and fix the problem once and for all (hopefully)