codalab / codabench

Codabench is a flexible, easy-to-use and reproducible benchmarking platform. Check our paper at Patterns Cell Press https://hubs.li/Q01fwRWB0
Apache License 2.0
60 stars 26 forks source link

Queue congestion (public queue) #1446

Open ihsaan-ullah opened 1 month ago

ihsaan-ullah commented 1 month ago

We frequently notice that submissions get stuck in the queue. This happens with RabbitMQ when incoming messages rate is greater then outgoing messages.

How can we solve this issue?

Didayolo commented 1 month ago

TODO P1 development:

The "Submitted" count depends on the queue of the competition (including for default queue).

Problem: sometimes it's stuck on "Submitted", even after service restart, so it may be considered congested indefinitely.


Suggestions:


Bug report: all queues blocked

Apparently, when one queue is congested, then all queues get stuck. @ihsaan-ullah Can we confirm the existence of this bug?

Didayolo commented 1 month ago

There is actually a bug, not simply overload of submissions.

When the bug happens, the workers stopped being able to communicate their status with the main server because of a 500 error.

@ObadaS Do you have some logs to complete this diagnosis?

Didayolo commented 1 month ago

On going: fix server time (5 minutes late)

To check:

timedatectl status

Set time:

sudo ntpdate ntp.u-psud.fr
ObadaS commented 1 month ago
[2024-05-28 07:52:15,862: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Preparing', 'status_details': None, 'secret': ''}
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,003: INFO/ForkPoolWorker-155] Updating submission @ https://www.codabench.org/api/submissions/65174/ with data = {'status': 'Failed', 'status_details': 'Failure updating submission data.', 'secret': ''}
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-05-28 07:52:16,116: INFO/ForkPoolWorker-155] Destroying submission temp dir: /codabench/tmpuz0or1fz
[2024-05-28 07:52:16,122: ERROR/ForkPoolWorker-155] Task compute_worker_run[...] raised unexpected: SubmissionException('Failure updating submission data.')
Traceback (most recent call last):
  File "/compute_worker.py", line 115, in run_wrapper
    run.prepare()
  File "/compute_worker.py", line 751, in prepare
    self._update_status(STATUS_PREPARING)
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/compute_worker.py", line 123, in run_wrapper
    run._update_status(STATUS_FAILED, str(e))
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.

I removed the secret in the logs, so them being empty is normal. Also removed the id that was in the [] after compute_worker_run

Didayolo commented 1 month ago

Conjointly, users reported that their custom workers stopped receiving submissions. It worked again after we cured the production server (#1455).

ObadaS commented 4 weeks ago

It happened again, a Django container restart fixed the problem as expected. I have more logs to share now:

[2024-06-06 20:34:28 +0000] [13] [INFO] ('172.31.0.5', 55760) - "WebSocket /submission_output/" [accepted]
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server
Internal Server Error: /api/submissions/68303/
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/exception.py", line 34, in inner
    response = get_response(request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 115, in _get_response
    response = self.process_exception_by_middleware(e, request)
  File "/usr/local/lib/python3.8/site-packages/django/core/handlers/base.py", line 113, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/usr/local/lib/python3.8/site-packages/django/views/decorators/csrf.py", line 54, in wrapped_view
    return view_func(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/viewsets.py", line 116, in view
    return self.dispatch(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 495, in dispatch
    response = self.handle_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 455, in handle_exception
    self.raise_uncaught_exception(exc)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/views.py", line 492, in dispatch
    response = handler(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 84, in partial_update
    return self.update(request, *args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 70, in update
    self.perform_update(serializer)
  File "/usr/local/lib/python3.8/site-packages/rest_framework/mixins.py", line 80, in perform_update
    serializer.save()
  File "/usr/local/lib/python3.8/site-packages/rest_framework/serializers.py", line 209, in save
    self.instance = self.update(self.instance, validated_data)
  File "/app/src/apps/api/serializers/submissions.py", line 228, in update
    loop.run_until_complete(channel_layer.group_send(f"submission_listening_{submission.owner.pk}", {
  File "uvloop/loop.pyx", line 1517, in uvloop.loop.Loop.run_until_complete
  File "/usr/local/lib/python3.8/site-packages/channels_redis/core.py", line 649, in group_send
    await connection.zremrangebyscore(
aioredis.errors.ConnectionClosedError: Connection has been closed by server

This is the first submission that is having trouble updating its Status from the Worker. After this error starts, all future updates bugs out with the same error, even tasks that were launched hours before the bug happened. While it seems to be a reddis issue, the logs on the reddis container does not show any abnormality.

While looking at the Django logs, I also found some new errors (not sure if they are related to this issue so I'll add them for now and remove them later if they end up not being relevent) This error starts well before the Django bug related to this issue, which is why I'm not sure if they are related. The log only goes back a few days so I'm not sure when exactly this error starts appearing. It does happen multiple times in the span of days though.

[2024-06-06 20:32:05 +0000] [12] [INFO] ('172.31.0.5', 52770) - "WebSocket /submission_output/" [accepted]
Error in opening handshake
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 130, in handler
    path = await self.handshake(
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 584, in handshake
    self.write_http_response(http.HTTPStatus.SWITCHING_PROTOCOLS, response_headers)
  File "/usr/local/lib/python3.8/site-packages/websockets/server.py", line 264, in write_http_response
    self.transport.write(response.encode())
  File "uvloop/handles/stream.pyx", line 674, in uvloop.loop.UVStream.write
  File "uvloop/handles/handle.pyx", line 159, in uvloop.loop.UVHandle._ensure_alive
RuntimeError: unable to perform operation on <TCPTransport closed=True reading=False [...] >; the handler is closed
[2024-06-06 20:32:05 +0000] [12] [ERROR] Exception in ASGI application
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 157, in run_asgi
    result = await self.app(self.scope, self.asgi_receive, self.asgi_send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/middleware/asgi2.py", line 7, in __call__
    await instance(receive, send)
  File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 183, in __call__
    return await self.inner(receive, self.send)
  File "/usr/local/lib/python3.8/site-packages/channels/middleware.py", line 41, in coroutine_call
    await inner_instance(receive, send)
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 58, in __call__
    await await_many_dispatch(
  File "/usr/local/lib/python3.8/site-packages/channels/utils.py", line 51, in await_many_dispatch
    await dispatch(result)
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 73, in dispatch
    await handler(message)
  File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 240, in websocket_disconnect
    await self.disconnect(message["code"])
  File "/app/src/apps/competitions/consumers.py", line 65, in disconnect
    await self.close()
  File "/usr/local/lib/python3.8/site-packages/channels/generic/websocket.py", line 226, in close
    await super().send({"type": "websocket.close"})
  File "/usr/local/lib/python3.8/site-packages/channels/consumer.py", line 81, in send
    await self.base_send(message)
  File "/usr/local/lib/python3.8/site-packages/channels/sessions.py", line 236, in send
    return await self.real_send(message)
  File "/usr/local/lib/python3.8/site-packages/uvicorn/protocols/websockets/websockets_impl.py", line 222, in asgi_send
    await self.close(code)
  File "/usr/local/lib/python3.8/site-packages/websockets/protocol.py", line 697, in close
    await asyncio.wait_for(
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 475, in wait_for
    fut = ensure_future(fut, loop=loop)
  File "/usr/local/lib/python3.8/asyncio/tasks.py", line 684, in ensure_future
    raise TypeError('An asyncio.Future, a coroutine or an awaitable is '
TypeError: An asyncio.Future, a coroutine or an awaitable is required

The [...] replaces a hexadecimal number which I'm not sure should be kept private or not so I removed it for now

Didayolo commented 4 weeks ago

Nice investigation. I wonder why after a first failure then the next submissions fail too.

ObadaS commented 4 weeks ago

And the internal server error also only affects submissions. It seems like every other part of the website works correctly except for this when it happens.

Is it possible to get more information from Django in the log concerning the "Internal Server Error" ? I feel like the real error is being hidden, or at least that it's not clear enough with the logs we have.