codalab / codabench

Codabench is a flexible, easy-to-use and reproducible benchmarking platform. Check our paper at Patterns Cell Press https://hubs.li/Q01fwRWB0
Apache License 2.0

Submissions Stuck Again for Competition 4430 #1666

Open archettialberto opened 7 hours ago

archettialberto commented 7 hours ago

Dear Codabench Team,

We are running the competition https://www.codabench.org/competitions/4430. Submissions are once again getting stuck with Internal Server Error 500. Since 4 PM today, all workers on our servers have been returning errors like the following. How should we handle this issue?

On a side note, following up on #1664: these are the last days of the competition, and the website is becoming extremely slow when trying to load submissions. Is there a way to solve this?

an2dl-worker7  | [2024-11-16 22:06:12,120: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/159973/ with data = {'status': 'Preparing', 'status_details': None, 'secret': '...'}
an2dl-worker7  | [2024-11-16 22:06:13,775: INFO/ForkPoolWorker-1] Submission patch failed with status = 500, and response =
an2dl-worker7  | b'<h1>Server Error (500)</h1>'
an2dl-worker7  | [2024-11-16 22:06:13,775: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/159973/ with data = {'status': 'Failed', 'status_details': 'Failure updating submission data.', 'secret': '...'}
an2dl-worker7  | [2024-11-16 22:06:13,838: INFO/ForkPoolWorker-1] Submission patch failed with status = 500, and response =
an2dl-worker7  | b'<h1>Server Error (500)</h1>'
an2dl-worker7  | [2024-11-16 22:06:13,838: INFO/ForkPoolWorker-1] Destroying submission temp dir: /codabench/tmpkn55zvfz
an2dl-worker7  | [2024-11-16 22:06:13,840: ERROR/ForkPoolWorker-1] Task compute_worker_run[fe0f09c7-6a74-46b8-bc27-cb29a3f74890] raised unexpected: SubmissionException('Failure updating submission data.')
an2dl-worker7  | Traceback (most recent call last):
an2dl-worker7  |   File "/compute_worker.py", line 112, in run_wrapper
an2dl-worker7  |     run.prepare()
an2dl-worker7  |   File "/compute_worker.py", line 765, in prepare
an2dl-worker7  |     self._update_status(STATUS_PREPARING)
an2dl-worker7  |   File "/compute_worker.py", line 356, in _update_status
an2dl-worker7  |     self._update_submission(data)
an2dl-worker7  |   File "/compute_worker.py", line 339, in _update_submission
an2dl-worker7  |     raise SubmissionException("Failure updating submission data.")
an2dl-worker7  | compute_worker.SubmissionException: Failure updating submission data.
an2dl-worker7  |
an2dl-worker7  | During handling of the above exception, another exception occurred:
an2dl-worker7  |
an2dl-worker7  | Traceback (most recent call last):
an2dl-worker7  |   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 385, in trace_task
an2dl-worker7  |     R = retval = fun(*args, **kwargs)
an2dl-worker7  |   File "/usr/local/lib/python3.9/site-packages/celery/app/trace.py", line 650, in __protected_call__
an2dl-worker7  |     return self.run(*args, **kwargs)
an2dl-worker7  |   File "/compute_worker.py", line 120, in run_wrapper
an2dl-worker7  |     run._update_status(STATUS_FAILED, str(e))
an2dl-worker7  |   File "/compute_worker.py", line 356, in _update_status
an2dl-worker7  |     self._update_submission(data)
an2dl-worker7  |   File "/compute_worker.py", line 339, in _update_submission
an2dl-worker7  |     raise SubmissionException("Failure updating submission data.")
an2dl-worker7  | compute_worker.SubmissionException: Failure updating submission data.
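
The traceback above shows the worker giving up after a single failed PATCH, and then failing again when it tries to report the `Failed` status. A common mitigation for transient 5xx responses is to retry with exponential backoff. The following is only a minimal sketch of that idea, not the actual `compute_worker.py` code and not the fix tracked by the maintainers: `patch_with_retry`, `send`, `max_attempts` and `base_delay` are hypothetical names chosen for illustration.

```python
import time
from typing import Callable


def patch_with_retry(
    send: Callable[[], int],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Call send() until it returns a non-5xx status or attempts run out.

    `send` is a hypothetical callable that performs one PATCH request and
    returns the HTTP status code. The last status code seen is returned.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    status = 0
    for attempt in range(max_attempts):
        status = send()
        if status < 500:
            # Success or a client error: retrying would not help.
            return status
        if attempt < max_attempts - 1:
            # Wait 1x, 2x, 4x, ... the base delay before the next attempt.
            sleep(base_delay * 2 ** attempt)
    return status
```

With the `requests` library, `send` could be something like `lambda: requests.patch(url, data).status_code`, where `url` and `data` are whatever the worker already sends. Whether retrying helps here depends on how long the server keeps returning 500.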

ObadaS commented 7 hours ago

Hello, I fixed the issue for now. You can check https://github.com/codalab/codabench/issues/1446 to learn more about this particular problem and to see when we have a more permanent fix.

As for the overall platform performance, we are also working on various optimizations to make it more responsive.

archettialberto commented 7 hours ago

Thank you for the help!

We are expecting a lot of submissions between the 20th and 22nd of November, since we will move to a different phase of our competition. Would deleting submissions from the previous phase improve the overall performance? That way we could at least stop or rerun some submissions manually, as right now the submission list takes minutes to load.
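
While the submission list is too slow to browse, one way to find which submissions need a manual rerun is to scan the worker logs for failed PATCH calls. The sketch below assumes the log lines look like the ones quoted earlier in this thread (an `Updating submission @ .../api/submissions/<id>/` line followed by a `Submission patch failed with status = ...` line from the same worker). The function name `failed_submission_ids` is hypothetical.

```python
import re
from typing import Dict, Iterable

# Patterns taken from the log lines quoted in this thread.
UPDATE_RE = re.compile(r"Updating submission @ \S*/api/submissions/(\d+)/")
FAILED_RE = re.compile(r"Submission patch failed with status = (\d+)")


def failed_submission_ids(lines: Iterable[str]) -> Dict[int, int]:
    """Map submission ID -> number of failed PATCH attempts seen in the log."""
    last_id_by_source: Dict[str, int] = {}
    failures: Dict[int, int] = {}
    for line in lines:
        # The text before the first "|" is the container name, e.g. "an2dl-worker7".
        source = line.split("|", 1)[0].strip()
        update = UPDATE_RE.search(line)
        if update:
            # Remember which submission this worker is currently updating.
            last_id_by_source[source] = int(update.group(1))
            continue
        if FAILED_RE.search(line) and source in last_id_by_source:
            sub_id = last_id_by_source[source]
            failures[sub_id] = failures.get(sub_id, 0) + 1
    return failures
```

Running it over saved worker logs (for example, with `open("worker.log")` as the `lines` argument, where the file name is an assumption) gives the IDs to rerun once the server is responding again.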