codalab / codabench

Codabench is a flexible, easy-to-use, and reproducible benchmarking platform. Check out our paper in Patterns (Cell Press): https://hubs.li/Q01fwRWB0
Apache License 2.0

Submission stuck in submitted phase - worker failing to update submission #1314

Closed jpbrunetsx closed 6 months ago

jpbrunetsx commented 9 months ago

Hi,

We are running the following competition: https://www.codabench.org/competitions/1534/

Today the submissions appear stuck in the Submitted phase. Checking the worker logs, it seems the workers receive the submissions correctly but cannot update them. I attach the logs below.

Maybe linked to #1302


[2024-02-08 15:05:46,840: INFO/MainProcess] Received task: compute_worker_run[9db2d13f-e4ee-4323-b3c6-2b0d49455aee]  
[2024-02-08 15:05:46,841: INFO/ForkPoolWorker-1] Received run arguments: {'user_pk': 1231, 'submissions_api_url': 'https://www.codabench.org/api', 'secret': '7fdba395-6611-4196-82a1-1a4cad1dbe24', 'docker_image': 'lipsbenchmark/ml4physim:1.2', 'execution_time_limit': 262800, 'id': 46819, 'is_scoring': False, 'prediction_result': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/prediction_result/2024-02-08-1707404673/cdaf8175f070/prediction_result.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=X8A8iLpNHx4tXUS50pB4zc7gTsE%3D&content-type=application%2Fzip&Expires=1707491074', 'ingestion_program': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-01-30-1706633448/27d7b9f2cec2/metadata.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=gwWA5ByrSLu5%2FeCLdSIRbxo1dok%3D&Expires=1707491074', 'ingestion_only_during_scoring': False, 'program_data': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/dataset/2024-02-08-1707394647/6fbc9578712a/submission_attention.zip?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=cSaTE7wHEUhmPF0qMLAVElKhbOY%3D&Expires=1707491074', 'prediction_stdout': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-02-08-1707404674/a9fb5c5ca0da/prediction_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=HUQibYTrmgk3UIgyoEXXsIS3RZ4%3D&content-type=application%2Fzip&Expires=1707491074', 'prediction_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-02-08-1707404674/345c40654fe6/prediction_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=K1GgvGSSpvIlUhgrStZDuhp5b7k%3D&content-type=application%2Fzip&Expires=1707491074', 'prediction_ingestion_stdout': 
'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-02-08-1707404674/d0fa1602d704/prediction_ingestion_stdout.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=N9qIls22%2FfFNYowCm%2FZ%2F2kSCdM0%3D&content-type=application%2Fzip&Expires=1707491074', 'prediction_ingestion_stderr': 'https://miniodis-rproxy.lisn.upsaclay.fr/coda-v2-prod-private/submission_details/2024-02-08-1707404674/aff12ddbd5d8/prediction_ingestion_stderr.txt?AWSAccessKeyId=EASNOMJFX9QFW4QIY4SL&Signature=BY31I2a0wN5RPzOHdhTEEMDHNqs%3D&content-type=application%2Fzip&Expires=1707491074'}
[2024-02-08 15:05:46,841: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/46819/ with data = {'status': 'Preparing', 'status_details': None, 'secret': '7fdba395-6611-4196-82a1-1a4cad1dbe24'}
[2024-02-08 15:05:51,993: INFO/ForkPoolWorker-1] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-02-08 15:05:51,994: INFO/ForkPoolWorker-1] Updating submission @ https://www.codabench.org/api/submissions/46819/ with data = {'status': 'Failed', 'status_details': 'Failure updating submission data.', 'secret': '7fdba395-6611-4196-82a1-1a4cad1dbe24'}
[2024-02-08 15:05:52,087: INFO/ForkPoolWorker-1] Submission patch failed with status = 500, and response = 
b'<h1>Server Error (500)</h1>'
[2024-02-08 15:05:52,088: INFO/ForkPoolWorker-1] Destroying submission temp dir: /codabench/tmp3d2jx7s6
[2024-02-08 15:05:52,088: ERROR/ForkPoolWorker-1] Task compute_worker_run[9db2d13f-e4ee-4323-b3c6-2b0d49455aee] raised unexpected: SubmissionException('Failure updating submission data.')
Traceback (most recent call last):
  File "/compute_worker.py", line 115, in run_wrapper
    run.prepare()
  File "/compute_worker.py", line 759, in prepare
    self._update_status(STATUS_PREPARING)
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 385, in trace_task
    R = retval = fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/site-packages/celery/app/trace.py", line 650, in __protected_call__
    return self.run(*args, **kwargs)
  File "/compute_worker.py", line 123, in run_wrapper
    run._update_status(STATUS_FAILED, str(e))
  File "/compute_worker.py", line 359, in _update_status
    self._update_submission(data)
  File "/compute_worker.py", line 342, in _update_submission
    raise SubmissionException("Failure updating submission data.")
compute_worker.SubmissionException: Failure updating submission data.
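From the traceback, the worker raises `SubmissionException` on the first failed PATCH, and the subsequent attempt to mark the submission as Failed hits the same 500, so the run dies without any status update. A minimal sketch (not the actual `compute_worker.py` code; function and parameter names are illustrative) of how transient 5xx responses could be retried with backoff before giving up:

```python
import time


class SubmissionException(Exception):
    """Raised when the submission status can no longer be updated."""


def update_submission(do_patch, data, retries=3, backoff=1.0):
    """Attempt a status PATCH, retrying transient server errors.

    `do_patch` is a callable taking the payload and returning the HTTP
    status code (in practice it would wrap requests.patch against the
    submissions API). Only after `retries` consecutive failures do we
    raise, instead of failing the whole run on the first 500.
    """
    for attempt in range(retries):
        status = do_patch(data)
        if status == 200:
            return
        if attempt < retries - 1:
            # Exponential backoff between attempts: backoff, 2*backoff, ...
            time.sleep(backoff * (2 ** attempt))
    raise SubmissionException("Failure updating submission data.")
```

With `retries=1` this degrades to the current behavior; a small retry budget would ride out brief server-side hiccups like the one in the logs above.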
Didayolo commented 9 months ago

Hi,

Thank you for reporting the problem.

We encountered this problem several times recently. I restarted the service and it should be good now. However, you'll need to re-submit the submissions.

We are investigating the problem to avoid requiring a restart in the future.

zhangzhao219 commented 9 months ago

Now the server is down again? Again and again?

Do you still have bugs to work out? I've been participating in a contest on this platform for about a month, and the server has crashed at least twice. I've been unable to submit at least 10 times, with the longest outage lasting three days. During this time there have been all sorts of problems: submissions running for very long periods, getting stuck on Submitted, Submitting, or Scoring, and the site loading very slowly with huge delays in rendering pages. It's not as if there aren't ready-made examples to refer to, including but not limited to Kaggle, Tianchi, etc., so could you please work out the bugs together before pushing things online?

Didayolo commented 9 months ago

@zhangzhao219 Indeed, we have some infrastructure issues to solve at the moment.

By the way, Codabench is a free community project, currently maintained by a team of 5 people. Feel free to join us if you are able to help. It does not make sense to compare it with Kaggle, which is run by more than 1,000 employees and has more than $10 million in revenue.

By the way, the server is up again.

Last point: having your submission "stuck on Submitted" is not necessarily a bug. If a competition organizer does not provide enough computing power to handle the demand of the competition, submissions get queued. Competition organizers are independent from the platform developers and providers, so we cannot be blamed for the lack of resources provided by the competition you are participating in.

curious-broccoli commented 8 months ago

I have a similar (possibly the same) problem. On my own instance, the compute worker sometimes loses its connection, so all submissions get stuck on "Submitted". Simply running `up -d` on the compute worker fixes this. I have had this problem a few times but never checked the logs properly, so I am assuming it is always the same error. If the logs reveal that the shutdown was not related to Codabench or the connection loss, it might be an issue with the VM I use, because I have seen something shut down for no obvious reason on another VM in another project. In that case, feel free to ignore this comment.

Maybe this log can help with the issue:

compute_worker-1  | [2024-03-08 14:45:30,977: INFO/ForkPoolWorker-1] *** PUT RESPONSE: ***
compute_worker-1  | [2024-03-08 14:45:30,977: INFO/ForkPoolWorker-1] response: <Response [200]>
compute_worker-1  | [2024-03-08 14:45:30,977: INFO/ForkPoolWorker-1] content: b''
compute_worker-1  | [2024-03-08 14:45:30,977: INFO/ForkPoolWorker-1] CODALAB_IGNORE_CLEANUP_STEP mode enabled, ignoring clean up of: /codabench/tmpotyaqnlb
compute_worker-1  | [2024-03-08 14:45:30,979: INFO/ForkPoolWorker-1] Task compute_worker_run[0fe5e637-8a1b-4edd-b69c-cfc1104ff103] succeeded in 3.672250556992367s: None
compute_worker-1  | [2024-03-09 01:04:53,190: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
compute_worker-1  | Traceback (most recent call last):
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
compute_worker-1  |     blueprint.start(self)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
compute_worker-1  |     step.start(parent)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 596, in start
compute_worker-1  |     c.loop(*c.loop_args())
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/celery/worker/loops.py", line 83, in asynloop
compute_worker-1  |     next(loop)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/asynchronous/hub.py", line 364, in create_loop
compute_worker-1  |     cb(*cbargs)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/transport/base.py", line 238, in on_readable
compute_worker-1  |     reader(loop)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/kombu/transport/base.py", line 220, in _read
compute_worker-1  |     drain_events(timeout=0)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
compute_worker-1  |     while not self.blocking_read(timeout):
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
compute_worker-1  |     return self.on_inbound_frame(frame)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
compute_worker-1  |     callback(channel, method_sig, buf, None)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
compute_worker-1  |     return self.channels[channel_id].dispatch_method(
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
compute_worker-1  |     listener(*args)
compute_worker-1  |   File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 650, in _on_close
compute_worker-1  |     raise error_for_code(reply_code, reply_text,
compute_worker-1  | amqp.exceptions.ConnectionForced: (0, 0): (320) CONNECTION_FORCED - broker forced connection closure with reason 'shutdown'
compute_worker-1  | 
compute_worker-1  | worker: Hitting Ctrl+C again will terminate all running tasks!
compute_worker-1  | 
compute_worker-1  | worker: Warm shutdown (MainProcess)
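The `CONNECTION_FORCED ... 'shutdown'` line means the broker itself shut down and the Celery worker then did a warm shutdown and exited. One mitigation, sketched below as a hypothetical docker-compose override (the service name `compute_worker` is an assumption; match your own compose file), is to let Docker bring the container back up automatically, so it reconnects once the broker is available again instead of requiring a manual `up -d`:

```yaml
services:
  compute_worker:
    # Restart the container whenever the worker process exits
    # (e.g. after a broker-forced warm shutdown), unless it was
    # stopped explicitly.
    restart: unless-stopped
```

This does not explain why the broker shut down, but it removes the need for manual intervention after each occurrence.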
Didayolo commented 8 months ago

@curious-broccoli To understand your situation better:

* You have your own server? Accessible online by users?

* The default queue runs on the compute worker service from the main VM (the VM of the instance)? Or do you have separated workers?

curious-broccoli commented 8 months ago

> @curious-broccoli To understand better your situation:
>
> * You have your own server? Accessible online by users?
>
> * The default queue runs on the compute worker service from the main VM (the VM of the instance)? Or do you have separated workers?

Yes, I host our own instance on a server that is accessible online (though not globally). Only the default queue is used, and everything runs on the same VM, which also hosts the instance, webserver, etc.

Didayolo commented 8 months ago

I have the feeling that the problem may come from a performance issue. If the VM is already using many resources to run the web server, the queue may end up congested.

Do you have many users and submissions?

It is recommended to have separate workers linked to the default queue of the platform.
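A sketch of what a standalone worker on a separate VM could look like, as a docker-compose fragment. The image name, environment variables, and mounts below are assumptions based on the usual CodaLab v2 worker setup, not verified against the current Codabench docs; in particular, copy the real `BROKER_URL` from your queue's management page rather than the placeholder shown here:

```yaml
services:
  compute_worker:
    # Assumed image name; check the official worker documentation.
    image: codalab/competitions-v2-compute-worker:latest
    restart: unless-stopped
    environment:
      # Placeholder broker URL; use the one shown for your queue.
      - BROKER_URL=pyamqp://user:pass@your-codabench-host:port/vhost
    volumes:
      # The worker launches submission containers via the host Docker daemon.
      - /var/run/docker.sock:/var/run/docker.sock
```

Running workers on machines separate from the web server avoids the resource contention described above, since submission containers no longer compete with Django and the database for CPU and memory.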

See: