codalab / codabench

Codabench is a flexible, easy-to-use and reproducible benchmarking platform. Check our paper at Patterns Cell Press https://hubs.li/Q01fwRWB0

Remote workers don't receive submissions #1455

Closed johanneskruse closed 4 months ago

johanneskruse commented 4 months ago

Dear Codabench team,

My remote workers have stopped receiving any traffic. Is there an explanation, e.g., a recent update? From one day to the next, submissions stopped being processed: everything was fine on May 23, 2024, but the workers stopped working on May 24, 2024.

I have multiple remote workers, and I can see that they are connected when I turn them on and off:

[2024-05-24 19:50:31,650: INFO/MainProcess] missed heartbeat from compute-worker@94d6bfcb71122
[2024-05-24 19:55:00,922: INFO/MainProcess] sync with compute-worker@6e369caec6052

When using the default CPU queue, I can run submissions; however, due to the 20-minute limit, I have to use my own remote workers.

Link to Ekstra Bladet News Recommendation Competition

Best, Johannes

ihsaan-ullah commented 4 months ago

Hi @johanneskruse

This is strange. We haven't changed anything that should break anything. Can you please confirm the following:

  1. Do your submissions work on the default queue?
  2. Are your workers listening, and is the broker URL of your queue configured there? (A quick check is sketched below this list.)
  3. Do you have GPU or CPU workers?
  4. Are you rerunning the submission or resubmitting?
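
For reference, a quick way to check point 2 from the worker machine could look like this (only a sketch, assuming the container is named compute_worker and the .env file used to start it is in the current directory):

# Show which broker URL the worker was started with
grep BROKER_URL .env

# Confirm the container is running and connected to the broker
docker ps --filter name=compute_worker
docker logs --tail 50 compute_worker   # look for "Connected to amqp://..." and "ready."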
johanneskruse commented 4 months ago

hi,

Thanks for the quick response.

  1. Yes, it works on the default queue, but fails due to the 20-minute time limit.
  2. The broker URL was working until 2024-05-24 and I never changed it. The workers are communicating with each other (I have multiple). It had been working perfectly for more than a month.
  3. They are CPU workers.
  4. Yes, I have tried both resubmitting and rerunning submissions.
ihsaan-ullah commented 4 months ago

Where are your workers hosted? Are you using google cloud or another service?

johanneskruse commented 4 months ago

I am using Amazon Web Services (AWS) to run my remote workers on t3.xlarge instances (https://aws.amazon.com/ec2/instance-types/)

ihsaan-ullah commented 4 months ago

Can you do the following to see if this helps:

  1. Create another queue on Codabench.
  2. Stop your compute worker.
  3. Change the broker URL in the .env file on your worker.
  4. Remove the compute worker image on your worker.
  5. Create the worker again by following the instructions (a rough command sketch follows this list).
  6. Submit a new submission.
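
As a rough sketch of steps 2-5 on the worker machine (assumptions: the container is named compute_worker, it was started with --env-file .env, and the new broker URL is copied from the queue page on Codabench):

# Step 2: stop and remove the existing worker container
docker stop compute_worker && docker rm compute_worker

# Step 3: edit BROKER_URL in the .env file so it points to the new queue

# Step 4: remove the compute worker image so a fresh one is pulled
docker rmi codalab/competitions-v2-compute-worker

# Step 5: recreate the worker (same docker run command as in the setup guide)
docker run -d --env-file .env --name compute_worker --restart unless-stopped \
    -v /codabench:/codabench -v /var/run/docker.sock:/var/run/docker.sock \
    --log-opt max-size=50m --log-opt max-file=3 \
    codalab/competitions-v2-compute-worker:latest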
ihsaan-ullah commented 4 months ago

If this does not work, create a compute worker on Google Cloud and see if that works.

johanneskruse commented 4 months ago

I have tried to follow your steps; unfortunately, it didn't change anything. I assume Codabench is run on Google Cloud. What are you running?

ihsaan-ullah commented 4 months ago

Have you tried a Google Cloud worker?

johanneskruse commented 4 months ago

So far I've just been working with AWS and their cloud workers

ihsaan-ullah commented 4 months ago

Please try Google Cloud; that should work. I am not sure what the problem is with AWS. You could also contact AWS support; maybe they can help you.

johanneskruse commented 4 months ago

Just to check if there is anything unusual, these are the output logs when starting my remote worker:

~$ docker logs -f compute_worker

/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(

 -------------- compute-worker@1f29b478b104 v4.4.0 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-5.15.0-1057-aws-x86_64-with-glibc2.34 2024-05-26 06:40:33
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x7fcc8a308a30
- ** ---------- .> transport:   amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> compute-worker   exchange=compute-worker(direct) key=compute-worker

[tasks]
  . compute_worker_run

[2024-05-26 06:40:34,247: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
[2024-05-26 06:40:34,399: INFO/MainProcess] mingle: searching for neighbors
[2024-05-26 06:40:35,774: INFO/MainProcess] mingle: all alone
[2024-05-26 06:40:36,142: INFO/MainProcess] compute-worker@1f29b478b104 ready.
johanneskruse commented 4 months ago

Also, have you made a guide on how to set it up using Google Cloud Compute?

johanneskruse commented 4 months ago

And just for reference, this happened for all my remote workers:

Screenshot 2024-05-26 at 20 17 37
ihsaan-ullah commented 4 months ago

Logs look good.

For Google Cloud, the guidelines are the same. Go to the Google Cloud console, select VM instances, and click Create Instance (select storage, memory, and the location of the VM, and allow HTTP/HTTPS traffic). You can access the VM through a shell provided on the console. The rest of the setup is the same.
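
For illustration, an equivalent CPU worker VM could also be created from the gcloud CLI instead of the console; this is only a sketch, and the machine type, zone, disk size, and image below are assumptions rather than recommended values:

# Create a CPU VM with HTTP/HTTPS traffic allowed (all values are examples)
gcloud compute instances create codabench-worker \
    --zone=europe-west1-b \
    --machine-type=e2-standard-4 \
    --boot-disk-size=100GB \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --tags=http-server,https-server

# Open a shell on the VM; the rest of the worker setup is the same as before
gcloud compute ssh codabench-worker --zone=europe-west1-b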

johanneskruse commented 4 months ago

I've set up a remote worker using Google Cloud; unfortunately, this hasn't changed anything.

ihsaan-ullah commented 4 months ago

Please repeat the Google Cloud setup with a new queue.

johanneskruse commented 4 months ago

With a new queue I get the following:

[2024-05-27 05:03:13,245: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/ff428499-5eab-44cf-8c25-6f2680a0a956
[2024-05-27 05:03:13,966: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 05:03:17,346: INFO/MainProcess] mingle: all alone
[2024-05-27 05:03:17,916: CRITICAL/MainProcess] Unrecoverable error: PreconditionFailed(406, "PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none", (50, 10), 'Queue.declare')
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/celery/worker/worker.py", line 205, in start
    self.blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 369, in start
    return self.obj.start()
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
    blueprint.start(self)
  File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
    step.start(parent)
  File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/tasks.py", line 40, in start
    c.task_consumer = c.app.amqp.TaskConsumer(
  File "/usr/local/lib/python3.8/site-packages/celery/app/amqp.py", line 301, in TaskConsumer
    return self.Consumer(
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 386, in __init__
    self.revive(self.channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 408, in revive
    self.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 421, in declare
    queue.declare()
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 611, in declare
    self._create_queue(nowait=nowait, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 620, in _create_queue
    self.queue_declare(nowait=nowait, passive=False, channel=channel)
  File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 648, in queue_declare
    ret = channel.queue_declare(
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1148, in queue_declare
    return queue_declare_ok_t(*self.wait(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 88, in wait
    self.connection.drain_events(timeout=timeout)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
    while not self.blocking_read(timeout):
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
    return self.on_inbound_frame(frame)
  File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
    callback(channel, method_sig, buf, None)
  File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
    return self.channels[channel_id].dispatch_method(
  File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
    listener(*args)
  File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
    raise error_for_code(
amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none

Do you have any suggestions? I'm not sure if the issue is on Codabench or Google Cloud.

ihsaan-ullah commented 4 months ago

Not sure what is happening there.

Can you please list the steps you are following to set up a worker and link it to your queue? @Didayolo do you have any idea?

By the way, I was recently using Google Cloud workers for a competition and I haven't faced any issue like this.

johanneskruse commented 4 months ago

I follow the guide you have provided step-by-step.

On both AWS and Google Cloud.

  1. Create an instance
  2. docker pull codalab/competitions-v2-compute-worker
  3. Make a .env file
  4. Make a broker URL and copy it to the .env file
  5. Run the docker run \ (...) command

Everything had been working perfectly until Friday.

Didayolo commented 4 months ago

Everything had been working perfectly until Friday.

You mean that you were able to process submissions on your workers in the past, and it stopped working?

UppalAnshuk commented 4 months ago

Yes, that's what has happened at our end. I'm taking over from @johanneskruse.

UppalAnshuk commented 4 months ago

To add a few details: I've set up a new, smaller version of the competition. The submissions for this new dummy competition run on the default queue without any issues, but when I set up a new queue attached to a GCP worker, the submissions are stuck on "Submitting", the worker gets no traffic, and the logs remain unchanged.

The setup of the GCP worker is the same as @ihsaan-ullah has provided above.

Logs look good.

For Google Cloud, the guidelines are the same. Go to the Google Cloud console, select VM instances, and click Create Instance (select storage, memory, and the location of the VM, and allow HTTP/HTTPS traffic). You can access the VM through a shell provided on the console.

We are generally following the steps given here for CPU workers - https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup

Thanks

julianevanneeleman commented 4 months ago

I'm running into this issue too, which can be reproduced by creating a competition using the example in https://github.com/codalab/competition-examples/tree/master/codabench/wheat_seeds.

This creates a benchmark, with the private queue configured as indicated by the GUI.

Following the steps for running a compute worker (https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup), create a .env like this:

# Queue URL
BROKER_URL=<REDACTED>

# Location to store submissions/cache -- absolute path!
HOST_DIRECTORY=/codabench

# If SSL isn't enabled, then comment or remove the following line
BROKER_USE_SSL=True

And spin up a worker locally:

docker run \                                          
    -v /codabench:/codabench \
    -v /var/run/docker.sock:/var/run/docker.sock \
    -d \
    --env-file .env \
    --name compute_worker \
    --restart unless-stopped \
    --log-opt max-size=50m \
    --log-opt max-file=3 \
    codalab/competitions-v2-compute-worker:latest

This container runs with the following logs:

/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!

Please specify a different user using the --uid option.

User information: uid=0 euid=0 gid=0 egid=0

  warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(

 -------------- compute-worker@11bfd8fffc9e v4.4.0 (cliffs)
--- ***** ----- 
-- ******* ---- Linux-6.5.0-1023-oem-x86_64-with-glibc2.34 2024-05-27 10:48:23
- *** --- * --- 
- ** ---------- [config]
- ** ---------- .> app:         __main__:0x76d8c5837a30
- ** ---------- .> transport:   amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
- ** ---------- .> results:     disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** ----- 
 -------------- [queues]
                .> compute-worker   exchange=compute-worker(direct) key=compute-worker

[tasks]
  . compute_worker_run

[2024-05-27 10:48:23,877: INFO/MainProcess] Connected to amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
[2024-05-27 10:48:24,315: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 10:48:25,795: INFO/MainProcess] mingle: all alone
[2024-05-27 10:48:26,267: INFO/MainProcess] compute-worker@11bfd8fffc9e ready.

Then, after submitting the default sample submission (sample_code_submission.zip), it shows up in https://www.codabench.org/server_status:

image

It hangs from here.

julianevanneeleman commented 4 months ago

After reading https://github.com/codalab/codabench/issues/1457 and running the steps in my previous post without setting up a custom queue, I can confirm that even the default queue does not process jobs anymore.

Are all competitions currently halted?

Didayolo commented 4 months ago

@julianevanneeleman

Indeed, we have some issues with the default queue. Right now it is working again, but we are actively investigating the problem to avoid it happening again. You can follow this second problem here: #1446

Didayolo commented 4 months ago

On my side, I am not able to reproduce the problem. I tried several custom queues and several workers, and they are receiving and computing the submissions without problem.

Can you retry? Maybe it was linked to other problems on the platform (queue congestion, ...).

johanneskruse commented 4 months ago

My workers are now receiving submissions again! I haven't changed or done anything on my end.

Thank you for the swift action; I hope it stays stable now. I will follow #1446 closely.

Didayolo commented 4 months ago

Everything is probably linked. I'm closing this issue and keeping the other one open.

johanneskruse commented 3 months ago

Hi @Didayolo - it is happening again. None of my workers are receiving any submissions; they are all in limbo. There are no error logs.

Is there an explanation?

image
Didayolo commented 3 months ago

Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status

I don't know the reason. I'll investigate this.

johanneskruse commented 3 months ago

Hi @Didayolo, thank you for getting back to us!

Is there anything we can do in the meantime, or any way we can help debug this issue?

johanneskruse commented 3 months ago

I have opened a new issue, #1473, with more error logs.

johanneskruse commented 3 months ago

Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status

I don't know the reason. I'll investigate this.

I'm now able to access the server_status; however, it doesn't register the newly assigned queue, and when using RecSys24_competition_v4, I get the error from #1473.

image

johanneskruse commented 3 months ago

I tried to remove RecSys24_competition_v4. Now the queue is always *.

johanneskruse commented 3 months ago

It's now showing the new queue, but there is still no connection to the worker:

image