Closed johanneskruse closed 4 months ago
Hi @johanneskruse
This is strange; we haven't changed anything that should break this. Can you please confirm the following:
Hi,
Thanks for the quick response.
Where are your workers hosted? Are you using google cloud or another service?
I am using Amazon Web Services (AWS) to run my remote workers on t3.xlarge instances (https://aws.amazon.com/ec2/instance-types/)
Can you do the following to see if this helps:
If this does not work, create a compute worker on Google Cloud and see if that works.
I have tried to follow your steps; unfortunately, it didn't change anything. I assume Codabench is run on Google Cloud. What are you running?
Have you tried a Google Cloud worker?
So far I've just been working with AWS and their cloud workers
Please try Google Cloud; that should work. I am not sure what the problem is with AWS. You could also contact AWS support; maybe they can help you.
Just to check if there is anything unusual. These are the output logs when starting my remote worker:
~$ docker logs -f compute_worker
/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0
warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(
-------------- compute-worker@1f29b478b104 v4.4.0 (cliffs)
--- ***** -----
-- ******* ---- Linux-5.15.0-1057-aws-x86_64-with-glibc2.34 2024-05-26 06:40:33
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: __main__:0x7fcc8a308a30
- ** ---------- .> transport: amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> compute-worker exchange=compute-worker(direct) key=compute-worker
[tasks]
. compute_worker_run
[2024-05-26 06:40:34,247: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/394a4b70-ae72-4680-8bb8-4c956f7ed2f3
[2024-05-26 06:40:34,399: INFO/MainProcess] mingle: searching for neighbors
[2024-05-26 06:40:35,774: INFO/MainProcess] mingle: all alone
[2024-05-26 06:40:36,142: INFO/MainProcess] compute-worker@1f29b478b104 ready.
Also, have you made a guide on how to set it up using Google Cloud Compute?
And just for reference, this happened for all my remote workers:
Logs look good.
For Google Cloud the guidelines are the same. Go to the Google Cloud console, select VM instances, and click Create Instance (select storage, memory, and the location of the VM, and allow HTTP/HTTPS traffic). You can access the VM through the shell provided in the console. The rest of the setup is the same.
I've set up remote working using Google Cloud; however, unfortunately, this hasn't changed anything.
Please repeat the Google Cloud setup with a new queue.
With a new queue I get the following:
[2024-05-27 05:03:13,245: INFO/MainProcess] Connected to amqp://63a35e45-cb28-4eed-9c2c-af8072bf9d9c:**@www.codabench.org:5672/ff428499-5eab-44cf-8c25-6f2680a0a956
[2024-05-27 05:03:13,966: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 05:03:17,346: INFO/MainProcess] mingle: all alone
[2024-05-27 05:03:17,916: CRITICAL/MainProcess] Unrecoverable error: PreconditionFailed(406, "PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none", (50, 10), 'Queue.declare')
Traceback (most recent call last):
File "/usr/local/lib/python3.8/site-packages/celery/worker/worker.py", line 205, in start
self.blueprint.start(self)
File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 369, in start
return self.obj.start()
File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/usr/local/lib/python3.8/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/usr/local/lib/python3.8/site-packages/celery/worker/consumer/tasks.py", line 40, in start
c.task_consumer = c.app.amqp.TaskConsumer(
File "/usr/local/lib/python3.8/site-packages/celery/app/amqp.py", line 301, in TaskConsumer
return self.Consumer(
File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 386, in __init__
self.revive(self.channel)
File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 408, in revive
self.declare()
File "/usr/local/lib/python3.8/site-packages/kombu/messaging.py", line 421, in declare
queue.declare()
File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 611, in declare
self._create_queue(nowait=nowait, channel=channel)
File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 620, in _create_queue
self.queue_declare(nowait=nowait, passive=False, channel=channel)
File "/usr/local/lib/python3.8/site-packages/kombu/entity.py", line 648, in queue_declare
ret = channel.queue_declare(
File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 1148, in queue_declare
return queue_declare_ok_t(*self.wait(
File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 88, in wait
self.connection.drain_events(timeout=timeout)
File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 508, in drain_events
while not self.blocking_read(timeout):
File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 514, in blocking_read
return self.on_inbound_frame(frame)
File "/usr/local/lib/python3.8/site-packages/amqp/method_framing.py", line 55, in on_frame
callback(channel, method_sig, buf, None)
File "/usr/local/lib/python3.8/site-packages/amqp/connection.py", line 520, in on_inbound_method
return self.channels[channel_id].dispatch_method(
File "/usr/local/lib/python3.8/site-packages/amqp/abstract_channel.py", line 145, in dispatch_method
listener(*args)
File "/usr/local/lib/python3.8/site-packages/amqp/channel.py", line 279, in _on_close
raise error_for_code(
amqp.exceptions.PreconditionFailed: Queue.declare: (406) PRECONDITION_FAILED - inequivalent arg 'x-max-priority' for queue 'compute-worker' in vhost 'ff428499-5eab-44cf-8c25-6f2680a0a956': received the value '10' of type 'signedint' but current is none
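For reference, the 406 means the broker already holds a queue named `compute-worker` in that vhost declared without a priority argument, while the worker re-declares it with `x-max-priority: 10`; RabbitMQ rejects any re-declaration with inequivalent arguments. A minimal sketch of that server-side equivalence check (all names are illustrative, not RabbitMQ's actual implementation):

```python
# Sketch of RabbitMQ's argument-equivalence check on queue.declare:
# re-declaring an existing queue with different arguments fails with 406.
class PreconditionFailed(Exception):
    pass

def declare_queue(existing: dict, name: str, arguments: dict) -> dict:
    """Declare `name`; raise if it already exists with different arguments."""
    if name in existing and existing[name] != arguments:
        raise PreconditionFailed(
            f"406 PRECONDITION_FAILED - inequivalent arg for queue '{name}'"
        )
    existing.setdefault(name, arguments)
    return existing[name]

# The queue was first created with no arguments ...
broker_state = {"compute-worker": {}}
# ... so a worker re-declaring it with x-max-priority=10 is rejected.
try:
    declare_queue(broker_state, "compute-worker", {"x-max-priority": 10})
except PreconditionFailed as exc:
    print(exc)
```

This also suggests why a freshly created queue (never declared without the priority argument) would not hit the error.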
Do you have any suggestions? I'm not sure if the problem is on Codabench's side or Google Cloud's.
Not sure what is happening there.
Can you please list the steps you are following to set up a worker and link it to your queue? @Didayolo do you have any idea?
By the way, I was recently using Google Cloud workers for a competition and I haven't faced any issue like this.
I follow the guide you have provided step-by-step.
On both AWS and Google Cloud.
My `.env` file and `docker run \ (...)` command follow the guide.
Everything had been working perfectly until Friday.
You mean that you were able to process submissions on your workers in the past, and it stopped working?
Yes, that's what has happened at our end. Taking over from @johanneskruse .
Adding a few details: I've set up a new, smaller version of the competition. Submissions for this new dummy competition run on the default queue without any issues, but when I set up a new queue attached to a GCP worker, the submissions are stuck on "Submitting", the worker gets no traffic, and the logs remain unchanged.
The setup of the GCP worker is the same as @ihsaan-ullah has described above.
> Logs look good.
> For Google cloud the guidelines are the same. You have to go to google cloud console. Select VM instances. Click Create Instance (select storage, memory, location of VM, allow http/https traffic). You can access the VM through a shell provided on the console.
We are generally following the steps given here for CPU workers - https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup
Thanks
I'm running into this issue too, which can be reproduced by creating a competition using the example in https://github.com/codalab/competition-examples/tree/master/codabench/wheat_seeds.
1. Add a `queue` field to the `competition.yaml` file in the example folder, with the value set to the Vhost of the queue.
2. Run `zip -r comp.zip .` inside the example folder.
3. Upload `comp.zip` in https://www.codabench.org/competitions/upload/

This creates a benchmark, with the private queue configured as indicated by the GUI.
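The `queue:` line can also be appended programmatically before zipping; a hedged sketch, where both the file content and the Vhost value are placeholders rather than the real example's:

```python
# Sketch: append a `queue` field to competition.yaml, with the value set to
# the queue's Vhost from the Codabench queue management page.
from pathlib import Path

yaml_path = Path("competition.yaml")
yaml_path.write_text("title: Wheat Seeds example\n")  # stand-in content
vhost = "00000000-0000-0000-0000-000000000000"        # placeholder Vhost

with yaml_path.open("a") as f:
    f.write(f"queue: {vhost}\n")

print(yaml_path.read_text())
```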
Following the steps for running a compute worker (https://github.com/codalab/codabench/wiki/Compute-Worker-Management---Setup), create a `.env` file like this:
# Queue URL
BROKER_URL=<REDACTED>
# Location to store submissions/cache -- absolute path!
HOST_DIRECTORY=/codabench
# If SSL isn't enabled, then comment or remove the following line
BROKER_USE_SSL=True
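Before starting the container, one sanity check is that the vhost segment at the end of `BROKER_URL` matches the Vhost shown for the queue in the Codabench GUI; a small sketch using a made-up URL (the real one stays redacted):

```python
# Sketch: split a broker URL into host/port/vhost for comparison against the
# queue's Vhost in the Codabench queue management page. Example URL only.
from urllib.parse import urlparse

broker_url = "pyamqp://user:secret@www.codabench.org:5672/my-vhost-uuid"

parsed = urlparse(broker_url)
vhost = parsed.path.lstrip("/")  # the path component is the vhost
print(f"host={parsed.hostname} port={parsed.port} vhost={vhost}")
```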
And spin up a worker locally:
docker run \
-v /codabench:/codabench \
-v /var/run/docker.sock:/var/run/docker.sock \
-d \
--env-file .env \
--name compute_worker \
--restart unless-stopped \
--log-opt max-size=50m \
--log-opt max-file=3 \
codalab/competitions-v2-compute-worker:latest
This container runs with the following logs:
/usr/local/lib/python3.8/site-packages/celery/platforms.py:800: RuntimeWarning: You're running the worker with superuser privileges: this is
absolutely not recommended!
Please specify a different user using the --uid option.
User information: uid=0 euid=0 gid=0 egid=0
warnings.warn(RuntimeWarning(ROOT_DISCOURAGED.format(
-------------- compute-worker@11bfd8fffc9e v4.4.0 (cliffs)
--- ***** -----
-- ******* ---- Linux-6.5.0-1023-oem-x86_64-with-glibc2.34 2024-05-27 10:48:23
- *** --- * ---
- ** ---------- [config]
- ** ---------- .> app: __main__:0x76d8c5837a30
- ** ---------- .> transport: amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
- ** ---------- .> results: disabled://
- *** --- * --- .> concurrency: 1 (prefork)
-- ******* ---- .> task events: OFF (enable -E to monitor tasks in this worker)
--- ***** -----
-------------- [queues]
.> compute-worker exchange=compute-worker(direct) key=compute-worker
[tasks]
. compute_worker_run
[2024-05-27 10:48:23,877: INFO/MainProcess] Connected to amqp://f6fb6c54-275d-4137-8212-f3fe87e4fc24:**@www.codabench.org:5672/699b95f9-b3a3-40a2-9701-02ebcd7d3158
[2024-05-27 10:48:24,315: INFO/MainProcess] mingle: searching for neighbors
[2024-05-27 10:48:25,795: INFO/MainProcess] mingle: all alone
[2024-05-27 10:48:26,267: INFO/MainProcess] compute-worker@11bfd8fffc9e ready.
Then, submitting the default sample `sample_code_submission.zip`, it shows up in https://www.codabench.org/server_status:
It hangs from here.
After reading https://github.com/codalab/codabench/issues/1457 and running the steps in my previous post without setting up a custom queue, I can confirm that even the default queue does not process jobs anymore.
Are all competitions currently halted?
@julianevanneeleman
Indeed, we have some issues with the default queue. Right now it is working again, but we are actively investigating the problem to avoid it happening again. You can follow this second problem here: #1446
On my side I am not able to reproduce the problem. I tried several custom queues and several workers, and they are receiving and computing the submissions without problem.
Can you retry? Maybe it was linked to other problems on the platform (queue congestion, etc.).
My workers are now receiving submissions again! I haven't changed or done anything on my end.
Thank you for the swift actions; I hope it stays stable for now. I will follow #1446 closely.
Everything is probably linked. I'm closing this issue and keeping the other one open.
Hi @Didayolo - it is happening again. None of my workers are receiving any submissions; they are all in limbo. No error logs.
Is there an explanation?
Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status
I don't know the reason. I'll investigate this.
Hi @Didayolo, thank you for getting back!
Is there anything we can do in the meantime, or any way to help debug this issue?
I did start a new issue #1473 with more error logs.
> Indeed I can see that the submissions are stuck in "Submitted" on your queue: https://www.codabench.org/server_status
> I don't know the reason. I'll investigate this.
I'm now able to access the `server_status` page; however, it doesn't register the newly assigned queue, and when using `RecSys24_competition_v4`, I get the error from #1473.
I tried to remove `RecSys24_competition_v4`. Now the queue is always `*`.
It's now showing the new queue, but there is still no connection to the worker:
Dear Codabench team,
My remote workers have stopped receiving any traffic. Is there an explanation, a recent update, etc.? From one day to the next, submissions are not being processed. Everything was fine on May 23, 2024, but the workers stopped working on May 24, 2024.
I have multiple remote workers, and I do see that they are connected when turning on/off:
When using the default CPU queue, I can run submissions; however, due to the 20-minute limit, I have to use my own remote workers.
Link to Ekstra Bladet News Recommendation Competition
Best, Johannes