codalab / codabench

Codabench is a flexible, easy-to-use, and reproducible benchmarking platform. See our paper in Patterns (Cell Press): https://hubs.li/Q01fwRWB0
Apache License 2.0

Unable to create compute workers #690

Status: Closed (closed by Didayolo 2 years ago)

Didayolo commented 2 years ago

I have trouble creating compute workers.

pyamqp://26df2d3e-345e-472b-[...]-6e1ae58c4202:6a7f18d6-85a0-455f-[...]-9903297abd15@rabbit:5672/f561486f-803a-[...]-ab79-d8fecb6badf1

Shouldn't the domain appear somewhere here? For example, codabench.org instead of rabbit?

I get the following error:

[2022-03-28 14:38:15,413: ERROR/MainProcess] consumer: Cannot connect to amqp://26df2d3e-345e-472b-9d41-6e1ae58c4202:**@rabbit:5672/f561486f-803a-4612-ab79-d8fecb6badf1: failed to resolve broker hostname.
Trying again in 2.00 seconds...

[2022-03-28 14:38:17,438: ERROR/MainProcess] consumer: Cannot connect to amqp://26df2d3e-345e-472b-9d41-6e1ae58c4202:**@rabbit:5672/f561486f-803a-4612-ab79-d8fecb6badf1: failed to resolve broker hostname.
Trying again in 4.00 seconds...

Or, when I edit the broker URL by hand:

[2022-03-28 14:34:47,967: ERROR/MainProcess] consumer: Cannot connect to amqp://[...]:**@codabench.org:5672//: [Errno 99] Cannot assign requested address.
Trying again in 2.00 seconds...

[2022-03-28 14:34:58,097: ERROR/MainProcess] consumer: Cannot connect to amqp://[...]:**@codabench.org:5672//: [Errno 99] Cannot assign requested address.
Trying again in 4.00 seconds...
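
A minimal sketch for narrowing down the two errors above: it splits a broker URL into host, port, and vhost with Python's standard library, then checks whether the host resolves from the machine running the worker. The function names and the placeholder credentials are illustrative assumptions, not part of Codabench. Since rabbit is the name of a service in docker-compose.yml, it would normally only be expected to resolve inside that Compose network.

```python
import socket
from urllib.parse import urlsplit


def describe_broker_url(broker_url):
    """Split a broker URL into the parts that matter for connectivity."""
    parts = urlsplit(broker_url)
    return {
        "host": parts.hostname,
        "port": parts.port,
        # Everything after the first "/" is the virtual host.
        "vhost": parts.path[1:],
    }


def hostname_resolves(host):
    """Return True if this machine can resolve `host` to an address."""
    try:
        socket.getaddrinfo(host, None)
    except socket.gaierror:
        return False
    return True


if __name__ == "__main__":
    # Placeholder credentials and vhost, for illustration only.
    info = describe_broker_url("pyamqp://user:password@rabbit:5672/my-vhost")
    print(info)
    print("resolves:", hostname_resolves(info["host"]))
```

If the host does not resolve on the worker machine, the worker would be expected to fail with the "failed to resolve broker hostname" message shown above.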

Didayolo commented 2 years ago

Also, what is this for?

In .env:

# Location to store submissions/cache -- absolute path!
HOST_DIRECTORY=/your/path/to/codabench/storage

In the setup command:

-v /your/path/to/codabench/storage:/codabench \

EDIT: simply create a folder and reference it as HOST_DIRECTORY. This folder will be shared between the container and the host (compute worker).

Didayolo commented 2 years ago

UPDATE: for a GPU worker, I tried to manually change rabbit to www.codabench.org

Still not working:

[2022-08-31 13:08:45,434: ERROR/MainProcess] consumer: Cannot connect to amqp://2e6b227b-c960-44c4-8104-505e9f45077f:**@www.codabench.org:5672/11401520-1b5e-4290-bc2c-6f17526c0343: [SSL: WRONG_VERSION_NUMBER] wrong version number (_ssl.c:1108).
Trying again in 2.00 seconds...

EDIT: Finally, by removing BROKER_USE_SSL=True from the .env file, the worker is connected to the queue:

[2022-08-31 14:15:08,167: INFO/MainProcess] Connected to amqp://2e6b2[...]-c960-44c[...]:**@www.codabench.org:5672/1140152[...]
[2022-08-31 14:15:08,180: INFO/MainProcess] mingle: searching for neighbors
[2022-08-31 14:15:09,220: INFO/MainProcess] mingle: all alone
[2022-08-31 14:15:09,258: INFO/MainProcess] compute-worker@37ca1eecf812 ready.
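
A hedged sketch of a sanity check for the SSL setting. The WRONG_VERSION_NUMBER error is typically what a TLS client reports when the other end answers in plaintext. By convention, RabbitMQ listens for plain AMQP on port 5672 and for AMQP over TLS on port 5671, so enabling SSL against port 5672 is a likely mismatch. The helper name is hypothetical, and the check assumes the server follows the conventional ports, which this thread does not confirm for codabench.org.

```python
from urllib.parse import urlsplit

AMQP_PLAIN_PORT = 5672  # conventional port for AMQP without TLS
AMQP_TLS_PORT = 5671    # conventional port for AMQP over TLS


def ssl_setting_warning(broker_url, use_ssl):
    """Return a warning string if the SSL flag and port look mismatched.

    Returns None when nothing looks wrong. This is only a heuristic:
    a server may be configured to use non-default ports.
    """
    port = urlsplit(broker_url).port
    if use_ssl and port == AMQP_PLAIN_PORT:
        return "SSL is enabled but the URL targets the plain AMQP port 5672"
    if not use_ssl and port == AMQP_TLS_PORT:
        return "SSL is disabled but the URL targets the TLS port 5671"
    return None


if __name__ == "__main__":
    # Placeholder credentials and vhost, for illustration only.
    url = "pyamqp://user:password@www.codabench.org:5672/my-vhost"
    print(ssl_setting_warning(url, use_ssl=True))
```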

Didayolo commented 2 years ago

@dtuantran @bbearce

What about this? It is working now, right?

bbearce commented 2 years ago

I see on FF that codabench.org is working. Is that what you meant? I also see master is passing!!! :)

Didayolo commented 2 years ago

@bbearce

I was thinking of the setup of compute workers. Is it working fine now? For CPU and GPU? And by just following the wiki instructions?

Thanks

bbearce commented 2 years ago

I had no trouble deploying these by following the wiki. It seems Tuan and Anne-Catherine had issues behind a firewall, which needed special care.

Didayolo commented 2 years ago

I've improved the documentation.

Maybe the remaining problem is the BROKER URL generated by CodaBench:

Shouldn't the domain appear somewhere here? For example, codabench.org instead of rabbit?

bbearce commented 2 years ago

I agree. In docker-compose.yml, under the compute_worker service, we see:

BROKER_URL=pyamqp://${RABBITMQ_DEFAULT_USER}:${RABBITMQ_DEFAULT_PASS}@${RABBITMQ_HOST}:${RABBITMQ_PORT}//

I think it would be better if we used:

BROKER_URL=pyamqp://${RABBITMQ_DEFAULT_USER}:${RABBITMQ_DEFAULT_PASS}@${DOMAIN_NAME}:${RABBITMQ_PORT}//

We should also look into the code that generates BROKER_URLs when queues are created manually, and have it use DOMAIN_NAME as well. Does anyone think this is a bad idea? One potential issue: if rabbit doesn't publish its ports, it may be unreachable from outside. I don't think that is the case, though, since the "rabbit" service in docker-compose.yml publishes its ports. This makes sense, as rabbit is reachable from other VMs.

acletournel commented 2 years ago

If a DOMAIN_NAME is available, it would be a good idea to generate the BROKER_URL from it in the queue management page. For now, I have edited the Queue Management wiki page to explain how the copy-pasted BROKER_URL should be modified (codabench.org instead of rabbit).
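
The manual edit described above can be sketched in Python as follows. This is only an illustration of swapping the host in a copy-pasted BROKER_URL, not the code Codabench uses to generate queue URLs. The function name and the placeholder credentials are assumptions.

```python
from urllib.parse import urlsplit, urlunsplit


def replace_broker_host(broker_url, new_host):
    """Return `broker_url` with its host replaced by `new_host`.

    Credentials, port, and virtual host are kept unchanged.
    """
    parts = urlsplit(broker_url)
    # netloc looks like "user:password@host:port"; split off the credentials.
    userinfo, at, hostport = parts.netloc.rpartition("@")
    # Keep the port (if any) and drop the old host name.
    _old_host, colon, port = hostport.partition(":")
    new_netloc = f"{userinfo}{at}{new_host}{colon}{port}"
    return urlunsplit(parts._replace(netloc=new_netloc))


if __name__ == "__main__":
    # Placeholder credentials and vhost, for illustration only.
    copied = "pyamqp://user:password@rabbit:5672/my-vhost"
    print(replace_broker_host(copied, "codabench.org"))
```

Only the host is touched, so the queue's generated user, password, and vhost identifiers survive the rewrite.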

Didayolo commented 2 years ago

Fix: https://github.com/codalab/codabench/pull/728