rgov opened this issue 2 years ago
I just want to confirm that on the jobs page (https://viame.kitware.com/#/jobs) you toggled on "Enable Private Runner Queue" and then ran the docker quickstart command with your proper credentials
I just did some testing with my local machine to make sure the private queues are working, and they seem to be.
Could it be that the internal HPC node isn't allowed to connect to some outside servers? The AMQP broker on AWS handles management, but when a task is received it tells your local machine to grab the data from the server and transfer it for the job to complete. The Celery queue should be specified as `username@private` (https://github.com/Kitware/dive/blob/9bac02336c0482fae273b9e6beab1ebe7a9e1448/server/dive_tasks/celeryconfig.py#L37) unless there are other things missing.
Let me know if you have other questions about this or if I can help further.
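The queue-name convention described above can be sketched with a tiny helper (hypothetical function name; the real logic lives in the linked `celeryconfig.py`):

```python
# Minimal sketch of the private-queue naming convention: the user's login
# followed by "@private". The helper name is illustrative, not DIVE's API.
def private_queue_name(login: str) -> str:
    return f"{login}@private"

print(private_queue_name("alice"))  # alice@private
```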
> I just want to confirm that on the jobs page (https://viame.kitware.com/#/jobs) you toggled on "Enable Private Runner Queue" and then ran the docker quickstart command with your proper credentials
Yes, I believe so. Full disclosure: I am using a different containerization technology (Singularity) with the official viame-worker container image. It shouldn't affect something like this.
> Could it be that the internal HPC node isn't allowed to connect to some outside servers?
I don't think so. It is able to connect to Amazon, but it is specifically getting a 403 ACCESS_DENIED error, not some other error about being unable to connect.
> The Celery queue should be specified as `username@private` unless there are other things missing.
I have no idea. The celeryconfig is definitely being loaded, because it is picking up the remote transport, etc. for the private queue.
> I just did some testing with my local machine to make sure the private queues are working, and they seem to be.
Have you tried this with a brand new user without an existing queue on the backend? Do you actually see `user@private` being used as the queue name in the Celery logs?
Aside from trying to figure out why the queue name override is not working, any objection to changing the regex? Clearly Celery is capable of generating a `celery@XXX` queue name where `XXX` is not purely hexadecimal. Here's the code that generates it:
https://github.com/celery/kombu/blob/14d395aa859b905874d8b4abd677a4c7ac86e10b/kombu/pidbox.py#L251
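Roughly, kombu's pidbox derives that queue name from the worker's node name plus a fixed suffix (a simplified sketch, not the exact code at the link above):

```python
# Simplified reconstruction of how the pidbox reply queue name is built
# from the Celery node name ("celery@<hostname>") and the mailbox name.
def pidbox_queue(nodename: str, mailbox: str = "celery") -> str:
    return f"{nodename}.{mailbox}.pidbox"

print(pidbox_queue("celery@gpu001"))  # celery@gpu001.celery.pidbox
```

Since the node name embeds the machine's hostname by default, any hostname with non-hex characters produces a queue name the access-control regex rejects.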
As a workaround on my side, I discovered I can override the container's hostname, so I was able to force it to a hostname that passes the regex:
[2022-10-04 14:23:14,243: INFO/MainProcess] celery@deadbeef ready.
To be clear, there is still probably a bug here with DIVE.
I am playing with running the private worker on our HPC cluster. It spins up with:
The access control policy for this queue is given by the following regular expression:
https://github.com/Kitware/dive/blob/bae06509ffa7c547433b6c791242e22e547ad38b/server/rabbitmq_user_queues/views.py#L67
Note that the queue name is `celery@gpu001.celery.pidbox`. This seems to be using my local computer's hostname (`gpu001`), but the regex assumes it is a UUID of some kind, entirely composed of hexadecimal characters: `[a-fA-F0-9-]+`.

Aside: I would recommend using `re.escape(user['login'])` rather than inserting it directly into the regular expression, in case there are special characters.
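To illustrate why escaping matters, here's a sketch with a hypothetical login containing a regex metacharacter (the login value and patterns are examples, not DIVE's actual code):

```python
import re

login = "j.doe"  # hypothetical login containing a regex metacharacter

unsafe = f"^{login}@private$"            # '.' is left as a wildcard
safe = f"^{re.escape(login)}@private$"   # '.' is escaped to a literal

# The unescaped '.' matches any character, so an unrelated name slips through:
print(bool(re.match(unsafe, "jxdoe@private")))  # True (too permissive)
print(bool(re.match(safe, "jxdoe@private")))    # False
print(bool(re.match(safe, "j.doe@private")))    # True
```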