rgov opened this issue 2 years ago
I just want to confirm that on the jobs page (https://viame.kitware.com/#/jobs) you toggled on "Enable Private Runner Queue" and then ran the docker quickstart command with your proper credentials
I just did some testing with my local machine to make sure the private queues are working, and they seem to be.
Could it be that the internal HPC node isn't allowed to connect to some outside servers? The AMQP broker on AWS handles management, but when a task is received it tells your local machine to grab the data from the server and transfer it for the job to complete. The Celery queue should be specified as `username@private` (https://github.com/Kitware/dive/blob/9bac02336c0482fae273b9e6beab1ebe7a9e1448/server/dive_tasks/celeryconfig.py#L37) unless there are other things missing.
Let me know if you have other questions about this or if I can help further.
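The queue-name convention described above can be sketched with a tiny helper (hypothetical function name; the real logic lives in the linked `celeryconfig.py`):

```python
# Minimal sketch of the private-queue naming convention: the user's login
# followed by "@private". The helper name is illustrative, not DIVE's API.
def private_queue_name(login: str) -> str:
    return f"{login}@private"

print(private_queue_name("alice"))  # alice@private
```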
> I just want to confirm that on the jobs page (https://viame.kitware.com/#/jobs) you toggled on "Enable Private Runner Queue" and then ran the docker quickstart command with your proper credentials
Yes, I believe so. Full disclosure: I am using a different containerization technology (Singularity) with the official viame-worker container image. It shouldn't affect something like this.
> Could it be that the internal HPC node isn't allowed to connect to some outside servers?
I don't think so. It is able to connect to Amazon, but it is specifically getting a 403 ACCESS_DENIED error, not some other error about being unable to connect.
> The Celery queue should be specified as `username@private` unless there are other things missing.
I have no idea. The celeryconfig is definitely being loaded, because it is picking up the remote transport, etc. for the private queue.
> I just did some testing with my local machine to make sure the private queues are working, and they seem to be.
Have you tried this with a brand new user without an existing queue on the backend? Do you actually see `user@private` being used as the queue name in the Celery logs?
Aside from trying to figure out why the queue name override is not working, any objection to changing the regex? Clearly Celery is capable of generating a `celery@XXX` queue name where `XXX` is not purely hexadecimal. Here's the code that generates it:
https://github.com/celery/kombu/blob/14d395aa859b905874d8b4abd677a4c7ac86e10b/kombu/pidbox.py#L251
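Roughly, kombu's pidbox derives that queue name from the worker's node name plus a fixed suffix (a simplified sketch, not the exact code at the link above):

```python
# Simplified reconstruction of how the pidbox reply queue name is built
# from the Celery node name ("celery@<hostname>") and the mailbox name.
def pidbox_queue(nodename: str, mailbox: str = "celery") -> str:
    return f"{nodename}.{mailbox}.pidbox"

print(pidbox_queue("celery@gpu001"))  # celery@gpu001.celery.pidbox
```

Since the node name embeds the machine's hostname by default, any hostname with non-hex characters produces a queue name the access-control regex rejects.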
As a workaround on my side, I discovered I can override the container's hostname, so I was able to force it to a hostname that passes the regex:
[2022-10-04 14:23:14,243: INFO/MainProcess] celery@deadbeef ready.
To be clear, there is still probably a bug here with DIVE.
I am playing with running the private worker on our HPC cluster. It spins up with:
The access control policy for this queue is given by the following regular expression:
https://github.com/Kitware/dive/blob/bae06509ffa7c547433b6c791242e22e547ad38b/server/rabbitmq_user_queues/views.py#L67
Note that the queue name is `celery@gpu001.celery.pidbox`. This seems to be using my local computer's hostname (`gpu001`), but the regex assumes it is a UUID of some kind, entirely composed of hexadecimal characters: `[a-fA-F0-9-]+`.

Aside: I would recommend using `re.escape(user['login'])` rather than inserting it directly into the regular expression, in case there are special characters.
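To illustrate why escaping matters, here's a sketch with a hypothetical login containing a regex metacharacter (the login value and patterns are examples, not DIVE's actual code):

```python
import re

login = "j.doe"  # hypothetical login containing a regex metacharacter

unsafe = f"^{login}@private$"            # '.' is left as a wildcard
safe = f"^{re.escape(login)}@private$"   # '.' is escaped to a literal

# The unescaped '.' matches any character, so an unrelated name slips through:
print(bool(re.match(unsafe, "jxdoe@private")))  # True (too permissive)
print(bool(re.match(safe, "jxdoe@private")))    # False
print(bool(re.match(safe, "j.doe@private")))    # True
```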