aau-claaudia / aicloud

Everything related to aicloud

Enable SSH login on compute node user has job on #27

Open · ThomasA opened this issue 1 year ago

ThomasA commented 1 year ago

I believe it is considered "normal" Slurm behaviour to let users log into compute nodes via SSH while they have a job running on the node in question. This should work:

salloc --cpus-per-task=2 --nodelist [node] ssh [node]

Yet, it only works for me on a256-t4-[02-03].srv.aau.dk and nv-ai-04.srv.aau.dk. All the other compute nodes cut me off when I try this:

we12ec@its.aau.dk@ai-fe02:~$ salloc --cpus-per-task=2 --nodelist a256-a40-04.srv.aau.dk ssh a256-a40-04.srv.aau.dk
salloc: Granted job allocation 26263
salloc: Waiting for resource configuration
salloc: Nodes a256-a40-04.srv.aau.dk are ready for job
we12ec@its.aau.dk@a256-a40-04.srv.aau.dk's password: 
Connection closed by 172.21.212.187 port 22
salloc: Relinquishing job allocation 26263
salloc: Job allocation 26263 has been revoked.

Something in the configuration of the latter nodes must differ in a way that breaks this.
We need to figure out what the problem is and fix it. SSH access to compute nodes during running jobs is necessary, for example, to inspect GPU utilisation via nvidia-smi.
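For concreteness, this is roughly what the intended workflow would look like once the node accepts the connection, using the node and job ID from the transcript above, plus a Slurm-only way to run the same check that does not go through sshd at all. The latter assumes the cluster's Slurm release is recent enough to support --overlap, which this thread does not confirm.

# Intended workflow: allocate resources on the node, then SSH in and check GPU load
# (same pattern as above, with nvidia-smi as the remote command).
salloc --cpus-per-task=2 --nodelist a256-a40-04.srv.aau.dk ssh a256-a40-04.srv.aau.dk nvidia-smi

# Possible workaround while SSH is refused: attach an extra step to the already
# running allocation via Slurm itself (assumes a Slurm version with --overlap).
srun --jobid=26263 --overlap nvidia-smi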

fasmide commented 1 year ago

I've looked around on the current cluster and, indeed, there seem to be some differences in their respective sssd.conf files. These files are managed by another Ansible project, so I'm not sure how to fix this in the long run; for now we will simply have to change the deployed files by hand.
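As a rough sketch of how the differences could be located, something like the following would compare the deployed files across nodes. This assumes root SSH access from the frontend (sssd.conf is normally mode 0600) and uses only node names mentioned in this thread; the real node list would come from the cluster inventory.

# Hypothetical sketch: checksum the deployed sssd.conf on a working node and a
# failing node, then diff the ones that differ. Assumes root SSH access.
for node in a256-t4-02.srv.aau.dk a256-a40-04.srv.aau.dk nv-ai-04.srv.aau.dk; do
    echo "== $node =="
    ssh root@"$node" md5sum /etc/sssd/sssd.conf
done

# Diff a known-good node against a failing one:
diff <(ssh root@a256-t4-02.srv.aau.dk cat /etc/sssd/sssd.conf) \
     <(ssh root@a256-a40-04.srv.aau.dk cat /etc/sssd/sssd.conf)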

We should keep this issue open for a long-term solution.

ThomasA commented 1 year ago

It works for me now. Maybe we should document what the fix is, so it can eventually be included in the Ansible project that is responsible for that part of the set-up.
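If the manual change gets written down for later inclusion in the Ansible project, a simple verification step could accompany it, along these lines. The exact sssd.conf keys that were changed are not recorded in this thread, so only the probe is sketched here; the node list is an example taken from the thread, not the full inventory.

# Hypothetical verification sketch: repeat the salloc + ssh probe from the
# original report on each compute node after the fix has been applied.
# Each iteration waits for an allocation, runs one command over SSH, and exits.
for node in a256-a40-04.srv.aau.dk nv-ai-04.srv.aau.dk; do
    echo "== $node =="
    salloc --cpus-per-task=2 --nodelist "$node" ssh "$node" hostname
done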