Azure / cyclecloud-slurm

Azure CycleCloud project to enable users to create, configure, and use Slurm HPC clusters.
MIT License

Slurm nodenames do not match azure hostnames - so head node cannot communicate with nodes #105

Closed garymansell closed 1 year ago

garymansell commented 1 year ago

This is my first stab at setting up Azure HPC using CycleCloud and Slurm, so forgive me for stupid mistakes...

Also, if this is the wrong place for this - please point me to where I should post support questions...

I have built a simple (default) Slurm cluster using CycleCloud and the nodes start/stop OK, but when I run a simple (hostname) job it just hangs at "Completing".

Debug logging seems to suggest that the Slurm Scheduler node cannot communicate with the nodes:

[azccadmin@SlurmCluster-1-scheduler data]$ srun -N2 -n2 -t00:15:00 -Jgrma-hostname hostname.sh
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-1"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-1, check slurm.conf
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-2"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-2, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-2: Can't find an address, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-1: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted

I cannot ping the node from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler ~]$ ping slurmcluster-1-hpc-pg0-1
ping: slurmcluster-1-hpc-pg0-1: Name or service not known

Slurm is trying to talk to the node as "slurmcluster-1-hpc-pg0-1", but the hostname of the node is actually "slurmcluster-1-slurmcluster-1-hpc-pg0-1"

And I can ping it with this name from the Scheduler node:

[azccadmin@SlurmCluster-1-scheduler data]$ ping slurmcluster-1-slurmcluster-1-hpc-pg0-1
PING slurmcluster-1-slurmcluster-1-hpc-pg0-1.yr5ran05dk5uzlz13kqz0cq4xe.ax.internal.cloudapp.net (192.168.140.6) 56(84) bytes of data.
64 bytes from slurmcluster-1-slurmcluster-1-hpc-pg0-1.internal.cloudapp.net (192.168.140.6): icmp_seq=1 ttl=64 time=1.77 ms
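
The manual ping checks above could be repeated for a whole list of node names with a small script. This is only a rough sketch, not part of CycleCloud or Slurm: the helper name `unresolved_names` is made up for illustration, and it relies only on Python's standard `socket.getaddrinfo`, the same kind of lookup that srun reports as failing.

```python
import socket


def unresolved_names(names, resolve=socket.getaddrinfo):
    """Return the names that the resolver cannot look up.

    `resolve` defaults to the system resolver; it is a parameter only so
    the check can be exercised without real DNS (an assumption made for
    this sketch).
    """
    missing = []
    for name in names:
        try:
            resolve(name, None)
        except OSError:  # socket.gaierror is a subclass of OSError
            missing.append(name)
    return missing
```

Run on the scheduler node, `unresolved_names(["slurmcluster-1-hpc-pg0-1", "slurmcluster-1-hpc-pg0-2"])` would be expected to return both names while the mismatch exists, and an empty list once the Slurm node names resolve.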

I am using CycleCloud 8.3 - which I note has a fix for Slurm NodeName / Azure Hostname (but this seems to still be an issue)?

Thanks

Gary

anhoward commented 1 year ago

Are you using Azure DNS or a custom DNS server? Could you share a screenshot of the "Advanced Settings" tab on your cluster so I can see what things are set to? It almost looks like your cluster name prefix is in there twice.
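
The "prefix in there twice" suspicion can be checked mechanically. The following is a minimal sketch, assuming the prefix is the cluster name plus a trailing hyphen ("slurmcluster-1-"); that value is inferred from the hostnames in this thread rather than read from CycleCloud, and `has_doubled_prefix` is a hypothetical helper, not an existing API.

```python
def has_doubled_prefix(hostname, prefix):
    """Return True if `hostname` begins with `prefix` twice in a row.

    The comparison is case-insensitive because the cluster appears as
    "SlurmCluster-1" in the shell prompt but lower-case in hostnames.
    """
    p = prefix.lower()
    if not p:
        # An empty prefix cannot be doubled.
        return False
    return hostname.lower().startswith(p + p)


# Assumed prefix: cluster name plus "-" (not confirmed by the thread).
PREFIX = "SlurmCluster-1-"

print(has_doubled_prefix("slurmcluster-1-slurmcluster-1-hpc-pg0-1", PREFIX))
print(has_doubled_prefix("slurmcluster-1-hpc-pg0-1", PREFIX))
```

Under that assumption, the first call reports the Azure hostname as doubled and the second reports the Slurm node name as not doubled.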

garymansell commented 1 year ago

Hey Andy! Superstar for getting back to me, thanks.

Am using Azure DNS Server.

This is what I used (default apart from the image, as I could not get the almalinux to work):

I presume that it must be something to do with the Node Prefix setting - as I can see this can be set to null instead of "Cluster Prefix" (which might fix the issue)?

[image: image.png]

Rgds

Gary

garymansell commented 1 year ago

Yeah - I have just used the null value for "Node Prefix" in the template:

[image: image.png]

And the resulting node hostname matches the node name now - so maybe this will work (I haven't had a chance to test yet):

[image: image.png]

But, something is a bit screwy there - as surely (if you select the Node Prefix) - the cluster head node should/needs to be able to communicate with the nodes still.

Also, I edited the cluster in the GUI to change the Node Prefix to Null and then re-deployed, but when I go back to edit/check it again - it has reverted to "Node Prefix" again (even though it built the new nodes without the node prefix):

[image: image.png]

Regards

Gary

anhoward commented 1 year ago

For some reason Github has swallowed the image you attached :). Could you email me directly at @microsoft.com?

Thanks!

garymansell commented 1 year ago

sure - what is your microsoft.com email addy?

anhoward commented 1 year ago

Ahh github got clever with what I tried to do. It's just my username here.