Closed garymansell closed 1 year ago
Are you using Azure DNS or a custom DNS server? Could you share a screenshot of the "Advanced Settings" tab on your cluster so I can see what things are set to? It almost looks like your cluster name prefix is in there twice.
Hey Andy! Superstar for getting back to me, thanks.
Am using Azure DNS Server.
This is what I used (defaults apart from the image, as I could not get the AlmaLinux image to work):
I presume that it must be something to do with the Node Prefix setting - as I can see this can be set to null instead of "Cluster Prefix" (which might fix the issue)?
[image: image.png]
Rgds
Gary
On Fri, 9 Dec 2022 at 14:34, anhoward @.***> wrote:
Are you using Azure DNS or a custom DNS server? Could you share a screenshot of the "Advanced Settings" tab on your cluster so I can see what things are set to? It almost looks like your cluster name prefix is in there twice.
— Reply to this email directly, view it on GitHub https://github.com/Azure/cyclecloud-slurm/issues/105#issuecomment-1344381997, or unsubscribe https://github.com/notifications/unsubscribe-auth/AGHUS6NCRDVFICN2L5RAX4DWMM7NVANCNFSM6AAAAAASZHXZ4Q . You are receiving this because you authored the thread.Message ID: @.***>
Yeah - I have just used the null value for "Node Prefix" in the template:
[image: image.png]
And the resulting node hostname now matches the node name, so maybe this will work (I haven't had a chance to test it yet)?:
[image: image.png]
But something is a bit screwy there: surely, even if you select the Node Prefix, the cluster head node still needs to be able to communicate with the nodes.
Also, I edited the cluster in the GUI to change the Node Prefix to Null and then re-deployed, but when I go back to edit/check it again, it has reverted to "Node Prefix" again (even though it built the new nodes without the node prefix):
[image: image.png]
Regards
Gary
For some reason GitHub has swallowed the image you attached :). Could you email me directly at @microsoft.com?
Thanks!
sure - what is your microsoft.com email addy?
Ahh github got clever with what I tried to do. It's just my username here.
This is my first stab at setting up Azure HPC using CycleCloud and Slurm, so forgive me for stupid mistakes...
Also, if this is the wrong place for this - please point me to where I should post support questions...
I have built a simple (default) Slurm cluster using CycleCloud and the nodes start/stop OK, but when I run a simple (hostname) job it just hangs at "Completing".
Debug logging seems to suggest that the Slurm Scheduler node cannot communicate with the nodes:
[azccadmin@SlurmCluster-1-scheduler data]$ srun -N2 -n2 -t00:15:00 -Jgrma-hostname hostname.sh
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-1"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-1, check slurm.conf
srun: error: get_addr_info: getaddrinfo() failed: Name or service not known
srun: error: slurm_set_addr: Unable to resolve "slurmcluster-1-hpc-pg0-2"
srun: error: fwd_tree_thread: can't find address for host slurmcluster-1-hpc-pg0-2, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-2: Can't find an address, check slurm.conf
srun: error: Task launch for StepId=1.0 failed on node slurmcluster-1-hpc-pg0-1: Can't find an address, check slurm.conf
srun: error: Application launch failed: Can't find an address, check slurm.conf
srun: Job step aborted
I cannot ping the node from the Scheduler node:
[azccadmin@SlurmCluster-1-scheduler ~]$ ping slurmcluster-1-hpc-pg0-1
ping: slurmcluster-1-hpc-pg0-1: Name or service not known
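For anyone wanting to reproduce this check without ping, here is a minimal Python sketch of the same lookup. It calls the system resolver through `socket.getaddrinfo`, which is the same kind of lookup that fails in the srun errors above. The function name `resolves` is made up for illustration and is not part of CycleCloud or Slurm.

```python
import socket


def resolves(hostname: str) -> bool:
    """Return True if the system resolver can look up `hostname`.

    Hypothetical helper for illustration only; it just wraps
    socket.getaddrinfo and reports whether the lookup succeeded.
    """
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        # Raised for failures such as "Name or service not known".
        return False
```

Run on the scheduler node, `resolves("slurmcluster-1-hpc-pg0-1")` would be expected to return False in the situation described here, since that is the name ping could not find.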
Slurm is trying to talk to the node as "slurmcluster-1-hpc-pg0-1", but the hostname of the node is actually "slurmcluster-1-slurmcluster-1-hpc-pg0-1"
And I can ping it with this name from the Scheduler node:
[azccadmin@SlurmCluster-1-scheduler data]$ ping slurmcluster-1-slurmcluster-1-hpc-pg0-1
PING slurmcluster-1-slurmcluster-1-hpc-pg0-1.yr5ran05dk5uzlz13kqz0cq4xe.ax.internal.cloudapp.net (192.168.140.6) 56(84) bytes of data.
64 bytes from slurmcluster-1-slurmcluster-1-hpc-pg0-1.internal.cloudapp.net (192.168.140.6): icmp_seq=1 ttl=64 time=1.77 ms
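To make the mismatch concrete, here is a small Python sketch that compares the two names. It assumes, based only on the output above, that the extra prefix is the lower-cased cluster name followed by a hyphen; `strip_doubled_prefix` is a hypothetical helper, not anything CycleCloud provides.

```python
def strip_doubled_prefix(hostname: str, prefix: str) -> str:
    """Drop one copy of `prefix` if `hostname` starts with it twice.

    Assumption (from the thread's output only): the prefix is the cluster
    name, lower-cased, followed by a hyphen. The result is always
    lower-cased; names without a doubled prefix are otherwise unchanged.
    """
    host = hostname.lower()
    pre = prefix.lower()
    if not pre.endswith("-"):
        pre += "-"
    if host.startswith(pre + pre):
        # Remove only the first copy of the prefix.
        return host[len(pre):]
    return host
```

With the names from this post, `strip_doubled_prefix("slurmcluster-1-slurmcluster-1-hpc-pg0-1", "SlurmCluster-1")` gives `slurmcluster-1-hpc-pg0-1`, which is the name Slurm is trying to reach.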
I am using CycleCloud 8.3, which I note has a fix for Slurm NodeName / Azure Hostname, but this still seems to be an issue?
Thanks
Gary