srun: error: Unable to resolve "mgmt": Unknown host

chaoyanghe commented 5 years ago

srun: error: Unable to resolve "mgmt": Unknown host
srun: error: Unable to establish control machine address
srun: error: Unable to confirm allocation for job 90: No error
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 90
slurmstepd: error: Unable to resolve "mgmt": Unknown host

When I run my job, I get the errors above. How can I fix this?

christopheredsall commented 5 years ago

Thanks for your patience when encountering all these issues.

Root cause

It seems that the root cause is that the Oracle Linux DHCP updating logic in /etc/dhcp/exit-hooks.d/dhclient-exit-hook-set-hostname.sh has gone awry. It is only supposed to change the "search" line in /etc/resolv.conf if it needs to be different, but every time the DHCP lease is renewed it appends a new line to the file, like this:

; Any changes made to this file will be overwritten whenever the
; DHCP lease is renewed. To persist changes you must update the
; /etc/oci-hostname.conf file. For more information see
;[https://docs.cloud.oracle.com/iaas/Content/Network/Tasks/managingDHCP.htm#notes]
;
; generated by /usr/sbin/dhclient-script
search clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com
nameserver 169.254.169.254
search clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com

search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com

None of those last four lines should be there. The last search line takes precedence, so the full three-subnet search list above them is ignored.
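To check whether a node is affected, it may help to count the search lines and look at the one that wins. This is a minimal sketch, not part of the original diagnosis; the function names count_search_lines and effective_search_line are hypothetical, and both take the file path as an argument so they can be tried on a copy.

```shell
# Print how many "search" lines a resolv.conf-style file contains.
# A healthy file is expected to have exactly one.
count_search_lines() {
    grep -c '^search[[:space:]]' "$1"
}

# Print the last "search" line, which is the one that takes precedence.
effective_search_line() {
    grep '^search[[:space:]]' "$1" | tail -n 1
}

# Example (read-only, run on a node):
#   count_search_lines /etc/resolv.conf
#   effective_search_line /etc/resolv.conf
```

A count greater than one, with a last line that lacks some of the subnets, would match the symptom described here.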

Fix

We have a fix in development that removes the need for three separate subnets and therefore eliminates the problems of DNS resolution in different subnets.

Workaround

However, to get your existing cluster working properly, we can fix /etc/resolv.conf and prevent dhclient from changing it again. Finally, we'll need to restart slurmd on the compute nodes to get it talking to the mgmt node again.

I'm just trying to come up with a procedure for that.

chaoyanghe commented 5 years ago

@christopheredsall Cool! I am glad you found the crux of the problem. Please let me know if you have already fixed this issue on my specific cluster. I am waiting on this cluster to meet my deadline :-)

christopheredsall commented 5 years ago

OK, try this as the opc user on the management node:

clush -w @compute,@role:mgmt sudo sed -i -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf
clush -w @compute,@role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com"
clush -w @compute sudo systemctl restart slurmd

chaoyanghe commented 5 years ago

mgmt: Warning: Permanently added 'mgmt,10.1.0.3' (ECDSA) to the list of known hosts.
vm-standard2-24-ad2-0005: ssh: Could not resolve hostname vm-standard2-24-ad2-0005: Name or service not known
clush: vm-standard2-24-ad2-0005: exited with exit code 255
vm-standard2-24-ad3-0001: ssh: Could not resolve hostname vm-standard2-24-ad3-0001: Name or service not known
clush: vm-standard2-24-ad3-0001: exited with exit code 255
vm-standard2-24-ad3-0002: ssh: Could not resolve hostname vm-standard2-24-ad3-0002: Name or service not known
clush: vm-standard2-24-ad3-0002: exited with exit code 255
vm-standard2-24-ad2-0006: ssh: Could not resolve hostname vm-standard2-24-ad2-0006: Name or service not known
clush: vm-standard2-24-ad2-0006: exited with exit code 255
vm-standard2-24-ad3-0003: ssh: Could not resolve hostname vm-standard2-24-ad3-0003: Name or service not known
clush: vm-standard2-24-ad3-0003: exited with exit code 255
mgmt: sed: -e expression #1, char 19: missing command
clush: mgmt: exited with exit code 1

chaoyanghe commented 5 years ago

christopheredsall commented 5 years ago

Ah, looks like the management node can't resolve names in AD3.

We'll need those three lines in a different order. I put the PRESERVE_HOSTINFO line first in case it tried to change the file while we were editing it. Try the second line, then the first one, then the third:

clush -w @compute,@role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com"
clush -w @compute,@role:mgmt sudo sed -i -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf
clush -w @compute sudo systemctl restart slurmd

chaoyanghe commented 5 years ago

christopheredsall commented 5 years ago

I made two errors: the filename is missing from the sed command, and I think we will have to do the management node first on its own, then the compute nodes:

clush -w @role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com" /etc/resolv.conf
clush -w @compute,@role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com" /etc/resolv.conf
clush -w @compute,@role:mgmt sudo sed -i -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf
clush -w @compute sudo systemctl restart slurmd

Also, it's getting a bit late local time here. We may have to pick this up tomorrow.

chaoyanghe commented 5 years ago

[opc@mgmt ~]$ clush -w @role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com" /etc/resolv.conf
mgmt: sed: can't read subnetad1.clustervcn.oraclevcn.com: No such file or directory
mgmt: sed: can't read subnetad2.clustervcn.oraclevcn.com: No such file or directory
mgmt: sed: can't read subnetad3.clustervcn.oraclevcn.com: No such file or directory
clush: mgmt: exited with exit code 2
[opc@mgmt ~]$
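One possible reading of this output, not confirmed anywhere in the thread: the command is handed to a shell on the target node as a single flattened string, so the double quotes typed at the prompt no longer group the sed expression. That second shell would then expand $asearch as an (unset) variable and split the rest on spaces, which would explain why sed tried to open the subnet names as files. The sketch below reproduces that effect locally with sh -c, without involving clush or any system file; flattened_argc is a hypothetical helper, and the example domains are placeholders.

```shell
# Join the arguments with spaces (dropping the original quoting), let a
# second shell parse the resulting string, and print how many words it sees.
flattened_argc() {
    sh -c "set -- $*; echo \$#"
}

# The caller passes four arguments: sed, -e, one quoted expression, one file.
# After flattening, the second shell expands the unset variable $asearch to
# nothing and splits the rest, so it reports a different word count.
flattened_argc sed -e "\$asearch a.example b.example" /etc/resolv.conf
```

If this is what happened here, wrapping the whole remote command in single quotes might keep the expression intact; that variant was not tried in the thread.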

jtsaismith commented 5 years ago

I manually edited /etc/resolv.conf - it was quite mangled, and missing the nameserver. It now contains:

search clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com
nameserver 169.254.169.254
search clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com

I also manually edited /etc/oci-hostname.conf. It now contains:

PRESERVE_HOSTINFO=2

I'm now able to ping the internet (e.g., google.com, github.com, etc.).
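For anyone who would rather script these two manual edits, here is a minimal sketch. It is an assumption-laden variant rather than what was actually run: it drops every existing search line and appends a single one (the manual edit above kept two), and the function names set_search_line and set_preserve_hostinfo are hypothetical. Both functions take the file path as an argument so they can be tried on copies first.

```shell
# Rewrite a resolv.conf-style file so that it has exactly one search line.
# Usage: set_search_line FILE DOMAIN...
set_search_line() {
    file=$1
    shift
    tmp=$(mktemp) || return 1
    # Keep everything except existing search lines, then append the new one.
    sed '/^search[[:space:]]/d' "$file" > "$tmp" &&
        printf 'search %s\n' "$*" >> "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}

# Set PRESERVE_HOSTINFO to VALUE in an oci-hostname.conf-style file.
# Usage: set_preserve_hostinfo FILE VALUE
set_preserve_hostinfo() {
    file=$1
    value=$2
    tmp=$(mktemp) || return 1
    sed "s/^PRESERVE_HOSTINFO=.*/PRESERVE_HOSTINFO=$value/" "$file" > "$tmp" &&
        cat "$tmp" > "$file"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

On a real node these would need root privileges and would be pointed at /etc/resolv.conf and /etc/oci-hostname.conf, followed by the slurmd restart on the compute nodes mentioned earlier.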

milliams commented 4 years ago

We have now made a change so that there is a single subnet containing all ADs so the DNS search path is now simplified. It was fixed in #16.

clusterinthecloud / terraform

srun: error: Unable to resolve "mgmt": Unknown host #23