Closed chaoyanghe closed 4 years ago
Thanks for your patience when encountering all these issues.
It seems that the root cause is that the Oracle Linux DHCP update logic in /etc/dhcp/exit-hooks.d/dhclient-exit-hook-set-hostname.sh
has gone awry. It is only supposed to change the "search" line in /etc/resolv.conf
if it needs to be different, but every time the DHCP lease is renewed it appends a new line to the file, like this:
; Any changes made to this file will be overwritten whenever the
; DHCP lease is renewed. To persist changes you must update the
; /etc/oci-hostname.conf file. For more information see
;[https://docs.cloud.oracle.com/iaas/Content/Network/Tasks/managingDHCP.htm#notes]
;
; generated by /usr/sbin/dhclient-script
search clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com
nameserver 169.254.169.254
search clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
None of those last four lines should be there. The last search line takes precedence.
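A quick way to check a node for this symptom is to count the search lines and print the one that takes effect. This is a minimal sketch: the function names effective_search and count_search_lines are made up for illustration, and the example runs against a scratch file with a shortened copy of the contents above rather than the live /etc/resolv.conf.

```shell
# Print the search line the resolver will use from a resolv.conf-style
# file: the last line whose first field is "search".
effective_search() {
    awk '$1 == "search" { last = $0 } END { if (last != "") print last }' "$1"
}

# Count how many search lines the file has accumulated.
count_search_lines() {
    awk '$1 == "search" { n++ } END { print n + 0 }' "$1"
}

# Scratch file with a shortened copy of the contents shown above.
sample=$(mktemp)
cat > "$sample" <<'EOF'
search clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com
nameserver 169.254.169.254
search clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com
search subnetad2.clustervcn.oraclevcn.com clustervcn.oraclevcn.com
EOF

count_search_lines "$sample"   # prints 3
effective_search "$sample"     # prints the last search line, which lacks subnetad1 and subnetad3
```

On a healthy node the count should be small and the effective line should list every subnet that holds a host you need to reach by short name.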
We have a fix in development that removes the need for three separate subnets and therefore eliminates the problems of DNS resolution in different subnets.
However, to get your existing cluster working properly, we can fix /etc/resolv.conf and prevent dhclient from changing it again. Finally, we'll need to restart slurmd on the compute nodes to get it talking to the mgmt node again.
I'm just trying to come up with a procedure for that.
@christopheredsall Cool! I am glad you found the crux of the problem. Please let me know once you have fixed this issue on my specific cluster. I am waiting on this cluster to meet my deadline :-)
OK, try this as the opc user on the management node:
clush -w @compute,@role:mgmt sudo sed -i -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf
clush -w @compute,@role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com"
clush -w @compute sudo systemctl restart slurmd
mgmt: Warning: Permanently added 'mgmt,10.1.0.3' (ECDSA) to the list of known hosts.
vm-standard2-24-ad2-0005: ssh: Could not resolve hostname vm-standard2-24-ad2-0005: Name or service not known
clush: vm-standard2-24-ad2-0005: exited with exit code 255
vm-standard2-24-ad3-0001: ssh: Could not resolve hostname vm-standard2-24-ad3-0001: Name or service not known
clush: vm-standard2-24-ad3-0001: exited with exit code 255
vm-standard2-24-ad3-0002: ssh: Could not resolve hostname vm-standard2-24-ad3-0002: Name or service not known
clush: vm-standard2-24-ad3-0002: exited with exit code 255
vm-standard2-24-ad2-0006: ssh: Could not resolve hostname vm-standard2-24-ad2-0006: Name or service not known
clush: vm-standard2-24-ad2-0006: exited with exit code 255
vm-standard2-24-ad3-0003: ssh: Could not resolve hostname vm-standard2-24-ad3-0003: Name or service not known
clush: vm-standard2-24-ad3-0003: exited with exit code 255
mgmt: sed: -e expression #1, char 19: missing command
clush: mgmt: exited with exit code 1
Ah, looks like the management node can't resolve names in AD3.
We'll need those three lines in a different order. I put the PRESERVE_HOSTINFO line first in case it tried to change the file while we were editing it. Try the second line, then the first one, then the third:
clush -w @compute,@role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com"
clush -w @compute,@role:mgmt sudo sed -i -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf
clush -w @compute sudo systemctl restart slurmd
I made two errors. The filename is missing from the sed command, and I think we will have to do the management node first on its own, then the compute nodes:
clush -w @role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com" /etc/resolv.conf
clush -w @compute,@role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com" /etc/resolv.conf
clush -w @compute,@role:mgmt sudo sed -i -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf
clush -w @compute sudo systemctl restart slurmd
Also, it's getting a bit late, local time here. We may have to pick this up tomorrow.
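To see what the two sed expressions do before running them across the cluster, they can be tried without -i on scratch files. This is only a sketch: the scratch file contents below are made up and the real /etc/oci-hostname.conf and /etc/resolv.conf will differ. It also assumes GNU sed, since the one-line form of the "a" (append) command is a GNU extension.

```shell
# Scratch copies with made-up contents; the real files will differ.
scratch=$(mktemp -d)
printf 'PRESERVE_HOSTINFO=0\n' > "$scratch/oci-hostname.conf"
printf 'nameserver 169.254.169.254\n' > "$scratch/resolv.conf"

# On lines that mention PRESERVE_HOSTINFO, replace the first 0 with 2.
# Without -i the result goes to stdout and the file is left alone.
sed -e '/PRESERVE_HOSTINFO/ s/0/2/' "$scratch/oci-hostname.conf"

# Append the full search line after the last line ($) of the file.
sed -e '$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com' "$scratch/resolv.conf"
```

Single quotes are used here so the local shell leaves the $ alone, which is why the backslash from the clush commands is not needed.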
[opc@mgmt ~]$ clush -w @role:mgmt sudo sed -i -e "\$asearch clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com" /etc/resolv.conf
mgmt: sed: can't read subnetad1.clustervcn.oraclevcn.com: No such file or directory
mgmt: sed: can't read subnetad2.clustervcn.oraclevcn.com: No such file or directory
mgmt: sed: can't read subnetad3.clustervcn.oraclevcn.com: No such file or directory
clush: mgmt: exited with exit code 2
[opc@mgmt ~]$
I manually edited /etc/resolv.conf - it was quite mangled, and missing the nameserver. It now contains:
search clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com
nameserver 169.254.169.254
search clustervcn.oraclevcn.com subnetad1.clustervcn.oraclevcn.com subnetad2.clustervcn.oraclevcn.com subnetad3.clustervcn.oraclevcn.com
I also manually edited /etc/oci-hostname.conf. It now contains:
PRESERVE_HOSTINFO=2
I'm now able to ping the internet (e.g., google.com, github.com, etc.).
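The "can't read subnetad1..." errors from sed, and the earlier "missing command" error, look consistent with the double quotes being lost before sed ran on the remote host. This is an inference from the error text, not something confirmed in this thread: ssh-style transports generally join the command words with spaces and let the remote shell split them again. The sketch below simulates that locally with sh -c; the helper names count_args and count_args_rejoined are hypothetical.

```shell
# Hypothetical helper: report how many arguments arrived.
count_args() {
    echo "$#"
}

# Simulate a transport that joins the words with spaces and lets a new
# shell parse the result, then report how many arguments that shell sees.
count_args_rejoined() {
    sh -c "set -- $*; echo \$#"
}

# Direct call: the quoted sed script is one argument.
count_args -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf            # prints 3

# Rejoined: the script splits at the space, so sed would receive
# "/PRESERVE_HOSTINFO/" as its whole script and "s/0/2/" as a filename.
count_args_rejoined -e "/PRESERVE_HOSTINFO/ s/0/2/" /etc/oci-hostname.conf   # prints 4

# An inner layer of single quotes survives the second parse.
count_args_rejoined -e "'/PRESERVE_HOSTINFO/ s/0/2/'" /etc/oci-hostname.conf # prints 3
```

If this is what happened, wrapping the remote sed script in an extra layer of single quotes inside the clush command should keep it intact. Try that on one node and inspect the file before running it across the cluster.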
We have now made a change so that there is a single subnet containing all ADs so the DNS search path is now simplified. It was fixed in #16.
srun: error: Unable to resolve "mgmt": Unknown host
srun: error: Unable to establish control machine address
srun: error: Unable to confirm allocation for job 90: No error
srun: Check SLURM_JOB_ID environment variable. Expired or invalid job 90
slurmstepd: error: Unable to resolve "mgmt": Unknown host
When I run my job, it shows the errors above. How can I fix this issue?