Closed by vschnei 2 weeks ago
Interesting. Can you share your BiBiGrid configuration (`bibigrid.yml`) with me and execute the create process again with the additional arguments `-vv -d`? `-vv` sets the logging level to debug; `-d` prevents the cluster from shutting down once an error appears, waits instead, and also prints out the full stack trace. Please update the outputs accordingly.
`-d` also allows us to try to connect manually to the started but failed master via `ssh -i /home/xaver/.config/bibigrid/keys/tempKey_bibi-{cluster_id} ubuntu@{master_ip}`.
This is not an OpenStack authentication issue, but an SSH connection issue.
Here are the logs:
./bibigrid.sh -i bibigrid.yml -c -vv -d
2024-06-06 15:23:07,771 [DEBUG] Logging verbosity set to 2
2024-06-06 15:23:07,776 [DEBUG] File clouds.yaml found in folder ~/.config/bibigrid.
2024-06-06 15:23:07,778 [DEBUG] File clouds-public.yaml not found in folder ~/.config/bibigrid.
2024-06-06 15:23:07,778 [DEBUG] File clouds-public.yaml not found in folder /etc/bibigrid.
2024-06-06 15:23:07,779 [DEBUG] File clouds-public.yaml not found in folder .
2024-06-06 15:23:07,779 [DEBUG] Loaded clouds.yml and clouds_public.yml
2024-06-06 15:23:07,779 [DEBUG] Using only clouds.yaml since no clouds-public profile is set.
2024-06-06 15:23:09,253 [INFO] Action create selected
2024-06-06 15:23:11,856 [DEBUG] Cluster-ID: 3k2ifh8oi96cwe3
2024-06-06 15:23:11,856 [DEBUG] Keyname: tempKey_bibi-3k2ifh8oi96cwe3
2024-06-06 15:23:11,856 [PRINT] Creating a new cluster takes about 10 or more minutes depending on your cloud provider and your configuration. Be patient.
2024-06-06 15:23:11,856 [INFO] Generating keypair
2024-06-06 15:23:11,873 [DEBUG] Generating public/private ecdsa key pair.
Your identification has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-3k2ifh8oi96cwe3
Your public key has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-3k2ifh8oi96cwe3.pub
The key fingerprint is:
SHA256:n3TmYaI/vMvphRz++VQ6z/Oq83dqPnX9hcJ1vFYDV0I ubuntu@slurm-setup
The key's randomart image is:
+---[ECDSA 256]---+
| .E o|
| . o |
| o. |
| ..+|
| S +.=. o*|
| * Xo..==|
| ..* o.=.+|
| oo+.oo=+|
| .B++B*=B|
+----[SHA256]-----+
2024-06-06 15:23:11,977 [DEBUG] No network found. Getting network by subnet.
2024-06-06 15:23:12,103 [DEBUG] Getting subnets by network.
2024-06-06 15:23:12,537 [INFO] Generating Security Groups
2024-06-06 15:23:15,119 [INFO] Starting instance/server bibigrid-master-3k2ifh8oi96cwe3 on openstack
2024-06-06 15:23:37,445 [INFO] Ansible preparation...
2024-06-06 15:23:40,474 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:41,477 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:43,480 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:48,504 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:56,696 [ERROR] Traceback (most recent call last):
File "/home/ubuntu/bibigrid/bibigrid/core/actions/create.py", line 378, in create
self.initialize_instances()
File "/home/ubuntu/bibigrid/bibigrid/core/actions/create.py", line 236, in initialize_instances
ssh_handler.ansible_preparation(floating_ip=configuration["floating_ip"],
File "/home/ubuntu/bibigrid/bibigrid/core/utility/handler/ssh_handler.py", line 207, in ansible_preparation
execute_ssh(floating_ip, private_key, username, log, gateway, commands, filepaths)
File "/home/ubuntu/bibigrid/bibigrid/core/utility/handler/ssh_handler.py", line 227, in execute_ssh
is_active(client=client, floating_ip_address=floating_ip, username=username, private_key=paramiko_key,
File "/home/ubuntu/bibigrid/bibigrid/core/utility/handler/ssh_handler.py", line 112, in is_active
client.connect(hostname=gateway.get("ip") or floating_ip_address, username=username, pkey=private_key,
File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/client.py", line 485, in connect
self._auth(
File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/client.py", line 818, in _auth
raise saved_exception
File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/client.py", line 716, in _auth
self._transport.auth_publickey(username, pkey)
File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/transport.py", line 1674, in auth_publickey
return self.auth_handler.wait_for_response(my_event)
File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/auth_handler.py", line 248, in wait_for_response
raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed: transport shut down or saw EOF
2024-06-06 15:23:56,696 [ERROR] Unexpected error: 'Authentication failed: transport shut down or saw EOF' (<class 'paramiko.ssh_exception.AuthenticationException'>) Contact a developer!)
DEBUG MODE: Any non-empty input to shutdown cluster 3k2ifh8oi96cwe3. Empty input to exit with cluster still alive:
2024-06-06 15:25:37,512 [PRINT] --- 2 minutes and 29.74 seconds ---
I was able to log into the master VM with `ssh -i ~/.config/bibigrid/keys/tempKey_bibi-3k2ifh8oi96cwe3 ubuntu@{IP}`.
# See https://cloud.denbi.de/wiki/Tutorials/BiBiGrid/ (after update)
# First configuration will be used for general cluster information and must include the master.
# All other configurations mustn't include another master, but exactly one vpnWorker instead (keys like master).
- infrastructure: openstack # former mode.
  cloud: openstack # name of clouds.yaml entry
  # -- BEGIN: GENERAL CLUSTER INFORMATION --
  ## sshPublicKeyFiles listed here will be added to access the cluster. A temporary key is created by bibigrid itself.
  #sshPublicKeyFiles:
  #  - [add path to your ssh key]
  ## Volumes and snapshots that will be mounted to master
  #masterMounts:
  #  - [mount one]
  ## Uncomment if you don't want to assign a public ip to the master; for internal clusters
  # useMasterWithPublicIp: False
  # Other keys
  # localFS: False
  # localDNSlookup: False
  zabbix: True
  nfs: True
  ide: True
  useMasterAsCompute: False
  # master configuration
  masterInstance:
    type: de.NBI default
    image: Ubuntu-22.04
  # -- END: GENERAL CLUSTER INFORMATION --
  # worker configuration
  workerInstances:
    - type: de.NBI default
      image: Ubuntu-22.04
      count: 2
  # Depends on cloud image
  sshUser: ubuntu
  # Depends on your project and cloud site
  subnet: CloudPlatform-subnet-2
  # Depends on the services that run on server start at your cloud site
  # Suspend worker nodes after 9 hours being idle
  slurmConf:
    elastic_scheduling:
      SuspendTime: 32400
Mhm, your configuration looks fine.
I am pretty sure it is not the timeout, but an error that occurs once the master is ready for the SSH connection yet for some reason doesn't accept it. Can you check whether the problem persists if you use BiBiGrid's current `dev` branch (`git checkout dev`) and set `sshTimeout: 12` in your `bibigrid.yml`?
As I said earlier, I don't think the timeout is the issue, but it is better to confirm. Also, use `-vv` again, because we have increased the number of logs during the SSH connection, and that might include helpful information.
Switching to the `dev` branch did not improve anything.
See logs: `./bibigrid.sh -i bibigrid.yml -c -vv`
2024-06-06 16:21:07,273 [DEBUG] Logging verbosity set to 2
2024-06-06 16:21:07,278 [DEBUG] File clouds.yaml found in folder ~/.config/bibigrid.
2024-06-06 16:21:07,281 [DEBUG] File clouds-public.yaml not found in folder ~/.config/bibigrid.
2024-06-06 16:21:07,281 [DEBUG] File clouds-public.yaml not found in folder /etc/bibigrid.
2024-06-06 16:21:07,281 [DEBUG] File clouds-public.yaml not found in folder .
2024-06-06 16:21:07,282 [DEBUG] Loaded clouds.yml and clouds_public.yml
2024-06-06 16:21:07,282 [DEBUG] Using only clouds.yaml since no clouds-public profile is set.
2024-06-06 16:21:08,703 [INFO] Action create selected
2024-06-06 16:21:11,483 [DEBUG] Cluster-ID: 4v723t5qa23xyci
2024-06-06 16:21:11,483 [DEBUG] Keyname: tempKey_bibi-4v723t5qa23xyci
2024-06-06 16:21:11,483 [PRINT] Creating a new cluster takes about 10 or more minutes depending on your cloud provider and your configuration. Please be patient.
2024-06-06 16:21:11,483 [INFO] Generating keypair
2024-06-06 16:21:11,501 [DEBUG] Generating public/private ecdsa key pair.
Your identification has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-4v723t5qa23xyci
Your public key has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-4v723t5qa23xyci.pub
The key fingerprint is:
SHA256:xR2qHuDac4JQQkzdO7JA8FSKlA1u+HkDZ1RBegF7Iw0 ubuntu@slurm-setup
The key's randomart image is:
+---[ECDSA 256]---+
|.*BoE==. . |
|+*o+.=.. . o . |
|o+= B *. + . |
|...B.=oo o |
| +.oo..S |
| o.= . . |
| o + o |
| + |
| |
+----[SHA256]-----+
2024-06-06 16:21:11,603 [DEBUG] No network found. Getting network by subnet.
2024-06-06 16:21:11,733 [DEBUG] Getting subnets by network.
2024-06-06 16:21:12,153 [DEBUG] Creating default files
2024-06-06 16:21:12,153 [DEBUG] Copying ansible.cfg
2024-06-06 16:21:12,154 [DEBUG] Copying slurm.conf
2024-06-06 16:21:12,155 [INFO] Generating Security Groups
2024-06-06 16:21:13,742 [DEBUG] Writing yaml /home/ubuntu/bibigrid/resources/playbook/vars/hosts.yml
2024-06-06 16:21:13,742 [ERROR] Tried to access resource files but couldn't. No such file or directory: [Errno 2] No such file or directory: '/home/ubuntu/bibigrid/resources/playbook/vars/hosts.yml'
2024-06-06 16:21:13,743 [INFO] Deleting Keypair locally...
2024-06-06 16:21:13,743 [INFO] Terminating cluster 4v723t5qa23xyci on cloud openstack
2024-06-06 16:21:14,527 [INFO] Deleting servers on provider openstack...
2024-06-06 16:21:14,528 [INFO] Deleting Keypair on provider openstack...
2024-06-06 16:21:14,596 [INFO] Keypair tempKey_bibi-4v723t5qa23xyci deleted on provider openstack.
2024-06-06 16:21:14,596 [INFO] Deleting security groups on provider openstack...
2024-06-06 16:21:15,309 [INFO] Delete security_group default-4v723t5qa23xyci -> True on openstack.
2024-06-06 16:21:15,310 [INFO] Because you used application credentials to authenticate, no created application credentials need deletion.
2024-06-06 16:21:15,310 [WARNING] Unable to find any servers for cluster-id 4v723t5qa23xyci. Check cluster-id and configuration.
All keys deleted: True
2024-06-06 16:21:15,310 [PRINT] --- 0 minutes and 8.03 seconds ---
Can you `git pull` and retry? The `dev` error is fixed now.
Hi, your changes in the `dev` branch improved the deployment. There is still a problem, though; I am showing only the bottom of the logs.
2024-06-07 09:17:02,802 [DEBUG] REMOTE: TASK [bibigrid : Disable and Stop systemd-resolve] *****************************
2024-06-07 09:17:04,650 [DEBUG] REMOTE: changed: [localhost]
2024-06-07 09:17:04,672 [DEBUG] REMOTE:
2024-06-07 09:17:04,673 [DEBUG] REMOTE: TASK [bibigrid : Remove /etc/resolv.conf] **************************************
2024-06-07 09:17:05,148 [DEBUG] REMOTE: changed: [localhost]
2024-06-07 09:17:05,166 [DEBUG] REMOTE
2024-06-07 09:17:05,167 [DEBUG] REMOTE: TASK [bibigrid : Install dnsmasq] **********************************************
2024-06-07 09:17:14,869 [DEBUG] REMOTE: fatal: [localhost]: FAILED! => {"cache_update_time": 1717744608, "cache_updated": false, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\" install 'dnsmasq=2.90-0ubuntu0.22.04.1'' failed: E: Failed to fetch http://nova.clouds.archive.ubuntu.com/ubuntu/pool/main/d/dns-root-data/dns-root-data_2023112702%7eubuntu0.22.04.1_all.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/d/dnsmasq/dnsmasq-base_2.90-0ubuntu0.22.04.1_amd64.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/d/dnsmasq/dnsmasq_2.90-0ubuntu0.22.04.1_all.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?\n", "rc": 100, "stderr": "E: Failed to fetch http://nova.clouds.archive.ubuntu.com/ubuntu/pool/main/d/dns-root-data/dns-root-data_2023112702%7eubuntu0.22.04.1_all.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/d/dnsmasq/dnsmasq-base_2.90-0ubuntu0.22.04.1_amd64.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/d/dnsmasq/dnsmasq_2.90-0ubuntu0.22.04.1_all.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?\n", "stderr_lines": ["E: Failed to fetch http://nova.clouds.archive.ubuntu.com/ubuntu/pool/main/d/dns-root-data/dns-root-data_2023112702%7eubuntu0.22.04.1_all.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/d/dnsmasq/dnsmasq-base_2.90-0ubuntu0.22.04.1_amd64.deb Temporary failure resolving 
'nova.clouds.archive.ubuntu.com'", "E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/d/dnsmasq/dnsmasq_2.90-0ubuntu0.22.04.1_all.deb Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?"], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nThe following additional packages will be installed:\n dns-root-data dnsmasq-base\nSuggested packages:\n resolvconf\nThe following NEW packages will be installed:\n dns-root-data dnsmasq dnsmasq-base\n0 upgraded, 3 newly installed, 0 to remove and 3 not upgraded.\nNeed to get 399 kB of archives.\nAfter this operation, 1025 kB of additional disk space will be used.\nIgn:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nIgn:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nIgn:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nErr:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\n Temporary failure resolving 
'nova.clouds.archive.ubuntu.com'\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nErr:2 http://security.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\n Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nErr:3 http://security.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\n Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "The following additional packages will be installed:", " dns-root-data dnsmasq-base", "Suggested packages:", " resolvconf", "The following NEW packages will be installed:", " dns-root-data dnsmasq dnsmasq-base", "0 upgraded, 3 newly installed, 0 to remove and 3 not upgraded.", "Need to get 399 kB of archives.", "After this operation, 1025 kB of additional disk space will be used.", "Ign:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Ign:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Ign:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu 
jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Err:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", " Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Err:2 http://security.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", " Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "Err:3 http://security.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", " Temporary failure resolving 'nova.clouds.archive.ubuntu.com'"]}
2024-06-07 09:17:14,873 [DEBUG] REMOTE:
2024-06-07 09:17:14,873 [DEBUG] REMOTE: PLAY RECAP *********************************************************************
2024-06-07 09:17:14,873 [DEBUG] REMOTE: localhost : ok=18 changed=12 unreachable=0 failed=1 skipped=18 rescued=0 ignored=0
2024-06-07 09:17:14,874 [DEBUG] REMOTE:
2024-06-07 09:17:15,058 [WARNING] Execute ansible playbook. Be patient. ... Exit status: 2
2024-06-07 09:17:15,058 [ERROR] Execution of cmd on remote host fails: Execute ansible playbook. Be patient. ... Exit status: 2
2024-06-07 09:17:15,059 [INFO] Deleting Keypair locally...
2024-06-07 09:17:15,059 [INFO] Terminating cluster 5jyhqcz1fpe7ipq on cloud openstack
2024-06-07 09:17:16,472 [INFO] Deleting servers on provider openstack...
2024-06-07 09:17:16,474 [INFO] Trying to terminate Server bibigrid-master-5jyhqcz1fpe7ipq on cloud openstack.
2024-06-07 09:17:19,319 [INFO] Server bibigrid-master-5jyhqcz1fpe7ipq terminated on provider openstack.
2024-06-07 09:17:19,319 [INFO] Deleting Keypair on provider openstack...
2024-06-07 09:17:19,422 [INFO] Keypair tempKey_bibi-5jyhqcz1fpe7ipq deleted on provider openstack.
2024-06-07 09:17:19,423 [INFO] Deleting security groups on provider openstack...
2024-06-07 09:17:22,909 [INFO] Retrying to delete security group default-5jyhqcz1fpe7ipq on openstack. Attempt 1/5
2024-06-07 09:17:23,620 [INFO] Delete security_group default-5jyhqcz1fpe7ipq -> True on openstack.
2024-06-07 09:17:23,620 [INFO] Because you used application credentials to authenticate, no created application credentials need deletion.
2024-06-07 09:17:23,620 [INFO] Terminated all servers of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [INFO] Deleted all keypairs of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [INFO] Deleted all security groups of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [PRINT] Successfully terminated cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [INFO] Successfully handled application credential of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,622 [PRINT] --- 22 minutes and 12.8 seconds ---
Out of interest: could you also show me how many attempts paramiko needed to connect (the `Attempting to connect ...` lines)? I would like to assess whether that was actually the problem.
Your resolving problem might have to do with your cloud location. Maybe it is a connectivity issue, or maybe a post-launch service interferes (see `waitForServices`). Sadly, those services are often not well documented, because waiting for them is only necessary when the startup happens very fast. I will contact Jan, a colleague, regarding possible cloud location issues.
In the meantime, I would recommend trying the following, even though it is more of a workaround than anything: start the cluster with `-d`, connect to the failed master via SSH, and run `bibiplay -l master`, which basically executes the master setup via the Ansible playbook manually. This is now possible because the failure happens after copying all the necessary files onto the master. I apologize for the trouble.
Hey, you do not have to apologize. I am happy that you are so quick about it.
Here is the connection attempt information:
2024-06-07 11:52:17,160 [INFO] Attempting to connect to 172.17.1.109... This might take a while
2024-06-07 11:52:17,160 [INFO] Attempt 0/12. Connecting to 172.17.1.109
2024-06-07 11:52:24,190 [INFO] Waiting 4 before attempting to reconnect.
2024-06-07 11:52:24,191 [INFO] Attempt 1/12. Connecting to 172.17.1.109
2024-06-07 11:52:32,200 [INFO] Waiting 8 before attempting to reconnect.
2024-06-07 11:52:32,201 [INFO] Attempt 2/12. Connecting to 172.17.1.109
2024-06-07 11:52:48,218 [INFO] Waiting 16 before attempting to reconnect.
2024-06-07 11:52:48,219 [INFO] Attempt 3/12. Connecting to 172.17.1.109
2024-06-07 11:52:48,342 [INFO] Successfully connected to 172.17.1.109.
2024-06-07 11:52:48,342 [DEBUG] Setting up 172.17.1.109
2024-06-07 11:52:48,342 [DEBUG] Setting up filepaths for 172.17.1.109
2024-06-07 11:52:49,862 [DEBUG] Copy /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-hotdkesuzdqzjg3 to .ssh/id_ecdsa...
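For context, the retry behaviour visible in the log above (waits of 4, 8, then 16 seconds before the fourth attempt succeeds) matches simple exponential backoff. A minimal sketch of such a loop, with a hypothetical name (`connect_with_backoff` is not BiBiGrid's actual function), assuming the attempt count is what `sshTimeout` controls:

```python
import time

def connect_with_backoff(connect, max_attempts=12, base_wait=4):
    """Retry `connect`, doubling the wait (4, 8, 16, ... seconds) between attempts.

    `connect` is any zero-argument callable that raises on failure;
    the last exception is re-raised once all attempts are exhausted.
    """
    wait = base_wait
    for attempt in range(max_attempts):
        try:
            return connect()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the last error
            time.sleep(wait)
            wait *= 2  # exponential backoff between attempts
```

Because each extra attempt doubles the final wait, raising the attempt count grows the total time budget quickly, which is why a higher `sshTimeout` gives a slow-booting master much more headroom.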
From the master VM I was able to execute `bibiplay -l master`, but not without an error:
PLAY [master] *******************************************************************************************************************************************************************
TASK [Gathering Facts] **********************************************************************************************************************************************************
ok: [localhost]
TASK [bibigrid : Running 000-add-ip-routes.yml] *********************************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Collect files] *************************************************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Copy files] ****************************************************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Remove collected files] ****************************************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Disable cloud network changes after initialization] ************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Generate location specific worker userdata] ********************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Generate location specific worker userdata] ********************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Running 000-playbook-rights-server.yml] ************************************************************************************************************************
ok: [localhost] => {
"msg": "[BIBIGRID] Update permissions"
}
TASK [bibigrid : Assure existence of ansible group] *****************************************************************************************************************************
ok: [localhost]
TASK [bibigrid : Change mode of /opt/slurm directory] ***************************************************************************************************************************
ok: [localhost]
TASK [bibigrid : Running 001-apt.yml] *******************************************************************************************************************************************
ok: [localhost] => {
"msg": "[BIBIGRID] Setup common software and dependencies"
}
TASK [bibigrid : Debian based system] *******************************************************************************************************************************************
ok: [localhost] => {
"msg": "Using apt to install packages"
}
TASK [bibigrid : Disable auto-update/upgrade during ansible-run] ****************************************************************************************************************
ok: [localhost]
TASK [bibigrid : Wait for cloud-init / user-data to finish] *********************************************************************************************************************
ok: [localhost]
TASK [bibigrid : Wait for /var/lib/dpkg/lock-frontend to be released] ***********************************************************************************************************
changed: [localhost]
TASK [bibigrid : Wait for post-launch services to stop] *************************************************************************************************************************
skipping: [localhost]
TASK [bibigrid : Update] ********************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}
PLAY RECAP **********************************************************************************************************************************************************************
localhost : ok=9 changed=1 unreachable=0 failed=1 skipped=8 rescued=0 ignored=0
I am afraid you are encountering a cloud site issue. My colleague is currently looking into his own Berlin project to see if we can do anything on BiBiGrid's side to mitigate it.
You could try `bibiplay -l master -vvvvvvv` for some additional information (please share that log here), both to see whether the playbook fails at the same task and whether there is helpful additional information in Ansible's debug log.
If it is feasible for you, all three of us can also hold a Zoom session together to look at the running but failed instance and find the problem.
Let's see if your colleague is able to identify the problem.
Here is the last part of `bibiplay -l master -vvvvvvv`. It looks like there is a problem in name resolution:
<localhost> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=60 -o 'ControlPath="/home/ubuntu/.ansible/cp/a64ccf8ffb"' localhost '/bin/sh -c '"'"'sudo -H -S -n -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-ccwsqjfbwguuybwinzwhtlwgbdacpxzy ; /usr/bin/python3'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<localhost> (1, b'\n{"failed": true, "msg": "Failed to update apt cache: unknown reason", "invocation": {"module_args": {"update_cache": true, "upgrade": "yes", "state": "present", "update_cache_retries": 5, "update_cache_retry_max_delay": 12, "cache_valid_time": 0, "purge": false, "force": false, "dpkg_options": "force-confdef,force-confold", "autoremove": false, "autoclean": false, "fail_on_autoremove": false, "only_upgrade": false, "force_apt_get": false, "clean": false, "allow_unauthenticated": false, "allow_downgrade": false, "allow_change_held_packages": false, "lock_timeout": 60, "package": null, "deb": null, "default_release": null, "install_recommends": null, "policy_rc_d": null}}}\n', b"OpenSSH_8.9p1 Ubuntu-3ubuntu0.7, OpenSSL 3.0.2 15 Mar 2022\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files\r\ndebug1: /etc/ssh/ssh_config line 21: Applying options for *\r\ndebug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/home/ubuntu/.ssh/known_hosts'\r\ndebug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/home/ubuntu/.ssh/known_hosts2'\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 21057\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\nsudo: unable to resolve host bibigrid-master-hotdkesuzdqzjg3: Temporary failure in name resolution\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 1\r\n")
<localhost> Failed to connect to the host via ssh: OpenSSH_8.9p1 Ubuntu-3ubuntu0.7, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/home/ubuntu/.ssh/known_hosts'
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/home/ubuntu/.ssh/known_hosts2'
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 21057
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
sudo: unable to resolve host bibigrid-master-hotdkesuzdqzjg3: Temporary failure in name resolution
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 1
fatal: [localhost]: FAILED! => {
"changed": false,
"invocation": {
"module_args": {
"allow_change_held_packages": false,
"allow_downgrade": false,
"allow_unauthenticated": false,
"autoclean": false,
"autoremove": false,
"cache_valid_time": 0,
"clean": false,
"deb": null,
"default_release": null,
"dpkg_options": "force-confdef,force-confold",
"fail_on_autoremove": false,
"force": false,
"force_apt_get": false,
"install_recommends": null,
"lock_timeout": 60,
"only_upgrade": false,
"package": null,
"policy_rc_d": null,
"purge": false,
"state": "present",
"update_cache": true,
"update_cache_retries": 5,
"update_cache_retry_max_delay": 12,
"upgrade": "yes"
}
},
"msg": "Failed to update apt cache: unknown reason"
}
PLAY RECAP **********************************************************************************************************************************************************************
localhost : ok=9 changed=1 unreachable=0 failed=1 skipped=8 rescued=0 ignored=0
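The `Temporary failure in name resolution` errors above can be reproduced with a quick DNS sanity check on the master. This is an illustrative helper, not part of BiBiGrid:

```python
import socket

def can_resolve(hostname):
    """Return True if `hostname` resolves to an address.

    'Temporary failure in name resolution' (as seen for apt and sudo
    in the logs above) surfaces here as socket.gaierror.
    """
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        return False
```

On the broken master, `can_resolve("nova.clouds.archive.ubuntu.com")` would return `False`, while `can_resolve("localhost")` still succeeds because it is answered from `/etc/hosts` rather than DNS.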
If it is feasible for you, all three of us could also hold a Zoom session to look at the running but failed instance and find the problem. Currently, it looks like something is off with the cloud's DNS, which we can't really fix.
The mistake was on our side. I am currently fixing it and will come back to you as soon as possible.
Please try creating a new cluster using the branch `509-dns-breaks-down-hotfix`. The issue was that we removed `/etc/resolv.conf` before installing `dnsmasq` (our preferred DNS solution). For cloud-site-specific reasons this didn't cause any issue on the Bielefeld cloud, but it does on many other cloud sites.
Thank you for bringing this to our attention!
EDIT: And please report back if that fixed it.
I have tried the new branch, and the previously reported problem did not appear anymore.
After restarting, the deployment got stuck at the step "Installing Docker". So I removed the Docker installation from `resources/playbook/roles/bibigrid/tasks/main.yml`, since it is not essential for the cluster deployment, at least in my opinion.
Now the deployment gets stuck at:
REMOTE: TASK [bibigrid : Start slurm explicit after all dependencies are configured] ***
REMOTE: fatal: [localhost]: FAILED! => {"changed": false, "msg": "Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code.\nSee \"systemct
l status slurmctld.service\" and \"journalctl -xeu slurmctld.service\" for details.\n"}:
error: Parse error in file /etc/slurm/slurm.conf line 83: "SuspendExcNodes="
The problem is `/etc/slurm/slurm.conf`, where `SuspendExcNodes=` is present but no value is assigned.
The template should assign `SuspendExcNodes` a list containing all workers and the master where `on_demand: false` is set. This should always include the master. Can you send me your `bibigrid/resources/playbook/group_vars/*`? In the master file, `on_demand: false` should be set.
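To illustrate the intended template behaviour (a hypothetical helper, not BiBiGrid's actual template code): the key should be omitted entirely when no node qualifies, since a bare `SuspendExcNodes=` triggers exactly the parse error shown above.

```python
def suspend_exc_nodes_line(nodes):
    """Build the slurm.conf SuspendExcNodes entry from node definitions.

    Nodes with on_demand set to False (which should always include the
    master) are excluded from suspension. Returns an empty string when no
    node qualifies, so the key can be left out of slurm.conf instead of
    being emitted as a valueless 'SuspendExcNodes='.
    """
    static = [n["name"] for n in nodes if not n.get("on_demand", True)]
    if not static:
        return ""
    return "SuspendExcNodes=" + ",".join(static)
```

For example, a master with `on_demand: false` and two on-demand workers would yield a line naming only the master.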
The initial issue has been solved by adding more attempts to the SSH timeout. Other issues have then been discussed and later moved to chat.
Aim
Hi BiBiGrid team, I am about to set up a Slurm cluster on the Berlin node using BiBiGrid.
The configuration following the tutorial worked great until the actual start of the cluster.
Detailed behaviour
It looks as if my authentication fails, though I am able to navigate OpenStack via the command line.
Any ideas what could have gone wrong in my setup?
Best, Valentin