BiBiServ / bibigrid

BiBiGrid is a tool for an easy cluster setup inside a cloud environment.
Apache License 2.0

SSH Authentication failed during cluster deployment #508

Closed: vschnei closed this issue 2 weeks ago

vschnei commented 1 month ago

Aim

Hi BiBiGrid team, I am about to set up a Slurm cluster on the Berlin node using BiBiGrid.

The configuration following the tutorial worked great until the actual start of the cluster.

Detailed behaviour

./bibigrid.sh -i bibigrid.yml -ch -v

2024-06-06 14:47:08,763 [INFO] Action check selected
2024-06-06 14:47:08,764 [INFO] Validating config file...
2024-06-06 14:47:08,764 [INFO] Checking master/vpn
2024-06-06 14:47:08,764 [INFO] Checking master/vpn: Success
2024-06-06 14:47:08,765 [INFO] Checking servergroup: Success
2024-06-06 14:47:08,765 [INFO] Checking instance images and type
2024-06-06 14:47:10,659 [INFO] Instance masterInstance image: Ubuntu-22.04 found
2024-06-06 14:47:11,316 [INFO] Type de.NBI default has enough disk space: 0/20
2024-06-06 14:47:11,316 [INFO] Type de.NBI default has enough ram: 0/4096
2024-06-06 14:47:11,888 [INFO] Instance workerInstance image: Ubuntu-22.04 found
2024-06-06 14:47:12,571 [INFO] Type de.NBI default has enough disk space: 0/20
2024-06-06 14:47:12,571 [INFO] Type de.NBI default has enough ram: 0/4096
2024-06-06 14:47:12,572 [INFO] Checking instances: Success
2024-06-06 14:47:12,572 [INFO] Checking volumes...
2024-06-06 14:47:12,572 [INFO] Checking volumes: Success
2024-06-06 14:47:12,573 [INFO] Checking network...
2024-06-06 14:47:12,747 [INFO] Subnet 'CloudPlatform-subnet-2' found
2024-06-06 14:47:12,747 [INFO] Checking network: Success
2024-06-06 14:47:12,747 [INFO] Checking quotas
2024-06-06 14:47:12,748 [INFO] required/available
2024-06-06 14:47:12,809 [WARNING] The option [maxPersonality] has been deprecated. Please avoid using it.
2024-06-06 14:47:12,809 [WARNING] The option [maxPersonalitySize] has been deprecated. Please avoid using it.
2024-06-06 14:47:13,425 [INFO] Project openstack has enough total_cores: 4/82
2024-06-06 14:47:13,425 [WARNING] Project openstack returns no valid value for floating_ips: 1/-1 -- Ignored.
2024-06-06 14:47:13,426 [INFO] Project openstack has enough instances: 3/9
2024-06-06 14:47:13,426 [INFO] Project openstack has enough total_ram: 8192/167936
2024-06-06 14:47:13,426 [INFO] Project openstack has enough Volumes: 0/71
2024-06-06 14:47:13,426 [INFO] Project openstack has enough VolumeGigabytes: 0/10000
2024-06-06 14:47:13,426 [INFO] Project openstack has enough Snapshots: 0/9
2024-06-06 14:47:13,426 [INFO] Project openstack has enough Backups: 0/10
2024-06-06 14:47:13,427 [INFO] Project openstack has enough BackupGigabytes: 0/1000
2024-06-06 14:47:13,427 [INFO] Checking quotas: Success
2024-06-06 14:47:13,427 [INFO] Checking sshPublicKeyFiles: Success
2024-06-06 14:47:13,427 [INFO] Checking cloud specifications...
2024-06-06 14:47:13,433 [INFO] Checking validity of entire clouds.yaml and clouds-public.yaml
2024-06-06 14:47:13,438 [INFO] Checking cloudYamls: Success
2024-06-06 14:47:13,439 [INFO] Checking nfs...
2024-06-06 14:47:13,439 [INFO] Checking nfs: Success
2024-06-06 14:47:13,439 [PRINT] Total check succeeded! Cluster is ready to start.
2024-06-06 14:47:13,439 [INFO] Total check returned True.
2024-06-06 14:47:13,439 [PRINT] --- 0 minutes and 6.04 seconds ---

(bibigrid) ubuntu@slurm-setup:~/bibigrid$ ./bibigrid.sh -i bibigrid.yml -c -v
2024-06-06 14:47:41,322 [INFO] Action create selected
2024-06-06 14:47:44,030 [PRINT] Creating a new cluster takes about 10 or more minutes depending on your cloud provider and your configuration. Be patient.
2024-06-06 14:47:44,031 [INFO] Generating keypair
2024-06-06 14:47:44,713 [INFO] Generating Security Groups
2024-06-06 14:47:47,344 [INFO] Starting instance/server bibigrid-master-z50khsxphwj12fd on openstack
2024-06-06 14:48:10,823 [INFO] Ansible preparation...
2024-06-06 14:48:13,849 [INFO] Attempting to connect to 172.17.1.29... This might take a while
2024-06-06 14:48:14,852 [INFO] Attempting to connect to 172.17.1.29... This might take a while
2024-06-06 14:48:16,856 [INFO] Attempting to connect to 172.17.1.29... This might take a while
2024-06-06 14:48:21,880 [INFO] Attempting to connect to 172.17.1.29... This might take a while
2024-06-06 14:48:30,019 [ERROR] Unexpected error: 'Authentication failed: transport shut down or saw EOF' (<class 'paramiko.ssh_exception.AuthenticationException'>) Contact a developer!)
2024-06-06 14:48:30,020 [INFO] Deleting Keypair locally...
2024-06-06 14:48:30,020 [INFO] Terminating cluster z50khsxphwj12fd on cloud openstack
2024-06-06 14:48:30,818 [INFO] Deleting servers on provider openstack...
2024-06-06 14:48:30,819 [INFO] Trying to terminate Server bibigrid-master-z50khsxphwj12fd on cloud openstack.
2024-06-06 14:48:33,227 [INFO] Server bibigrid-master-z50khsxphwj12fd terminated on provider openstack.
2024-06-06 14:48:33,228 [INFO] Deleting Keypair on provider openstack...
2024-06-06 14:48:33,331 [INFO] Keypair tempKey_bibi-z50khsxphwj12fd deleted on provider openstack.
2024-06-06 14:48:33,331 [INFO] Deleting security groups on provider openstack...
2024-06-06 14:48:36,564 [INFO] Retrying to delete security group default-z50khsxphwj12fd on openstack. Attempt 1/5
2024-06-06 14:48:37,082 [INFO] Delete security_group default-z50khsxphwj12fd -> True
2024-06-06 14:48:37,083 [INFO] Because you used application credentials to authenticate, no created application credentials need deletion.
2024-06-06 14:48:37,083 [INFO] Terminated all servers of cluster z50khsxphwj12fd.
2024-06-06 14:48:37,083 [INFO] Deleted all keypairs of cluster z50khsxphwj12fd.
2024-06-06 14:48:37,083 [INFO] Deleted all security groups of cluster z50khsxphwj12fd.
2024-06-06 14:48:37,083 [PRINT] Successfully terminated cluster z50khsxphwj12fd.
2024-06-06 14:48:37,084 [INFO] Successfully handled application credential of cluster z50khsxphwj12fd.
2024-06-06 14:48:37,084 [PRINT] --- 0 minutes and 57.14 seconds ---

It looks as if my authentication fails, even though I am able to use OpenStack via the command line.

Any ideas what could have gone wrong in my setup?

best, Valentin

XaverStiensmeier commented 1 month ago

Interesting. Can you share your BiBiGrid configuration (bibigrid.yml) with me and run the create process again with the additional arguments -vv -d? -vv sets the logging level to debug; -d prevents the cluster from shutting down once an error appears and instead waits and prints out the full stack trace. Please update the outputs accordingly.

-d also allows us to connect manually to the started but failed master via `ssh -i /home/xaver/.config/bibigrid/keys/tempKey_bibi-{cluster_id} ubuntu@{master_ip}`.

This is not an OpenStack authentication issue, but an SSH connection issue.
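
If you want to reproduce that failing step outside BiBiGrid, below is a minimal sketch of roughly what the is_active check in ssh_handler.py attempts: a plain key-based login with paramiko. This is not BiBiGrid code, just a hand-written probe; the key path and master IP are placeholders you need to fill in.

# Minimal sketch: key-based SSH login with paramiko, similar to what
# BiBiGrid's ssh_handler does. KEY_PATH and MASTER_IP are placeholders.
import paramiko

KEY_PATH = "/home/xaver/.config/bibigrid/keys/tempKey_bibi-{cluster_id}"  # placeholder
MASTER_IP = "{master_ip}"  # placeholder: floating IP of the master
USERNAME = "ubuntu"

client = paramiko.SSHClient()
client.set_missing_host_key_policy(paramiko.AutoAddPolicy())
key = paramiko.ECDSAKey.from_private_key_file(KEY_PATH)  # BiBiGrid generates an ECDSA key
try:
    client.connect(hostname=MASTER_IP, username=USERNAME, pkey=key, timeout=10)
    _, stdout, _ = client.exec_command("hostname")
    print("Connected:", stdout.read().decode().strip())
finally:
    client.close()

If this fails with the same AuthenticationException, the problem is reproducible independently of BiBiGrid's retry logic.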

vschnei commented 1 month ago

Here are the logs:

./bibigrid.sh -i bibigrid.yml -c -vv -d

2024-06-06 15:23:07,771 [DEBUG] Logging verbosity set to 2
2024-06-06 15:23:07,776 [DEBUG] File clouds.yaml found in folder ~/.config/bibigrid.
2024-06-06 15:23:07,778 [DEBUG] File clouds-public.yaml not found in folder ~/.config/bibigrid.
2024-06-06 15:23:07,778 [DEBUG] File clouds-public.yaml not found in folder /etc/bibigrid.
2024-06-06 15:23:07,779 [DEBUG] File clouds-public.yaml not found in folder .
2024-06-06 15:23:07,779 [DEBUG] Loaded clouds.yml and clouds_public.yml
2024-06-06 15:23:07,779 [DEBUG] Using only clouds.yaml since no clouds-public profile is set.
2024-06-06 15:23:09,253 [INFO] Action create selected
2024-06-06 15:23:11,856 [DEBUG] Cluster-ID: 3k2ifh8oi96cwe3
2024-06-06 15:23:11,856 [DEBUG] Keyname: tempKey_bibi-3k2ifh8oi96cwe3
2024-06-06 15:23:11,856 [PRINT] Creating a new cluster takes about 10 or more minutes depending on your cloud provider and your configuration. Be patient.
2024-06-06 15:23:11,856 [INFO] Generating keypair
2024-06-06 15:23:11,873 [DEBUG] Generating public/private ecdsa key pair.
Your identification has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-3k2ifh8oi96cwe3
Your public key has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-3k2ifh8oi96cwe3.pub
The key fingerprint is:
SHA256:n3TmYaI/vMvphRz++VQ6z/Oq83dqPnX9hcJ1vFYDV0I ubuntu@slurm-setup
The key's randomart image is:
+---[ECDSA 256]---+
|             .E o|
|             . o |
|              o. |
|              ..+|
|        S +.=. o*|
|         * Xo..==|
|        ..* o.=.+|
|         oo+.oo=+|
|         .B++B*=B|
+----[SHA256]-----+

2024-06-06 15:23:11,977 [DEBUG] No network found. Getting network by subnet.
2024-06-06 15:23:12,103 [DEBUG] Getting subnets by network.
2024-06-06 15:23:12,537 [INFO] Generating Security Groups
2024-06-06 15:23:15,119 [INFO] Starting instance/server bibigrid-master-3k2ifh8oi96cwe3 on openstack
2024-06-06 15:23:37,445 [INFO] Ansible preparation...
2024-06-06 15:23:40,474 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:41,477 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:43,480 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:48,504 [INFO] Attempting to connect to 172.17.1.50... This might take a while
2024-06-06 15:23:56,696 [ERROR] Traceback (most recent call last):
  File "/home/ubuntu/bibigrid/bibigrid/core/actions/create.py", line 378, in create
    self.initialize_instances()
  File "/home/ubuntu/bibigrid/bibigrid/core/actions/create.py", line 236, in initialize_instances
    ssh_handler.ansible_preparation(floating_ip=configuration["floating_ip"],
  File "/home/ubuntu/bibigrid/bibigrid/core/utility/handler/ssh_handler.py", line 207, in ansible_preparation
    execute_ssh(floating_ip, private_key, username, log, gateway, commands, filepaths)
  File "/home/ubuntu/bibigrid/bibigrid/core/utility/handler/ssh_handler.py", line 227, in execute_ssh
    is_active(client=client, floating_ip_address=floating_ip, username=username, private_key=paramiko_key,
  File "/home/ubuntu/bibigrid/bibigrid/core/utility/handler/ssh_handler.py", line 112, in is_active
    client.connect(hostname=gateway.get("ip") or floating_ip_address, username=username, pkey=private_key,
  File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/client.py", line 485, in connect
    self._auth(
  File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/client.py", line 818, in _auth
    raise saved_exception
  File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/client.py", line 716, in _auth
    self._transport.auth_publickey(username, pkey)
  File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/transport.py", line 1674, in auth_publickey
    return self.auth_handler.wait_for_response(my_event)
  File "/home/ubuntu/.venv/bibigrid/lib/python3.8/site-packages/paramiko/auth_handler.py", line 248, in wait_for_response
    raise e
paramiko.ssh_exception.AuthenticationException: Authentication failed: transport shut down or saw EOF

2024-06-06 15:23:56,696 [ERROR] Unexpected error: 'Authentication failed: transport shut down or saw EOF' (<class 'paramiko.ssh_exception.AuthenticationException'>) Contact a developer!)
DEBUG MODE: Any non-empty input to shutdown cluster 3k2ifh8oi96cwe3. Empty input to exit with cluster still alive:
2024-06-06 15:25:37,512 [PRINT] --- 2 minutes and 29.74 seconds ---

I was able to log into the master VM with `ssh -i ~/.config/bibigrid/keys/tempKey_bibi-3k2ifh8oi96cwe3 ubuntu@{IP}`.

vschnei commented 1 month ago
# See https://cloud.denbi.de/wiki/Tutorials/BiBiGrid/ (after update)
  # First configuration will be used for general cluster information and must include the master.
  # All other configurations mustn't include another master, but exactly one vpnWorker instead (keys like master).

- infrastructure: openstack # former mode.
  cloud: openstack # name of clouds.yaml entry

  # -- BEGIN: GENERAL CLUSTER INFORMATION --
  ## sshPublicKeyFiles listed here will be added to access the cluster. A temporary key is created by bibigrid itself.
  #sshPublicKeyFiles:
  #  -  [add path to your ssh key]

  ## Volumes and snapshots that will be mounted to master
  #masterMounts:
  #  - [mount one]
  ## Uncomment if you don't want assign a public ip to the master; for internal cluster
  # useMasterWithPublicIp: False

  # Other keys
  # localFS: False
  # localDNSlookup: False
  zabbix: True
  nfs: True
  ide: True

  useMasterAsCompute: False

  # master configuration
  masterInstance:
    type: de.NBI default
    image: Ubuntu-22.04
  # -- END: GENERAL CLUSTER INFORMATION --

  # worker configuration
  workerInstances:
    - type: de.NBI default
      image: Ubuntu-22.04
      count: 2

  # Depends on cloud image
  sshUser: ubuntu

  # Depends on your project and cloud site
  subnet: CloudPlatform-subnet-2

  # Depends on the services that run on server start at your cloud site
  # Suspend worker nodes after 9 hours being idle
  slurmConf:
    elastic_scheduling:
      SuspendTime: 32400

XaverStiensmeier commented 1 month ago

Mhm, your configuration looks fine.

I am pretty sure it is not the timeout, but an error that occurs once the master is ready for the SSH connection yet for some reason does not accept it. Can you check whether the problem persists if you use BiBiGrid's current dev branch (git checkout dev) and set sshTimeout: 12 in your bibigrid.yml?

As I said earlier, I don't think the timeout is the issue, but it is better to confirm. Also, use -vv again, because we have increased the amount of logging during the SSH connection and that might include helpful information.
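
For context: BiBiGrid retries the connection with a growing wait between attempts, and sshTimeout roughly controls how many attempts are made. A simplified sketch of that behaviour (not the actual ssh_handler code) looks like this:

# Simplified sketch of the retry behaviour, assuming sshTimeout maps to the
# number of attempts and that waits double between attempts (4s, 8s, 16s, ...).
import time
import paramiko

def connect_with_retries(client, ip, username, key, attempts=12):
    wait = 4  # seconds
    for attempt in range(attempts):
        try:
            client.connect(hostname=ip, username=username, pkey=key, timeout=7)
            return True
        except (paramiko.ssh_exception.NoValidConnectionsError,
                paramiko.ssh_exception.SSHException) as exc:
            print(f"Attempt {attempt}/{attempts} failed: {exc}. Waiting {wait}s.")
            time.sleep(wait)
            wait *= 2
    return False

With doubling waits, 12 attempts give the master substantially more time to become reachable than a handful of quick retries.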

vschnei commented 1 month ago

Switching to the dev branch did not improve anything.

See logs: ./bibigrid.sh -i bibigrid.yml -c -vv

2024-06-06 16:21:07,273 [DEBUG] Logging verbosity set to 2
2024-06-06 16:21:07,278 [DEBUG] File clouds.yaml found in folder ~/.config/bibigrid.
2024-06-06 16:21:07,281 [DEBUG] File clouds-public.yaml not found in folder ~/.config/bibigrid.
2024-06-06 16:21:07,281 [DEBUG] File clouds-public.yaml not found in folder /etc/bibigrid.
2024-06-06 16:21:07,281 [DEBUG] File clouds-public.yaml not found in folder .
2024-06-06 16:21:07,282 [DEBUG] Loaded clouds.yml and clouds_public.yml
2024-06-06 16:21:07,282 [DEBUG] Using only clouds.yaml since no clouds-public profile is set.
2024-06-06 16:21:08,703 [INFO] Action create selected
2024-06-06 16:21:11,483 [DEBUG] Cluster-ID: 4v723t5qa23xyci
2024-06-06 16:21:11,483 [DEBUG] Keyname: tempKey_bibi-4v723t5qa23xyci
2024-06-06 16:21:11,483 [PRINT] Creating a new cluster takes about 10 or more minutes depending on your cloud provider and your configuration. Please be patient.
2024-06-06 16:21:11,483 [INFO] Generating keypair
2024-06-06 16:21:11,501 [DEBUG] Generating public/private ecdsa key pair.
Your identification has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-4v723t5qa23xyci
Your public key has been saved in /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-4v723t5qa23xyci.pub
The key fingerprint is:
SHA256:xR2qHuDac4JQQkzdO7JA8FSKlA1u+HkDZ1RBegF7Iw0 ubuntu@slurm-setup
The key's randomart image is:
+---[ECDSA 256]---+
|.*BoE==.    .    |
|+*o+.=.. . o .   |
|o+= B *.  + .    |
|...B.=oo o       |
|  +.oo..S        |
|   o.= . .       |
|    o + o        |
|       +         |
|                 |
+----[SHA256]-----+

2024-06-06 16:21:11,603 [DEBUG] No network found. Getting network by subnet.
2024-06-06 16:21:11,733 [DEBUG] Getting subnets by network.
2024-06-06 16:21:12,153 [DEBUG] Creating default files
2024-06-06 16:21:12,153 [DEBUG] Copying ansible.cfg
2024-06-06 16:21:12,154 [DEBUG] Copying slurm.conf
2024-06-06 16:21:12,155 [INFO] Generating Security Groups
2024-06-06 16:21:13,742 [DEBUG] Writing yaml /home/ubuntu/bibigrid/resources/playbook/vars/hosts.yml
2024-06-06 16:21:13,742 [ERROR] Tried to access resource files but couldn't. No such file or directory: [Errno 2] No such file or directory: '/home/ubuntu/bibigrid/resources/playbook/vars/hosts.yml'
2024-06-06 16:21:13,743 [INFO] Deleting Keypair locally...
2024-06-06 16:21:13,743 [INFO] Terminating cluster 4v723t5qa23xyci on cloud openstack
2024-06-06 16:21:14,527 [INFO] Deleting servers on provider openstack...
2024-06-06 16:21:14,528 [INFO] Deleting Keypair on provider openstack...
2024-06-06 16:21:14,596 [INFO] Keypair tempKey_bibi-4v723t5qa23xyci deleted on provider openstack.
2024-06-06 16:21:14,596 [INFO] Deleting security groups on provider openstack...
2024-06-06 16:21:15,309 [INFO] Delete security_group default-4v723t5qa23xyci -> True on openstack.
2024-06-06 16:21:15,310 [INFO] Because you used application credentials to authenticate, no created application credentials need deletion.
2024-06-06 16:21:15,310 [WARNING] Unable to find any servers for cluster-id 4v723t5qa23xyci. Check cluster-id and configuration.
All keys deleted: True
2024-06-06 16:21:15,310 [PRINT] --- 0 minutes and 8.03 seconds ---

XaverStiensmeier commented 1 month ago

Can you git pull and retry? The dev error is fixed now.

vschnei commented 1 month ago

Hi, your changes in the dev branch improved the deployment, but there is still a problem. I am only showing the bottom of the logs.

2024-06-07 09:17:02,802 [DEBUG] REMOTE: TASK [bibigrid : Disable and Stop systemd-resolve] *****************************
2024-06-07 09:17:04,650 [DEBUG] REMOTE: changed: [localhost]
2024-06-07 09:17:04,672 [DEBUG] REMOTE: 
2024-06-07 09:17:04,673 [DEBUG] REMOTE: TASK [bibigrid : Remove /etc/resolv.conf] **************************************
2024-06-07 09:17:05,148 [DEBUG] REMOTE: changed: [localhost]
2024-06-07 09:17:05,166 [DEBUG] REMOTE
2024-06-07 09:17:05,167 [DEBUG] REMOTE: TASK [bibigrid : Install dnsmasq] **********************************************
2024-06-07 09:17:14,869 [DEBUG] REMOTE: fatal: [localhost]: FAILED! => {"cache_update_time": 1717744608, "cache_updated": false, "changed": false, "msg": "'/usr/bin/apt-get -y -o \"Dpkg::Options::=--force-confdef\" -o \"Dpkg::Options::=--force-confold\"       install 'dnsmasq=2.90-0ubuntu0.22.04.1'' failed: E: Failed to fetch http://nova.clouds.archive.ubuntu.com/ubuntu/pool/main/d/dns-root-data/dns-root-data_2023112702%7eubuntu0.22.04.1_all.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/d/dnsmasq/dnsmasq-base_2.90-0ubuntu0.22.04.1_amd64.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/d/dnsmasq/dnsmasq_2.90-0ubuntu0.22.04.1_all.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?\n", "rc": 100, "stderr": "E: Failed to fetch http://nova.clouds.archive.ubuntu.com/ubuntu/pool/main/d/dns-root-data/dns-root-data_2023112702%7eubuntu0.22.04.1_all.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/d/dnsmasq/dnsmasq-base_2.90-0ubuntu0.22.04.1_amd64.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/d/dnsmasq/dnsmasq_2.90-0ubuntu0.22.04.1_all.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nE: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?\n", "stderr_lines": ["E: Failed to fetch http://nova.clouds.archive.ubuntu.com/ubuntu/pool/main/d/dns-root-data/dns-root-data_2023112702%7eubuntu0.22.04.1_all.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/main/d/dnsmasq/dnsmasq-base_2.90-0ubuntu0.22.04.1_amd64.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "E: Failed to fetch http://security.ubuntu.com/ubuntu/pool/universe/d/dnsmasq/dnsmasq_2.90-0ubuntu0.22.04.1_all.deb  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?"], "stdout": "Reading package lists...\nBuilding dependency tree...\nReading state information...\nThe following additional packages will be installed:\n  dns-root-data dnsmasq-base\nSuggested packages:\n  resolvconf\nThe following NEW packages will be installed:\n  dns-root-data dnsmasq dnsmasq-base\n0 upgraded, 3 newly installed, 0 to remove and 3 not upgraded.\nNeed to get 399 kB of archives.\nAfter this operation, 1025 kB of additional disk space will be used.\nIgn:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nIgn:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nIgn:1 http://nova.clouds.archive.ubuntu.com/ubuntu 
jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nErr:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1\n  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nIgn:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\nIgn:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\nErr:2 http://security.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1\n  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\nErr:3 http://security.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1\n  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'\n", "stdout_lines": ["Reading package lists...", "Building dependency tree...", "Reading state information...", "The following additional packages will be installed:", "  dns-root-data dnsmasq-base", "Suggested packages:", "  resolvconf", "The following NEW packages will be installed:", "  dns-root-data dnsmasq dnsmasq-base", "0 upgraded, 3 newly installed, 0 to remove and 3 not upgraded.", "Need to get 399 kB of archives.", "After this operation, 1025 kB of additional disk space will be used.", "Ign:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Ign:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Ign:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Err:1 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dns-root-data all 2023112702~ubuntu0.22.04.1", "  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "Ign:2 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "Ign:3 http://nova.clouds.archive.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "Err:2 http://security.ubuntu.com/ubuntu jammy-updates/main amd64 dnsmasq-base amd64 2.90-0ubuntu0.22.04.1", "  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'", "Err:3 http://security.ubuntu.com/ubuntu jammy-updates/universe amd64 dnsmasq all 2.90-0ubuntu0.22.04.1", "  Temporary failure resolving 'nova.clouds.archive.ubuntu.com'"]}
2024-06-07 09:17:14,873 [DEBUG] REMOTE: 
2024-06-07 09:17:14,873 [DEBUG] REMOTE: PLAY RECAP *********************************************************************
2024-06-07 09:17:14,873 [DEBUG] REMOTE: localhost                  : ok=18   changed=12   unreachable=0    failed=1    skipped=18   rescued=0    ignored=0
2024-06-07 09:17:14,874 [DEBUG] REMOTE: 
2024-06-07 09:17:15,058 [WARNING] Execute ansible playbook. Be patient. ... Exit status: 2
2024-06-07 09:17:15,058 [ERROR] Execution of cmd on remote host fails: Execute ansible playbook. Be patient. ... Exit status: 2
2024-06-07 09:17:15,059 [INFO] Deleting Keypair locally...
2024-06-07 09:17:15,059 [INFO] Terminating cluster 5jyhqcz1fpe7ipq on cloud openstack
2024-06-07 09:17:16,472 [INFO] Deleting servers on provider openstack...
2024-06-07 09:17:16,474 [INFO] Trying to terminate Server bibigrid-master-5jyhqcz1fpe7ipq on cloud openstack.
2024-06-07 09:17:19,319 [INFO] Server bibigrid-master-5jyhqcz1fpe7ipq terminated on provider openstack.
2024-06-07 09:17:19,319 [INFO] Deleting Keypair on provider openstack...
2024-06-07 09:17:19,422 [INFO] Keypair tempKey_bibi-5jyhqcz1fpe7ipq deleted on provider openstack.
2024-06-07 09:17:19,423 [INFO] Deleting security groups on provider openstack...
2024-06-07 09:17:22,909 [INFO] Retrying to delete security group default-5jyhqcz1fpe7ipq on openstack. Attempt 1/5
2024-06-07 09:17:23,620 [INFO] Delete security_group default-5jyhqcz1fpe7ipq -> True on openstack.
2024-06-07 09:17:23,620 [INFO] Because you used application credentials to authenticate, no created application credentials need deletion.
2024-06-07 09:17:23,620 [INFO] Terminated all servers of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [INFO] Deleted all keypairs of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [INFO] Deleted all security groups of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [PRINT] Successfully terminated cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,621 [INFO] Successfully handled application credential of cluster 5jyhqcz1fpe7ipq.
2024-06-07 09:17:23,622 [PRINT] --- 22 minutes and 12.8 seconds ---

XaverStiensmeier commented 1 month ago

Out of interest: could you also show me how many attempts paramiko needed to connect (the "Attempting to connect ..." lines)? I would like to assess whether that was actually the problem.

Your name resolution problem might have to do with your cloud site. Maybe it is a connectivity issue, or maybe a post-launch service interferes (see waitForServices). Sadly, those services are often not well documented, because waiting for them is only necessary when the startup happens very fast. I will contact Jan, a colleague, regarding possible cloud site issues.

I would recommend trying the following in the meantime, even though it is more of a workaround than a fix:

  1. Start the cluster again but with -d.
  2. When it fails, wait a little, then log into the master instance and execute bibiplay -l master, which basically runs the master setup via the Ansible playbook manually. This is now possible because the failure happens after all the necessary files have been copied onto the master.
  3. Please report back whether this runs through without issues.

I apologize for the trouble.

vschnei commented 1 month ago

Hey, you do not have to apologize. I am happy that you are so quick about it.

Here is the connection attempt information:

2024-06-07 11:52:17,160 [INFO] Attempting to connect to 172.17.1.109... This might take a while
2024-06-07 11:52:17,160 [INFO] Attempt 0/12. Connecting to 172.17.1.109
2024-06-07 11:52:24,190 [INFO] Waiting 4 before attempting to reconnect.
2024-06-07 11:52:24,191 [INFO] Attempt 1/12. Connecting to 172.17.1.109
2024-06-07 11:52:32,200 [INFO] Waiting 8 before attempting to reconnect.
2024-06-07 11:52:32,201 [INFO] Attempt 2/12. Connecting to 172.17.1.109
2024-06-07 11:52:48,218 [INFO] Waiting 16 before attempting to reconnect.
2024-06-07 11:52:48,219 [INFO] Attempt 3/12. Connecting to 172.17.1.109
2024-06-07 11:52:48,342 [INFO] Successfully connected to 172.17.1.109.
2024-06-07 11:52:48,342 [DEBUG] Setting up 172.17.1.109
2024-06-07 11:52:48,342 [DEBUG] Setting up filepaths for 172.17.1.109
2024-06-07 11:52:49,862 [DEBUG] Copy /home/ubuntu/.config/bibigrid/keys/tempKey_bibi-hotdkesuzdqzjg3 to .ssh/id_ecdsa...

From the master VM I was able to execute bibiplay -l master, but not without an error:

PLAY [master] *******************************************************************************************************************************************************************

TASK [Gathering Facts] **********************************************************************************************************************************************************
ok: [localhost]

TASK [bibigrid : Running 000-add-ip-routes.yml] *********************************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Collect files] *************************************************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Copy files] ****************************************************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Remove collected files] ****************************************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Disable cloud network changes after initialization] ************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Generate location specific worker userdata] ********************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Generate location specific worker userdata] ********************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Running 000-playbook-rights-server.yml] ************************************************************************************************************************
ok: [localhost] => {
    "msg": "[BIBIGRID] Update permissions"
}

TASK [bibigrid : Assure existence of ansible group] *****************************************************************************************************************************
ok: [localhost]

TASK [bibigrid : Change mode of /opt/slurm directory] ***************************************************************************************************************************
ok: [localhost]

TASK [bibigrid : Running 001-apt.yml] *******************************************************************************************************************************************
ok: [localhost] => {
    "msg": "[BIBIGRID] Setup common software and dependencies"
}

TASK [bibigrid : Debian based system] *******************************************************************************************************************************************
ok: [localhost] => {
    "msg": "Using apt to install packages"
}

TASK [bibigrid : Disable auto-update/upgrade during ansible-run] ****************************************************************************************************************
ok: [localhost]

TASK [bibigrid : Wait for cloud-init / user-data to finish] *********************************************************************************************************************
ok: [localhost]

TASK [bibigrid : Wait for /var/lib/dpkg/lock-frontend to be released] ***********************************************************************************************************
changed: [localhost]

TASK [bibigrid : Wait for post-launch services to stop] *************************************************************************************************************************
skipping: [localhost]

TASK [bibigrid : Update] ********************************************************************************************************************************************************
fatal: [localhost]: FAILED! => {"changed": false, "msg": "Failed to update apt cache: unknown reason"}

PLAY RECAP **********************************************************************************************************************************************************************
localhost                  : ok=9    changed=1    unreachable=0    failed=1    skipped=8    rescued=0    ignored=0   

XaverStiensmeier commented 1 month ago

I am afraid you are encountering a cloud site issue. My colleague is currently looking into his own Berlin project to see if we can do anything on BiBiGrid's side to mitigate it. You could try bibiplay -l master -vvvvvvv for some additional information (please share that log here), both to see whether the playbook fails at the same task and whether Ansible's debug log contains anything helpful.

If it is feasible for you, all three of us can also hold a Zoom session together to look at the running but failed instance and find the problem.

vschnei commented 1 month ago

Let's see if your colleague is able to identify the problem.

Here is the last part of bibiplay -l master -vvvvvvv.

It looks like there is a problem with name resolution.

<localhost> SSH: EXEC ssh -vvv -o ControlMaster=auto -o ControlPersist=60s -o StrictHostKeyChecking=no -o KbdInteractiveAuthentication=no -o PreferredAuthentications=gssapi-with-mic,gssapi-keyex,hostbased,publickey -o PasswordAuthentication=no -o 'User="ubuntu"' -o ConnectTimeout=60 -o 'ControlPath="/home/ubuntu/.ansible/cp/a64ccf8ffb"' localhost '/bin/sh -c '"'"'sudo -H -S -n  -u root /bin/sh -c '"'"'"'"'"'"'"'"'echo BECOME-SUCCESS-ccwsqjfbwguuybwinzwhtlwgbdacpxzy ; /usr/bin/python3'"'"'"'"'"'"'"'"' && sleep 0'"'"''
Escalation succeeded
<localhost> (1, b'\n{"failed": true, "msg": "Failed to update apt cache: unknown reason", "invocation": {"module_args": {"update_cache": true, "upgrade": "yes", "state": "present", "update_cache_retries": 5, "update_cache_retry_max_delay": 12, "cache_valid_time": 0, "purge": false, "force": false, "dpkg_options": "force-confdef,force-confold", "autoremove": false, "autoclean": false, "fail_on_autoremove": false, "only_upgrade": false, "force_apt_get": false, "clean": false, "allow_unauthenticated": false, "allow_downgrade": false, "allow_change_held_packages": false, "lock_timeout": 60, "package": null, "deb": null, "default_release": null, "install_recommends": null, "policy_rc_d": null}}}\n', b"OpenSSH_8.9p1 Ubuntu-3ubuntu0.7, OpenSSL 3.0.2 15 Mar 2022\r\ndebug1: Reading configuration data /etc/ssh/ssh_config\r\ndebug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files\r\ndebug1: /etc/ssh/ssh_config line 21: Applying options for *\r\ndebug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/home/ubuntu/.ssh/known_hosts'\r\ndebug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/home/ubuntu/.ssh/known_hosts2'\r\ndebug1: auto-mux: Trying existing master\r\ndebug2: fd 3 setting O_NONBLOCK\r\ndebug2: mux_client_hello_exchange: master version 4\r\ndebug3: mux_client_forwards: request forwardings: 0 local, 0 remote\r\ndebug3: mux_client_request_session: entering\r\ndebug3: mux_client_request_alive: entering\r\ndebug3: mux_client_request_alive: done pid = 21057\r\ndebug3: mux_client_request_session: session request sent\r\ndebug1: mux_client_request_session: master session id: 2\r\nsudo: unable to resolve host bibigrid-master-hotdkesuzdqzjg3: Temporary failure in name resolution\ndebug3: mux_client_read_packet: read header failed: Broken pipe\r\ndebug2: Received exit status from master 1\r\n")
<localhost> Failed to connect to the host via ssh: OpenSSH_8.9p1 Ubuntu-3ubuntu0.7, OpenSSL 3.0.2 15 Mar 2022
debug1: Reading configuration data /etc/ssh/ssh_config
debug1: /etc/ssh/ssh_config line 19: include /etc/ssh/ssh_config.d/*.conf matched no files
debug1: /etc/ssh/ssh_config line 21: Applying options for *
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts' -> '/home/ubuntu/.ssh/known_hosts'
debug3: expanded UserKnownHostsFile '~/.ssh/known_hosts2' -> '/home/ubuntu/.ssh/known_hosts2'
debug1: auto-mux: Trying existing master
debug2: fd 3 setting O_NONBLOCK
debug2: mux_client_hello_exchange: master version 4
debug3: mux_client_forwards: request forwardings: 0 local, 0 remote
debug3: mux_client_request_session: entering
debug3: mux_client_request_alive: entering
debug3: mux_client_request_alive: done pid = 21057
debug3: mux_client_request_session: session request sent
debug1: mux_client_request_session: master session id: 2
sudo: unable to resolve host bibigrid-master-hotdkesuzdqzjg3: Temporary failure in name resolution
debug3: mux_client_read_packet: read header failed: Broken pipe
debug2: Received exit status from master 1
fatal: [localhost]: FAILED! => {
    "changed": false,
    "invocation": {
        "module_args": {
            "allow_change_held_packages": false,
            "allow_downgrade": false,
            "allow_unauthenticated": false,
            "autoclean": false,
            "autoremove": false,
            "cache_valid_time": 0,
            "clean": false,
            "deb": null,
            "default_release": null,
            "dpkg_options": "force-confdef,force-confold",
            "fail_on_autoremove": false,
            "force": false,
            "force_apt_get": false,
            "install_recommends": null,
            "lock_timeout": 60,
            "only_upgrade": false,
            "package": null,
            "policy_rc_d": null,
            "purge": false,
            "state": "present",
            "update_cache": true,
            "update_cache_retries": 5,
            "update_cache_retry_max_delay": 12,
            "upgrade": "yes"
        }
    },
    "msg": "Failed to update apt cache: unknown reason"
}

PLAY RECAP **********************************************************************************************************************************************************************
localhost                  : ok=9    changed=1    unreachable=0    failed=1    skipped=8    rescued=0    ignored=0   

XaverStiensmeier commented 1 month ago

If it is feasible for you, all three of us can also hold a Zoom session together to look at the running but failed instance and find the problem. Currently, it looks like something is off with the cloud's DNS, which we can't really fix.

The mistake was on our side. I am currently fixing it and will come back to you as soon as possible.

XaverStiensmeier commented 1 month ago

Please try creating a new cluster using the branch 509-dns-breaks-down-hotfix. The issue was that we removed /etc/resolv.conf before installing dnsmasq (our preferred DNS solution). For cloud-site-specific reasons this didn't cause any issue on the Bielefeld cloud, but it does on many other cloud sites.

Thank you for bringing this to our attention!

EDIT: And please report back if that fixed it.

vschnei commented 1 month ago

I have tried the new branch and the previously reported problem did not appear anymore. After restarting, the deployment got stuck at the step "Installing Docker", so I removed the Docker installation from resources/playbook/roles/bibigrid/tasks/main.yml, since it is not essential for the cluster deployment, at least in my opinion.

Now the deployment gets stuck at:

REMOTE: TASK [bibigrid : Start slurm explicit after all dependencies are configured] ***
REMOTE: fatal: [localhost]: FAILED! => {"changed": false, "msg": "Unable to start service slurmctld: Job for slurmctld.service failed because the control process exited with error code.\nSee \"systemct
l status slurmctld.service\" and \"journalctl -xeu slurmctld.service\" for details.\n"}: 

error: Parse error in file /etc/slurm/slurm.conf line 83: "SuspendExcNodes="

The problem is in /etc/slurm/slurm.conf, where SuspendExcNodes= is present but no value is assigned.

XaverStiensmeier commented 1 month ago

The template should assign to SuspendExcNodes a list containing all workers and the master where on_demand: false is set; this should always include the master. Can you send me your bibigrid/resources/playbook/group_vars/*? In the master file, on_demand: false should be set.
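
For illustration, here is a rough sketch (not the actual template logic) of how that value is meant to be assembled and why an empty value breaks parsing; the node data below is made up:

# Rough sketch: collect all nodes with on_demand: false and join them into
# the SuspendExcNodes value. The node dictionary is made up for illustration.
nodes = {
    "bibigrid-master-3k2ifh8oi96cwe3": {"on_demand": False},
    "bibigrid-worker-0": {"on_demand": True},
    "bibigrid-worker-1": {"on_demand": True},
}

exc_nodes = [name for name, conf in nodes.items() if not conf.get("on_demand", True)]
if exc_nodes:
    line = f"SuspendExcNodes={','.join(exc_nodes)}"
else:
    # Rendering a bare "SuspendExcNodes=" is exactly what makes slurmctld fail
    # to parse slurm.conf, so the key should be omitted or commented out instead.
    line = "#SuspendExcNodes="

print(line)  # e.g. SuspendExcNodes=bibigrid-master-3k2ifh8oi96cwe3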

The initial issue has been solved by increasing the number of SSH connection attempts. Other issues were then discussed and later moved to chat.