clusterinthecloud / ansible

Ansible config for Cluster in the Cloud
https://cluster-in-the-cloud.readthedocs.io
MIT License

OpenStack: Compute node image creation uses project default security group #147

Open jcwomack opened 4 months ago

jcwomack commented 4 months ago

When trying to build a compute node image for a CitC instance deployed in a new project on the OpenStack of the Bristol Digital Labs prototype system, I found that Packer was unable to complete the image build:

[citc@mgmt ~]$ sudo /usr/local/bin/run-packer
openstack.openstack: output will be in this color.

==> openstack.openstack: Loading flavor: m1.small
    openstack.openstack: Verified flavor. ID: 1d816549-b9ad-47d4-9139-218bfc22681f
==> openstack.openstack: Creating temporary keypair: packer_6619b72d-0884-3bb9-1dbc-800d6e2ddb50 ...
==> openstack.openstack: Created temporary keypair: packer_6619b72d-0884-3bb9-1dbc-800d6e2ddb50
    openstack.openstack: Found Image ID: b9e4cc7a-ed29-4a15-807b-dc80cdbd9983
==> openstack.openstack: Launching server...
==> openstack.openstack: Launching server...
    openstack.openstack: Server ID: c8d3dfc3-6a36-4b4f-b2ea-d93ee3c0bf88
==> openstack.openstack: Waiting for server to become ready...
    openstack.openstack: Floating IP not required
==> openstack.openstack: Using SSH communicator to connect: 10.0.1.96
==> openstack.openstack: Waiting for SSH to become available...
==> openstack.openstack: Timeout waiting for SSH.
==> openstack.openstack: Terminating the source server: c8d3dfc3-6a36-4b4f-b2ea-d93ee3c0bf88 ...
==> openstack.openstack: Deleting temporary keypair: packer_6619b72d-0884-3bb9-1dbc-800d6e2ddb50 ...
Build 'openstack.openstack' errored after 5 minutes 13 seconds: Timeout waiting for SSH.

The instance was created, but Packer could not connect to it over SSH, so the image build failed.

Inspecting the OpenStack instance details, I found that the instance was created with the "default" security group:

% openstack server show c8ecea93-12c3-4c3e-bcb5-c02f37f9da6c -c hostname -c flavor -c image -c security_groups
+-----------------+--------------------------------------------------+
| Field           | Value                                            |
+-----------------+--------------------------------------------------+
| flavor          | m1.small (m1.small)                              |
| hostname        | packer-one-airedale-v1712962160                  |
| image           | Rocky-8.8 (b9e4cc7a-ed29-4a15-807b-dc80cdbd9983) |
| security_groups | name='default'                                   |
|                 | name='default'                                   |
+-----------------+--------------------------------------------------+

On inspection, I found that the default security group for this project did not have a rule allowing SSH ingress from the mgmt instance. In general, we cannot rely on such a rule being present.

I believe the reason why this problem has not arisen before on the Bristol Digital Labs prototype systems is that previous CitC deployments have been in a project where the default security group had a rule added which allowed ingress from any IP on 22/TCP.

To work around this issue, I modified the local clone of this repository so that the image build instance uses the cluster-one-airedale security group created for this CitC instance.

[root@mgmt citc]# git -C /root/citc-ansible/ diff
diff --git a/roles/packer/files/all.pkr.hcl b/roles/packer/files/all.pkr.hcl
index 9c6960f..b81671e 100644
--- a/roles/packer/files/all.pkr.hcl
+++ b/roles/packer/files/all.pkr.hcl
@@ -117,6 +117,9 @@ source "openstack" "openstack" {
     source_image_name = "Rocky-8.8"
     ssh_username = var.ssh_username
     networks = [var.openstack_network, var.openstack_ceph_network]
+    security_groups = ["cluster-one-airedale"]
     image_tags = ["compute"]
     metadata = {"cluster": var.cluster}
}

After re-running Ansible for the mgmt instance, I was able to successfully run Packer and build a new compute node image.

I think that this change could be implemented more generally by modifying roles/packer/files/all.pkr.hcl in this repo to use security groups specified as variables in roles/packer/templates/variables.pkrvars.hcl.j2.
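
A minimal sketch of what that could look like, not a tested patch. The variable name openstack_security_groups and the Ansible variable citc_security_group are hypothetical names chosen for illustration; only the security_groups option of the Packer OpenStack builder and the two file paths come from the above. The fragment shows only the lines that would be added, not a complete config:

```hcl
# roles/packer/files/all.pkr.hcl (fragment)

# Hypothetical variable: names of the security groups to attach to the
# temporary build instance. With an empty list, the builder is assumed to
# fall back to the project's default group, as it does today.
variable "openstack_security_groups" {
  type    = list(string)
  default = []
}

# Inside the existing source "openstack" "openstack" block, replacing the
# hard-coded list from the workaround above:
#
#     security_groups = var.openstack_security_groups

# roles/packer/templates/variables.pkrvars.hcl.j2 (fragment)
#
# citc_security_group is a placeholder for whichever Ansible variable holds
# the name of the security group created for the cluster:
#
#     openstack_security_groups = ["{{ citc_security_group }}"]
```

Keeping the value in the templated variables file would avoid hard-coding a cluster-specific name such as cluster-one-airedale in the shared Packer config.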

Possibly also of interest: the Packer build in this case seems to connect over SSH using an IP on the Ceph network rather than the cluster network. Security groups specified in the Packer config appear to be applied to all ports on the instance, so this does not prevent the image build from succeeding.