coreos / fedora-coreos-pipeline

Build pipeline for Fedora CoreOS
https://jenkins-fedora-coreos.apps.ocp.ci.centos.org/

NetworkNotFound error preventing server creation in VexxHost OpenStack #1061

Open marmijo opened 1 week ago

marmijo commented 1 week ago

Description

A NetworkNotFound error was seen in the FCOS OpenStack instance on VexxHost today. The error caused all server creation in OpenStack to fail, whether in the kola-openstack job or locally through the CLI.

harness.go:1782: Cluster failed starting machines: waiting for instance to run: 
Server reported ERROR status: 
{500 2024-11-22 20:56:47 +0000 UTC  Build of instance 8837696b-6172-4721-aa4a-729e576573d5 aborted: Failed to allocate the network(s), not rescheduling.

The issue seems to be that the private network we attach to the servers is not available, but the network looks fine in the CLI and the cloud console.

When creating a server using openstack server create --debug --network=private <other server creation arguments>, the following debug message can be seen:

RESP BODY: {"NeutronError": {"type": "NetworkNotFound", "message": "Network private could not be found.", "detail": ""}}
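For anyone who wants to scan captured --debug output for this condition, here is a minimal Python sketch that pulls the Neutron error type and message out of a RESP BODY line. It assumes the line format shown above; the helper name parse_neutron_error is hypothetical and not part of any OpenStack tooling.

```python
import json
import re

# Matches the JSON payload that follows "RESP BODY:" on a debug line.
RESP_BODY = re.compile(r"RESP BODY:\s*(\{.*\})\s*$")


def parse_neutron_error(line):
    """Return (type, message) if the line carries a NeutronError body, else None."""
    match = RESP_BODY.search(line)
    if match is None:
        return None
    try:
        body = json.loads(match.group(1))
    except json.JSONDecodeError:
        return None
    err = body.get("NeutronError")
    if not isinstance(err, dict):
        return None
    return err.get("type"), err.get("message")


line = (
    'RESP BODY: {"NeutronError": {"type": "NetworkNotFound", '
    '"message": "Network private could not be found.", "detail": ""}}'
)
print(parse_neutron_error(line))
# prints ('NetworkNotFound', 'Network private could not be found.')
```

Lines without a RESP BODY payload, or with a body that has no NeutronError key, return None, so the helper can be run over a whole debug log.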

The instance fails to launch, and the error shows that the private network cannot be found. However, as noted above, the network itself looks fine in the CLI and the cloud console.

Additional Information

OpenStack region doesn't seem to make a difference

The failure was seen using the ca-ymq-1 region in OpenStack, but I also saw the error when I tried creating a server in ams1.
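One way to compare regions from the API side is to ask Neutron in each region whether it can resolve the network by name. The following is a rough Python sketch using openstacksdk, not something that was run for this report. The clouds.yaml entry name "vexxhost" and the helper names network_visible, check_regions, and main are assumptions for illustration only.

```python
def network_visible(conn, name="private"):
    """Return True if Neutron can resolve the network by name or ID."""
    # find_network returns None when nothing matches.
    return conn.network.find_network(name) is not None


def check_regions(connect, regions, name="private"):
    """Map each region name to whether the network is visible there."""
    return {region: network_visible(connect(region), name) for region in regions}


def main():
    # openstacksdk is imported here so the helpers above can be used
    # (and tested) without the SDK installed.
    import openstack

    def connect(region):
        # "vexxhost" is a hypothetical clouds.yaml entry name.
        return openstack.connect(cloud="vexxhost", region_name=region)

    print(check_regions(connect, ["ca-ymq-1", "ams1"]))
```

Calling main() needs valid credentials for the cloud. A region reporting False here would match the NetworkNotFound response, while True everywhere would suggest the lookup only fails on the server creation path.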

Timing of failure

We saw a successful kola-openstack run on 2024-11-22 8:54 UTC, but then saw the following error in a kola-openstack run at 2024-11-22 8:13 UTC:

[2024-11-22T08:11:42.185Z] + ore openstack --config-file=**** --region=ca-ymq-1 create-image --file=/home/jenkins/agent/workspace/kola-openstack/builds/41.20241119.20.1/aarch64/fedora-coreos-41.20241119.20.1-openstack.aarch64.qcow2 --name=kola-fedora-coreos-testing-devel-aarch64 --arch=aarch64
[2024-11-22T08:12:50.015Z] Couldn't create image: creating image: Expected HTTP response code [201] when accessing [POST https://image.public.mtl1.vexxhost.net/v2/images], but got 504 instead
[2024-11-22T08:12:50.015Z] <html>
[2024-11-22T08:12:50.015Z] <head><title>504 Gateway Time-out</title></head>
[2024-11-22T08:12:50.015Z] <body>
[2024-11-22T08:12:50.015Z] <center><h1>504 Gateway Time-out</h1></center>
[2024-11-22T08:12:50.015Z] <hr><center>nginx</center>
[2024-11-22T08:12:50.015Z] </body>
[2024-11-22T08:12:50.015Z] </html>

We then started seeing the server creation failures on all subsequent runs, starting at 2024-11-22 17:52 UTC.

There may have been some stability issues this morning that affected our host. The VexxHost status page seems green, though: https://status.vexxhost.com/

Nova Compute Logs

I searched for similar instances of this failure, and articles/forums point towards checking the Nova compute logs at /var/log/nova/nova-compute.log and running a command as root on the host to resolve the issue. However, we don't have access to the host resources.

marmijo commented 1 week ago

PR to disable kola-openstack in the pipeline for the weekend: https://github.com/coreos/fedora-coreos-pipeline/pull/1062