Instance can't launch after successful deployment on grid5000 Rennes x Paravance

Boufcoulman commented 2 years ago

Hi,

I'm using enos again after a few months away. When I deploy (tested both with no kolla-ansible version pinned and with the train-eol tag in the reservation.yml file), the deployment goes well, but I can't create an instance. When I try, the UI shows this error: Error: Failed to perform requested operation on instance "test", the instance has an error status: Please try again later [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance 3828f1c3-d7dd-4309-9119-5f3eecce826f.]. Out of curiosity, I checked every "nova" Docker container running on my nodes, and docker logs nova_compute gives:

+ exec nova-compute
Running command: 'nova-compute'
Traceback (most recent call last):
  File "/usr/lib/python3.6/site-packages/eventlet/hubs/poll.py", line 111, in wait
    listener.cb(fileno)
  File "/usr/lib/python3.6/site-packages/eventlet/greenthread.py", line 221, in main
    result = function(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/nova/utils.py", line 675, in context_wrapper
    return func(*args, **kwargs)
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 1716, in _allocate_network_async
    six.reraise(*exc_info)
  File "/usr/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/nova/compute/manager.py", line 1699, in _allocate_network_async
    resource_provider_mapping=resource_provider_mapping)
  File "/usr/lib/python3.6/site-packages/nova/network/neutronv2/api.py", line 1040, in allocate_for_instance
    bind_host_id, requested_ports_dict)
  File "/usr/lib/python3.6/site-packages/nova/network/neutronv2/api.py", line 1169, in _update_ports_for_instance
    vif.destroy()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 220, in __exit__
    self.force_reraise()
  File "/usr/lib/python3.6/site-packages/oslo_utils/excutils.py", line 196, in force_reraise
    six.reraise(self.type_, self.value, self.tb)
  File "/usr/lib/python3.6/site-packages/six.py", line 703, in reraise
    raise value
  File "/usr/lib/python3.6/site-packages/nova/network/neutronv2/api.py", line 1139, in _update_ports_for_instance
    port_client, instance, port_id, port_req_body)
  File "/usr/lib/python3.6/site-packages/nova/network/neutronv2/api.py", line 513, in _update_port
    _ensure_no_port_binding_failure(port)
  File "/usr/lib/python3.6/site-packages/nova/network/neutronv2/api.py", line 236, in _ensure_no_port_binding_failure
    raise exception.PortBindingFailed(port_id=port['id'])
nova.exception.PortBindingFailed: Binding failed for port 022879f1-3cc3-482f-b144-574f38d2b4f0, please check neutron logs for more information.
Removing descriptor: 18

and more errors in any of the other "neutron" or "nova" docker containers.

From inside a container, I opened /var/log/kolla/neutron/neutron-server.log, where the following error appears several times: ERROR neutron.plugins.ml2.managers [req-b8ad0917-7047-426a-8e39-06896734684b 47697201dac64047a19f0f2eafd75259 b4036f17699945f7b8ab107fb88fe77b - default default] Failed to bind port 022879f1-3cc3-482f-b144-574f38d2b4f0 on host paravance-13-kavlan-4.rennes.grid5000.fr for vnic_type normal using segments [{'id': '88d45cfd-8971-4d1c-88f2-1054a819eb6f', 'network_type': 'flat', 'physical_network': 'physnet1', 'segmentation_id': None, 'network_id': 'c35dbc4d-5ba7-4c5b-bb00-317edf08cdd6'}]
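To see at a glance which port, host and segment a binding failure concerns, the ML2 message can be parsed with a few lines of Python. This is only a sketch: the regular expression is written against the single log line quoted in this report, and the names `BIND_FAILURE`, `parse_bind_failure` and `SAMPLE` are made up for the example, so other Neutron versions may need a different pattern.

```python
import ast
import re

# Pattern written against the one log line quoted in this issue (an assumption,
# not an official Neutron log format).
BIND_FAILURE = re.compile(
    r"Failed to bind port (?P<port>[0-9a-f-]+) on host (?P<host>\S+) "
    r"for vnic_type (?P<vnic_type>\S+) using segments (?P<segments>\[.*\])"
)


def parse_bind_failure(line):
    """Return port, host, vnic_type and segments from a bind-failure line, or None."""
    match = BIND_FAILURE.search(line)
    if match is None:
        return None
    info = match.groupdict()
    # The segment list is logged as a Python literal, so literal_eval can read it.
    info["segments"] = ast.literal_eval(info["segments"])
    return info


# The log line from this report, split across strings for readability.
SAMPLE = (
    "ERROR neutron.plugins.ml2.managers "
    "[req-b8ad0917-7047-426a-8e39-06896734684b 47697201dac64047a19f0f2eafd75259 "
    "b4036f17699945f7b8ab107fb88fe77b - default default] "
    "Failed to bind port 022879f1-3cc3-482f-b144-574f38d2b4f0 on host "
    "paravance-13-kavlan-4.rennes.grid5000.fr for vnic_type normal using segments "
    "[{'id': '88d45cfd-8971-4d1c-88f2-1054a819eb6f', 'network_type': 'flat', "
    "'physical_network': 'physnet1', 'segmentation_id': None, "
    "'network_id': 'c35dbc4d-5ba7-4c5b-bb00-317edf08cdd6'}]"
)

if __name__ == "__main__":
    info = parse_bind_failure(SAMPLE)
    segment = info["segments"][0]
    print(info["port"], info["host"], segment["network_type"], segment["physical_network"])
```

The `network_type` and `physical_network` fields show which network the port was being bound to, here a flat network on physnet1.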

For every deployment, I'm using a clean python3 virtualenv with pip and enos up to date.

Is anyone still able to make it work?

jonglezb commented 2 years ago

I could reproduce the same issue. The problem is in fact coming from Nova on the compute nodes:

ERROR nova.virt.libvirt.driver Failed to start libvirt guest: libvirt.libvirtError: unable to open '/sys/fs/cgroup/machine/qemu-2-instance-00000008.libvirt-qemu/': No such file or directory

In my case I'm using debian11-min as my base G5K image, so the kernel/systemd is running cgroup v2, and I think that's part of the problem. Which image are you using?
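A quick way to check which cgroup version a node is running is to look for the `cgroup.controllers` file, which the kernel exposes at the root of a unified (v2) hierarchy. The sketch below assumes the usual `/sys/fs/cgroup` mount point; the function name `cgroup_version` is made up for the example.

```python
from pathlib import Path


def cgroup_version(root="/sys/fs/cgroup"):
    """Return 2 if `root` is a unified cgroup v2 hierarchy, otherwise 1.

    A legacy or hybrid layout (or a missing directory) is reported as 1.
    """
    # On cgroup v2, the root of the hierarchy contains a cgroup.controllers file.
    return 2 if (Path(root) / "cgroup.controllers").is_file() else 1


if __name__ == "__main__":
    print(f"cgroup v{cgroup_version()}")
```

Run on a compute node, this tells you whether the host matches the cgroup v2 setup described above.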

jonglezb commented 2 years ago

So, using debian10-min on the G5K nodes still works fine.

Debian bullseye as a host is only officially supported by Kolla-ansible starting with version 12.

We are currently using Kolla-ansible version 10. I tried to manually configure the cgroupns option in Docker with this version, but I couldn't make it work.

So the next step is updating to a newer kolla-ansible! It didn't work out of the box; I will debug it more in the coming weeks.

Boufcoulman commented 1 year ago

Hello, I think I was already using debian10-min, since I was not specifying otherwise in my reservation file. I retried enos deploy today and it worked. I figured out that the issue I was facing occurred when creating an instance on the default public network. If I first create it on the default private network and then manually allocate a public IP, it works! Sorry for the inconvenience.

jonglezb commented 1 year ago

Ah, yes, creating ports directly on the public network is not supported! It can only be used for floating IPs.
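For reference, the working sequence (boot on the private network, then attach a floating IP from the public one) can be sketched as the three `openstack` CLI calls built below. The network names `private` and `public`, the image `cirros`, the flavor `m1.tiny` and the address `203.0.113.10` are placeholders assumed for the example, not values confirmed in this thread; the helper function names are hypothetical too.

```python
import shlex


def boot_on_private(server, image, flavor, network="private"):
    """Command that boots the instance on the private network and waits for it."""
    return ["openstack", "server", "create", "--image", image,
            "--flavor", flavor, "--network", network, "--wait", server]


def allocate_floating_ip(public_network="public"):
    """Command that allocates a floating IP and prints only its address."""
    return ["openstack", "floating", "ip", "create",
            "-f", "value", "-c", "floating_ip_address", public_network]


def attach_floating_ip(server, address):
    """Command that associates an allocated floating IP with the instance."""
    return ["openstack", "server", "add", "floating", "ip", server, address]


if __name__ == "__main__":
    # Placeholder values; replace them with the names used in your deployment.
    steps = [
        boot_on_private("test", image="cirros", flavor="m1.tiny"),
        allocate_floating_ip(),
        attach_floating_ip("test", "203.0.113.10"),
    ]
    for step in steps:
        print(shlex.join(step))
```

The script only prints the commands. To run them for real, pass each list to `subprocess.run` and feed the address printed by the second command into the third.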

BeyondTheClouds / enos

Instance can't launch after successful deployment on grid5000 Rennes x Paravance #349