canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

Nested LXD virtualization in peer containers causes vsock ID conflicts #11508

Closed mpontillo closed 1 year ago

mpontillo commented 1 year ago

Required information

lxc info output ``` $ lxc info config: {} api_extensions: - storage_zfs_remove_snapshots - container_host_shutdown_timeout - container_stop_priority - container_syscall_filtering - auth_pki - container_last_used_at - etag - patch - usb_devices - https_allowed_credentials - image_compression_algorithm - directory_manipulation - container_cpu_time - storage_zfs_use_refquota - storage_lvm_mount_options - network - profile_usedby - container_push - container_exec_recording - certificate_update - container_exec_signal_handling - gpu_devices - container_image_properties - migration_progress - id_map - network_firewall_filtering - network_routes - storage - file_delete - file_append - network_dhcp_expiry - storage_lvm_vg_rename - storage_lvm_thinpool_rename - network_vlan - image_create_aliases - container_stateless_copy - container_only_migration - storage_zfs_clone_copy - unix_device_rename - storage_lvm_use_thinpool - storage_rsync_bwlimit - network_vxlan_interface - storage_btrfs_mount_options - entity_description - image_force_refresh - storage_lvm_lv_resizing - id_map_base - file_symlinks - container_push_target - network_vlan_physical - storage_images_delete - container_edit_metadata - container_snapshot_stateful_migration - storage_driver_ceph - storage_ceph_user_name - resource_limits - storage_volatile_initial_source - storage_ceph_force_osd_reuse - storage_block_filesystem_btrfs - resources - kernel_limits - storage_api_volume_rename - macaroon_authentication - network_sriov - console - restrict_devlxd - migration_pre_copy - infiniband - maas_network - devlxd_events - proxy - network_dhcp_gateway - file_get_symlink - network_leases - unix_device_hotplug - storage_api_local_volume_handling - operation_description - clustering - event_lifecycle - storage_api_remote_volume_handling - nvidia_runtime - container_mount_propagation - container_backup - devlxd_images - container_local_cross_pool_handling - proxy_unix - proxy_udp - clustering_join - proxy_tcp_udp_multi_port_handling - network_state - proxy_unix_dac_properties - container_protection_delete - unix_priv_drop - pprof_http - proxy_haproxy_protocol - network_hwaddr - proxy_nat - network_nat_order - container_full - candid_authentication - backup_compression - candid_config - nvidia_runtime_config - storage_api_volume_snapshots - storage_unmapped - projects - candid_config_key - network_vxlan_ttl - container_incremental_copy - usb_optional_vendorid - snapshot_scheduling - snapshot_schedule_aliases - container_copy_project - clustering_server_address - clustering_image_replication - container_protection_shift - snapshot_expiry - container_backup_override_pool - snapshot_expiry_creation - network_leases_location - resources_cpu_socket - resources_gpu - resources_numa - kernel_features - id_map_current - event_location - storage_api_remote_volume_snapshots - network_nat_address - container_nic_routes - rbac - cluster_internal_copy - seccomp_notify - lxc_features - container_nic_ipvlan - network_vlan_sriov - storage_cephfs - container_nic_ipfilter - resources_v2 - container_exec_user_group_cwd - container_syscall_intercept - container_disk_shift - storage_shifted - resources_infiniband - daemon_storage - instances - image_types - resources_disk_sata - clustering_roles - images_expiry - resources_network_firmware - backup_compression_algorithm - ceph_data_pool_name - container_syscall_intercept_mount - compression_squashfs - container_raw_mount - container_nic_routed - container_syscall_intercept_mount_fuse - container_disk_ceph - 
virtual-machines - image_profiles - clustering_architecture - resources_disk_id - storage_lvm_stripes - vm_boot_priority - unix_hotplug_devices - api_filtering - instance_nic_network - clustering_sizing - firewall_driver - projects_limits - container_syscall_intercept_hugetlbfs - limits_hugepages - container_nic_routed_gateway - projects_restrictions - custom_volume_snapshot_expiry - volume_snapshot_scheduling - trust_ca_certificates - snapshot_disk_usage - clustering_edit_roles - container_nic_routed_host_address - container_nic_ipvlan_gateway - resources_usb_pci - resources_cpu_threads_numa - resources_cpu_core_die - api_os - container_nic_routed_host_table - container_nic_ipvlan_host_table - container_nic_ipvlan_mode - resources_system - images_push_relay - network_dns_search - container_nic_routed_limits - instance_nic_bridged_vlan - network_state_bond_bridge - usedby_consistency - custom_block_volumes - clustering_failure_domains - resources_gpu_mdev - console_vga_type - projects_limits_disk - network_type_macvlan - network_type_sriov - container_syscall_intercept_bpf_devices - network_type_ovn - projects_networks - projects_networks_restricted_uplinks - custom_volume_backup - backup_override_name - storage_rsync_compression - network_type_physical - network_ovn_external_subnets - network_ovn_nat - network_ovn_external_routes_remove - tpm_device_type - storage_zfs_clone_copy_rebase - gpu_mdev - resources_pci_iommu - resources_network_usb - resources_disk_address - network_physical_ovn_ingress_mode - network_ovn_dhcp - network_physical_routes_anycast - projects_limits_instances - network_state_vlan - instance_nic_bridged_port_isolation - instance_bulk_state_change - network_gvrp - instance_pool_move - gpu_sriov - pci_device_type - storage_volume_state - network_acl - migration_stateful - disk_state_quota - storage_ceph_features - projects_compression - projects_images_remote_cache_expiry - certificate_project - network_ovn_acl - projects_images_auto_update - projects_restricted_cluster_target - images_default_architecture - network_ovn_acl_defaults - gpu_mig - project_usage - network_bridge_acl - warnings - projects_restricted_backups_and_snapshots - clustering_join_token - clustering_description - server_trusted_proxy - clustering_update_cert - storage_api_project - server_instance_driver_operational - server_supported_storage_drivers - event_lifecycle_requestor_address - resources_gpu_usb - clustering_evacuation - network_ovn_nat_address - network_bgp - network_forward - custom_volume_refresh - network_counters_errors_dropped - metrics - image_source_project - clustering_config - network_peer - linux_sysctl - network_dns - ovn_nic_acceleration - certificate_self_renewal - instance_project_move - storage_volume_project_move - cloud_init - network_dns_nat - database_leader - instance_all_projects - clustering_groups - ceph_rbd_du - instance_get_full - qemu_metrics - gpu_mig_uuid - event_project - clustering_evacuation_live - instance_allow_inconsistent_copy - network_state_ovn - storage_volume_api_filtering - image_restrictions - storage_zfs_export - network_dns_records - storage_zfs_reserve_space - network_acl_log - storage_zfs_blocksize - metrics_cpu_seconds - instance_snapshot_never - certificate_token - instance_nic_routed_neighbor_probe - event_hub - agent_nic_config - projects_restricted_intercept - metrics_authentication - images_target_project - cluster_migration_inconsistent_copy - cluster_ovn_chassis - container_syscall_intercept_sched_setscheduler - 
storage_lvm_thinpool_metadata_size - storage_volume_state_total - instance_file_head - instances_nic_host_name - image_copy_profile - container_syscall_intercept_sysinfo - clustering_evacuation_mode - resources_pci_vpd - qemu_raw_conf - storage_cephfs_fscache - network_load_balancer - vsock_api - instance_ready_state - network_bgp_holdtime - storage_volumes_all_projects - metrics_memory_oom_total - storage_buckets - storage_buckets_create_credentials - metrics_cpu_effective_total - projects_networks_restricted_access - storage_buckets_local - loki - acme - internal_metrics - cluster_join_token_expiry - remote_token_expiry - init_preseed - storage_volumes_created_at - cpu_hotplug - projects_networks_zones - network_txqueuelen - cluster_member_state - instances_placement_scriptlet - storage_pool_source_wipe - zfs_block_mode - instance_generation_id - disk_io_cache api_status: stable api_version: "1.0" auth: trusted public: false auth_methods: - tls environment: addresses: [] architectures: - x86_64 - i686 certificate: | -----BEGIN CERTIFICATE----- -----END CERTIFICATE----- certificate_fingerprint: 9c473ce74f6bda12dd4ec97c3a28cd8cd4063fcfbfcc0435d9bfe1e7ba7f15f6 driver: lxc | qemu driver_version: 5.0.2 | 7.1.0 firewall: nftables kernel: Linux kernel_architecture: x86_64 kernel_features: idmapped_mounts: "true" netnsid_getifaddrs: "true" seccomp_listener: "true" seccomp_listener_continue: "true" shiftfs: "false" uevent_injection: "true" unpriv_fscaps: "true" kernel_version: 5.19.0-35-generic lxc_features: cgroup2: "true" core_scheduling: "true" devpts_fd: "true" idmapped_mounts_v2: "true" mount_injection_file: "true" network_gateway_device_route: "true" network_ipvlan: "true" network_l2proxy: "true" network_phys_macvlan_mtu: "true" network_veth_router: "true" pidfd: "true" seccomp_allow_deny_syntax: "true" seccomp_notify: "true" seccomp_proxy_send_notify_fd: "true" os_name: Ubuntu os_version: "22.04" project: default server: lxd server_clustered: false server_event_mode: full-mesh server_name: timeloop server_pid: 1603373 server_version: "5.12" storage: zfs storage_version: 2.1.5-1ubuntu6 storage_supported_drivers: - name: ceph version: 17.2.0 remote: true - name: cephfs version: 17.2.0 remote: true - name: cephobject version: 17.2.0 remote: true - name: dir version: "1" remote: false - name: lvm version: 2.03.11(2) (2021-01-08) / 1.02.175 (2021-01-08) / 4.47.0 remote: false - name: zfs version: 2.1.5-1ubuntu6 remote: false - name: btrfs version: 5.16.2 remote: false ``` ---

Issue description

When LXD is used to launch virtual machines inside multiple peer containers, the VMs can fail to start with errors such as vhost-vsock: unable to set guest cid: Address already in use.

Steps to reproduce

Create a profile allowing nested virtualization, such as:

lxc profile create virt && \
lxc profile set virt security.nesting=true && \
lxc profile device add virt kvm unix-char source=/dev/kvm && \
lxc profile device add virt vhost-net unix-char source=/dev/vhost-net && \
lxc profile device add virt vhost-vsock unix-char source=/dev/vhost-vsock && \
lxc profile device add virt vsock unix-char source=/dev/vsock
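
As a quick sanity check, the resulting profile can be inspected; it should list security.nesting along with the four unix-char devices:

lxc profile show virt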

Launch two or more containers with this profile, such as:

lxc launch ubuntu:jammy hv1 -p virt -p default
lxc launch ubuntu:jammy hv2 -p virt -p default

Use lxc shell to enter both containers and run:

lxd init --auto
lxc launch images:ubuntu/bionic/cloud bionic --vm
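
Optionally, the device passthrough can be verified from the host; the nodes added by the virt profile should be visible inside each container (hv1 shown here, likewise for hv2):

lxc exec hv1 -- ls -l /dev/kvm /dev/vhost-net /dev/vhost-vsock /dev/vsock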

Expected results

Both virtual machines should be created and started.

Actual results

The second VM to be created will fail to start. lxc info --show-log local:bionic will display:

[...]
qemu-system-x86_64:/var/snap/lxd/common/lxd/logs/bionic/qemu.conf:115: vhost-vsock: unable to set guest cid: Address already in use
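
The clash can be confirmed from the host by reading the volatile.vsock_id that each nested LXD assigned to its VM; with the names from the steps above, both commands are expected to print the same value:

lxc exec hv1 -- lxc config get bionic volatile.vsock_id
lxc exec hv2 -- lxc config get bionic volatile.vsock_id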

Additional Information

There was an attempt to address this in PR #10216. However, that fix seems to assume that the vsock ID space is only shared with the parent, not with peer containers.

libvirt seems to avoid this problem by iterating over usable IDs until a free ID is found.
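
A crude CLI-level sketch of the same idea, using the volatile.vsock_id override discussed later in this thread, would be to retry the failing VM (run inside hv2 from the reproducer) with successive IDs until one is accepted; the ID range here is purely illustrative:

for cid in $(seq 42 142); do
    lxc config set bionic volatile.vsock_id="$cid"
    if lxc start bionic; then
        break
    fi
done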

tomponline commented 1 year ago

Yes, this is the track I started to go down with https://github.com/lxc/lxd/pull/10216#issuecomment-1093320228

mpontillo commented 1 year ago

I like where you were going with this, @tomponline. A few thoughts on the commit you referenced:

Anis-cpu-13 commented 1 year ago

Hello, I am writing to express my interest in working on the issue described in bug report #11508 for LXD. As a university student, I am eager to contribute to open-source projects and gain valuable experience in software development.

I have experience with Linux systems and I am familiar with virtualization technologies. I believe my skills and knowledge would be useful in resolving this issue, and I am willing to work with the LXD development team to find a solution to the problem and contribute to the project.

Thank you for considering my interest in this issue. I look forward to hearing back from you.

tomponline commented 1 year ago

Thanks @Anis-cpu-13 assigned to you!

Gio2241 commented 1 year ago

I think I have exactly the same issue: peer containers causing a vsock ID conflict when I try to launch a QEMU VM within the containers.

What's the current status of the fix?

Is there a workaround/hack for now?

@tomponline

Gio2241 commented 1 year ago

@stgraber, this issue is one of the things stopping our company from moving to LXD; is there any workaround until the fix lands? I thought 5.14 would fix this one :/

tomponline commented 1 year ago

@kochia7 what is the use case for running VMs inside containers? (It may be that there is a short-term workaround until this is fixed.)

Gio2241 commented 1 year ago

@kochia7 what is the use case for running VMs inside containers? (It may be that there is a short-term workaround until this is fixed.)

We have Android machines (Qemu/CrossVM) running one per container; unfortunately we are not able to run VMs directly on the machine, so we are planning to use LXC containers as lightweight isolation for each Qemu VM.

tomponline commented 1 year ago

Thanks. What is the reason for "we are not able to run VMs directly on the machine"?

And are you aware that by passing through non-namespaced devices like /dev/kvm you are potentially exposing your host to attacks from the containers? I just want to check that you are aware that doing this reduces the isolation. I also wonder what sort of isolation you are expecting from running VMs inside containers?

Gio2241 commented 1 year ago

The VMM we are using creates artifacts that interfere with those of other machines; it's how the VMM is built, and it cannot run several VMs in parallel. So LXC provides just enough isolation to make them work.

tomponline commented 1 year ago

You can set volatile.vsock_id on the instance before starting it, so if you're able to set them to non-conflicting IDs, then you can workaround the problem for now.

lxc config set <VM instance> volatile.vsock_id=n
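
For example, with the reproducer names from this issue, each nested VM can be given a distinct ID before it is started (the values are arbitrary, they just need to differ):

# run inside hv1
lxc config set bionic volatile.vsock_id=42
# run inside hv2
lxc config set bionic volatile.vsock_id=43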

tomponline commented 1 year ago

I'm not sure how it will work with LXD's own vsock listener, though (as opposed to the lxd-agent's). It may be that vsock just won't work properly when run inside containers.

Gio2241 commented 1 year ago

You can set volatile.vsock_id on the instance before starting it, so if you're able to set them to non-conflicting IDs, then you can workaround the problem for now.

lxc config set <VM instance> volatile.vsock_id=n

I will give it a try, seems promising! Thanks!

Gio2241 commented 1 year ago

I used lxc config set <VM instance> volatile.vsock_id=n for the LXC containers from the host, and guest-cid/-vsock_guest_cid for Qemu within the containers for the nested VMs, and it worked!
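
For reference, the QEMU side of that workaround looks something like the following, with the CID value purely illustrative and the rest of the command line elided:

qemu-system-x86_64 ... -device vhost-vsock-pci,guest-cid=50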

tomponline commented 1 year ago

Only the last one would have had any effect, as setting volatile.vsock_id on a container doesn't do anything.

Gio2241 commented 1 year ago

You can set volatile.vsock_id on the instance before starting it, so if you're able to set them to non-conflicting IDs, then you can workaround the problem for now.

lxc config set <VM instance> volatile.vsock_id=n

Didn't really work for LXD: https://github.com/lxc/lxd/issues/11739#issuecomment-1567389272