canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

VM CPU auto pinning causes slowdowns and stealtime #14133

Closed: pkramme closed this issue 2 weeks ago

pkramme commented 1 month ago

Required information

Issue description

The introduction of automatic core scheduling has led to a significant decrease in performance in our infrastructure, with weird problems that make no sense unless you are aware of this change, such as general slowdowns and high steal time inside the VMs.

LXD's current placement scheduler doesn't seem to understand hardware topology, which is really surprising considering that many new CPUs are asymmetric and that, on the kernel side, a lot of work goes into putting workloads on "the best core for the job", with features like AMD Preferred Core (and its equivalents) and new schedulers like EEVDF.

It seems odd to put these placement decisions into LXD and turn them on by default, without an off switch, when LXD cannot tell whether this will cause significant problems. LXD simply does not have enough data, and static round-robin placement is far too simplistic.

From our perspective this is a significant design error in this feature, and we ask that it either be

  1. reworked so that hardware topology is accurately picked up, including L3 cache differences, CCD layouts, preferred-core data, etc.,
  2. enhanced with an option to turn it off completely,
  3. turned off by default, or
  4. removed.

Additionally, snap's auto-update mechanism is what introduced this new feature into our infrastructure (which by itself is fine), and we'd ask you to consider that features like this are continuously applied to real workloads and, even outside the LTS channel, should at least not be harmful.

Information to attach

Our current hardware topology has two L3 domains of different sizes. Our VMs run typical web application workloads. The core load balancing has pinned multiple CPU-bound vCPUs onto the same physical core, leading to the steal time described above.

# lstopo
Machine (125GB total)
  Package L#0
    NUMANode L#0 (P#0 125GB)
    L3 L#0 (96MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0
        PU L#0 (P#0)
        PU L#1 (P#16)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1
        PU L#2 (P#1)
        PU L#3 (P#17)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2
        PU L#4 (P#2)
        PU L#5 (P#18)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3
        PU L#6 (P#3)
        PU L#7 (P#19)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4
        PU L#8 (P#4)
        PU L#9 (P#20)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5
        PU L#10 (P#5)
        PU L#11 (P#21)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6
        PU L#12 (P#6)
        PU L#13 (P#22)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7
        PU L#14 (P#7)
        PU L#15 (P#23)
    L3 L#1 (32MB)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8
        PU L#16 (P#8)
        PU L#17 (P#24)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9
        PU L#18 (P#9)
        PU L#19 (P#25)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10
        PU L#20 (P#10)
        PU L#21 (P#26)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11
        PU L#22 (P#11)
        PU L#23 (P#27)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12
        PU L#24 (P#12)
        PU L#25 (P#28)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13
        PU L#26 (P#13)
        PU L#27 (P#29)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14
        PU L#28 (P#14)
        PU L#29 (P#30)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15
        PU L#30 (P#15)
        PU L#31 (P#31)
pkramme commented 1 month ago

We threw together a quick script to visualize the problem; it lists, for each host CPU, the VMs whose vCPU threads are pinned to it (a simplified sketch of the approach follows the output):

0:  seg18-app1, seg18-mysql1
1:  seg17-app1, seg18-lb1
2:  seg19-app1, seg19-redis1
3:  seg19-mysql1
4:  seg17-app1
5:  seg17-mysql1, seg18-redis1
6:  seg18-redis1, seg19-mysql1
7:  seg19-app1, seg19-mysql1
8:  seg18-app1, seg19-app1
9:  seg18-app1, seg19-app1
10: seg17-lb1, seg18-mysql1
11: seg19-mysql1
12: seg19-app1, seg19-redis1
13: seg18-app1, seg18-mysql1
14: seg17-app1, seg17-redis1
15: seg18-mysql1
16: seg19-mysql1
17: seg19-app1, seg19-lb1
18: seg17-app1, seg17-redis1
19: seg17-mysql1, seg19-app1
20: seg18-app1, seg18-mysql1
21: seg19-mysql1
22: seg18-app1, seg18-mysql1
23: seg19-mysql1
24: seg19-mysql1
25: seg17-lb1, seg18-mysql1
26: seg18-mysql1
27: seg17-app1, seg18-app1
28: seg17-app1, seg18-app1
29: seg17-app1, seg19-app1
30: seg17-app1, seg18-lb1
31: seg19-lb1
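
A simplified sketch of the approach (not the exact script; it assumes the instance name follows QEMU's -name argument on the command line and treats any thread confined to a subset of the host CPUs as pinned):

#!/bin/bash
# Run on the LXD host: for every QEMU process, read the CPU affinity of each
# of its threads and build a host CPU -> VMs map from the pinned ones.
total=$(nproc --all)
declare -A cpu_to_vms

for pid in $(pgrep -f qemu-system); do
    # Assumption: the instance name is the argument after QEMU's -name flag.
    name=$(tr '\0' '\n' < "/proc/$pid/cmdline" | grep -x -A1 -- '-name' | tail -n1)
    [ -n "$name" ] || continue

    for status in /proc/"$pid"/task/*/status; do
        allowed=$(awk '/^Cpus_allowed_list:/ {print $2}' "$status") || continue

        # Expand lists/ranges such as "3,16-17" into individual CPU numbers.
        cpus=()
        for part in ${allowed//,/ }; do
            if [[ $part == *-* ]]; then
                cpus+=($(seq "${part%-*}" "${part#*-}"))
            else
                cpus+=("$part")
            fi
        done

        # Threads allowed on every host CPU are not pinned; skip them.
        (( ${#cpus[@]} < total )) || continue

        for cpu in "${cpus[@]}"; do
            cpu_to_vms[$cpu]+="$name "
        done
    done
done

# Print one line per host CPU with the (deduplicated) VMs pinned to it.
for cpu in $(printf '%s\n' "${!cpu_to_vms[@]}" | sort -n); do
    printf '%-3s %s\n' "$cpu:" "$(printf '%s\n' ${cpu_to_vms[$cpu]} | sort -u | paste -sd, -)"
done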

The general rule with this system is that the current placement puts very latency-critical systems on the same (hyper)core while leaving systems with no real load on cores of their own. Even if this were a completely symmetrical CPU, and even if none of these cores were hyperthread siblings, it would still waste resources whenever those VMs are not equally loaded.

tomponline commented 1 month ago

Thanks for your detailed report!

Yeah this was an area of concern originally:

Note: On systems that have mixed performance and efficiency cores (P+E) you may find that VM performance is decreased due to the way LXD now pins some of the VM’s vCPUs to efficiency cores rather than letting the Linux scheduler dynamically schedule them. You can use the explicit CPU pinning feature if needed to avoid this.

https://discourse.ubuntu.com/t/lxd-6-1-has-been-released/46259#vm-automatic-core-pinning-load-balancing
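
For reference, the explicit pinning mentioned in that note is done by setting limits.cpu to a CPU set or range rather than a count; a minimal sketch, using a hypothetical VM named v1 and the first four cores of your larger L3 domain:

# lxc config set v1 limits.cpu 0-3

Depending on the version, a running VM may need a restart for the new pinning to take effect.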

But we are considering options 2 and 3 of your suggestions.

pkramme commented 1 month ago

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like stealtime or cpu pressure. We'd much rather just let the kernel handle it.

tomponline commented 1 month ago

Thank you for your quick response! Would it be possible to get a patch for the 6.1 series that would give us the option to turn this off? Otherwise we'd have to write tooling to repin the VMs based on things like stealtime or cpu pressure. We'd much rather just let the kernel handle it.

The latest 5.21/stable LTS series does not have this feature (deliberately, because it changes the default behaviour), so you could try that. It's more suitable for production use anyway, as latest/stable is the moving feature-release channel and doesn't support downgrades.

See https://documentation.ubuntu.com/lxd/en/latest/installing/#installing-release

The 6.1 release won't get patches now; it will be replaced by 6.2, 6.3, etc. Hopefully we can land the new setting in one of those releases, but 6.2 is imminent, so it might not make it into that one.
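
For reference, the channel switch itself is a single snap refresh, though per the note above, going from latest/stable back to 5.21 is a downgrade and isn't supported in place, so in practice it may mean redeploying hosts or migrating instances:

# snap refresh lxd --channel=5.21/stable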

tomponline commented 1 month ago

I've been chatting with @morphis and he proposes adding a new setting:

limits.cpu.pin_strategy=[none|auto]

Where none would disable auto pinning (and become the new default), and auto would keep the current 6.1 default behaviour.
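
If it lands as proposed, opting a VM out of automatic pinning would presumably look something like this (hypothetical until the setting actually ships, with v1 as a placeholder instance name):

# lxc config set v1 limits.cpu.pin_strategy=none

Presumably it could also be set through a profile to cover all VMs at once.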

pkramme commented 1 month ago

This would be great. We've begun reverting to 5.21, but having the option to disable this would still be valuable, especially once we eventually move to the next LTS release or if we stay on the feature releases. Thanks a lot for your work so far @tomponline and @kadinsayani!