canonical / lxd

Powerful system container and virtual machine manager
https://canonical.com/lxd
GNU Affero General Public License v3.0

blkio limits don't get applied on attached RBD volume #6277

Closed · maran closed this issue 4 years ago

maran commented 4 years ago

Required information

config: {}
api_extensions:
- storage_zfs_remove_snapshots
- container_host_shutdown_timeout
- container_stop_priority
- container_syscall_filtering
- auth_pki
- container_last_used_at
- etag
- patch
- usb_devices
- https_allowed_credentials
- image_compression_algorithm
- directory_manipulation
- container_cpu_time
- storage_zfs_use_refquota
- storage_lvm_mount_options
- network
- profile_usedby
- container_push
- container_exec_recording
- certificate_update
- container_exec_signal_handling
- gpu_devices
- container_image_properties
- migration_progress
- id_map
- network_firewall_filtering
- network_routes
- storage
- file_delete
- file_append
- network_dhcp_expiry
- storage_lvm_vg_rename
- storage_lvm_thinpool_rename
- network_vlan
- image_create_aliases
- container_stateless_copy
- container_only_migration
- storage_zfs_clone_copy
- unix_device_rename
- storage_lvm_use_thinpool
- storage_rsync_bwlimit
- network_vxlan_interface
- storage_btrfs_mount_options
- entity_description
- image_force_refresh
- storage_lvm_lv_resizing
- id_map_base
- file_symlinks
- container_push_target
- network_vlan_physical
- storage_images_delete
- container_edit_metadata
- container_snapshot_stateful_migration
- storage_driver_ceph
- storage_ceph_user_name
- resource_limits
- storage_volatile_initial_source
- storage_ceph_force_osd_reuse
- storage_block_filesystem_btrfs
- resources
- kernel_limits
- storage_api_volume_rename
- macaroon_authentication
- network_sriov
- console
- restrict_devlxd
- migration_pre_copy
- infiniband
- maas_network
- devlxd_events
- proxy
- network_dhcp_gateway
- file_get_symlink
- network_leases
- unix_device_hotplug
- storage_api_local_volume_handling
- operation_description
- clustering
- event_lifecycle
- storage_api_remote_volume_handling
- nvidia_runtime
- container_mount_propagation
- container_backup
- devlxd_images
- container_local_cross_pool_handling
- proxy_unix
- proxy_udp
- clustering_join
- proxy_tcp_udp_multi_port_handling
- network_state
- proxy_unix_dac_properties
- container_protection_delete
- unix_priv_drop
- pprof_http
- proxy_haproxy_protocol
- network_hwaddr
- proxy_nat
- network_nat_order
- container_full
- candid_authentication
- backup_compression
- candid_config
- nvidia_runtime_config
- storage_api_volume_snapshots
- storage_unmapped
- projects
- candid_config_key
- network_vxlan_ttl
- container_incremental_copy
- usb_optional_vendorid
- snapshot_scheduling
- container_copy_project
- clustering_server_address
- clustering_image_replication
- container_protection_shift
- snapshot_expiry
- container_backup_override_pool
- snapshot_expiry_creation
- network_leases_location
- resources_cpu_socket
- resources_gpu
- resources_numa
- kernel_features
- id_map_current
- event_location
- storage_api_remote_volume_snapshots
- network_nat_address
- container_nic_routes
- rbac
- cluster_internal_copy
- seccomp_notify
- lxc_features
- container_nic_ipvlan
- network_vlan_sriov
- storage_cephfs
- container_nic_ipfilter
- resources_v2
- container_exec_user_group_cwd
- container_syscall_intercept
- container_disk_shift
- storage_shifted
- resources_infiniband
- daemon_storage
- instances
- image_types
- resources_disk_sata
- clustering_roles
- images_expiry
api_status: stable
api_version: "1.0"
auth: trusted
public: false
auth_methods:
- tls
environment:
  addresses: []
  architectures:
  - x86_64
  - i686
  certificate: |
    -----BEGIN CERTIFICATE-----
    MIICGzCCAaKgAwIBAgIRAJaAdO+HvgrkGbCq34yKPVcwCgYIKoZIzj0EAwMwPjEc
    MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEeMBwGA1UEAwwVcm9vdEAxODA0
    LmRldi5ieXNoLm1lMB4XDTE5MTAwNDA3MzUwMVoXDTI5MTAwMTA3MzUwMVowPjEc
    MBoGA1UEChMTbGludXhjb250YWluZXJzLm9yZzEeMBwGA1UEAwwVcm9vdEAxODA0
    LmRldi5ieXNoLm1lMHYwEAYHKoZIzj0CAQYFK4EEACIDYgAEhsMR682NdOy37MmB
    KUFO33ElBdopi0DKHgGntL6KcLT612TVJkY8hIKeQ8Arh7UVfBHfgzeUbQLgoxUO
    Nz+1uzvzOg2euvNc++opFsFIIlifSVba0niQHEDIwJ/vD7gCo2QwYjAOBgNVHQ8B
    Af8EBAMCBaAwEwYDVR0lBAwwCgYIKwYBBQUHAwEwDAYDVR0TAQH/BAIwADAtBgNV
    HREEJjAkghAxODA0LmRldi5ieXNoLm1lhwRf06HChwQKCAABhwSsEQABMAoGCCqG
    SM49BAMDA2cAMGQCMFrqHrF3wytHd5kDsDF/cMTnj1DsjTKB3vgdGyulctL0vF41
    hTjNGdunnBc13GV4MgIwMcEjfgNumzZ17YP8Of1nREbDQ4fVtkNCzHU1Q7L635EG
    spaU9UN60c/FO4PgKUa9
    -----END CERTIFICATE-----
  certificate_fingerprint: 57e6d38c216b0a4c8790940f5b7450a45aa4f68b450c2609c103476c2ac06e32
  driver: lxc
  driver_version: 3.0.3
  kernel: Linux
  kernel_architecture: x86_64
  kernel_features:
    netnsid_getifaddrs: "true"
    seccomp_listener: "true"
    shiftfs: "true"
    uevent_injection: "true"
    unpriv_fscaps: "true"
  kernel_version: 5.0.0-29-generic
  lxc_features:
    mount_injection_file: "false"
    network_gateway_device_route: "false"
    network_ipvlan: "false"
    network_l2proxy: "false"
    network_phys_macvlan_mtu: "false"
    seccomp_notify: "false"
  project: default
  server: lxd
  server_clustered: false
  server_name: 1804.dev.bysh.me
  server_pid: 362974
  server_version: "3.18"
  storage: ceph
  storage_version: ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4)
    mimic (stable)

Issue description

As discussed here, when working with an RBD volume it appears that limits.read/limits.write are not being translated into valid cgroup limits.

Steps to reproduce

  1. Set up a Ceph RBD backend
  2. Create a container and attach an RBD volume to it
  3. Set limits in the config file, in my example:
devices:
  maran-test:
    limits.read: 1MB
    limits.write: 1MB
    path: /mnt/external
    pool: default
    source: maran-test
    type: disk
  4. Expect blkio.throttle.write_bps_device or blkio.throttle.read_bps_device to contain an entry, but they are empty (a quick way to check from the host is sketched below).
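
In case it helps, here is a small standalone Go sketch for checking the applied limits from the host. This is not LXD code; it assumes cgroup v1 and the usual /sys/fs/cgroup/blkio/lxc/<name> layout, which may differ on your host.

```go
// verifyblkio.go: dump the container's blkio throttle settings from the host
// so you can see whether LXD applied anything.
package main

import (
	"fmt"
	"log"
	"os"
	"path/filepath"
)

func main() {
	if len(os.Args) != 2 {
		log.Fatalf("usage: %s <container-name>", os.Args[0])
	}

	// Path assumption: cgroup v1 with containers under
	// /sys/fs/cgroup/blkio/lxc/<name>; adjust for your host if needed.
	base := filepath.Join("/sys/fs/cgroup/blkio/lxc", os.Args[1])

	for _, name := range []string{"blkio.throttle.read_bps_device", "blkio.throttle.write_bps_device"} {
		data, err := os.ReadFile(filepath.Join(base, name))
		if err != nil {
			log.Fatal(err)
		}
		// Each applied limit appears as a "MAJOR:MINOR BYTES" line; in this
		// report the files come back empty.
		fmt.Printf("%s:\n%s", name, data)
	}
}
```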

Let me know how I can help with this.

stgraber commented 4 years ago

So I'm pretty confused as to how limits would actually ever have worked since @tomponline's rework of devices.

The issue is that the device's Start() function is called before the container starts; it generates the list of needed mounts but doesn't actually mount anything yet. This in turn means we don't yet know what device backs the source path, so we can never compute the needed cgroup entries.

I suspect what we need to do is move the limit calculation to a PostHook, which would let us inspect the mounted disks. That PostHook should then return a RunConfig with the cgroup entries we expect, and we can have LXD apply them through LXC.
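
Roughly the shape I have in mind, as a sketch with hypothetical names (cgroupEntry and diskLimitsPostHook are made up for illustration, not the actual device package API):

```go
package device

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// cgroupEntry is a stand-in for whatever structure the hook would hand back
// to LXD: a controller file plus the value to write into it.
type cgroupEntry struct {
	Key   string // e.g. "blkio.throttle.read_bps_device"
	Value string // e.g. "252:0 1000000"
}

// diskLimitsPostHook builds the hook that would run after the container has
// started, once the volume is actually mounted and the backing block device
// can be resolved.
func diskLimitsPostHook(mountPath string, readBPS, writeBPS uint64) func() ([]cgroupEntry, error) {
	return func() ([]cgroupEntry, error) {
		// stat() on the mounted path gives the device number of the block
		// device backing the filesystem (e.g. /dev/rbdX for an RBD volume).
		var st unix.Stat_t
		if err := unix.Stat(mountPath, &st); err != nil {
			return nil, err
		}
		dev := fmt.Sprintf("%d:%d", unix.Major(uint64(st.Dev)), unix.Minor(uint64(st.Dev)))

		entries := []cgroupEntry{}
		if readBPS > 0 {
			entries = append(entries, cgroupEntry{"blkio.throttle.read_bps_device", fmt.Sprintf("%s %d", dev, readBPS)})
		}
		if writeBPS > 0 {
			entries = append(entries, cgroupEntry{"blkio.throttle.write_bps_device", fmt.Sprintf("%s %d", dev, writeBPS)})
		}
		return entries, nil
	}
}
```

The point being that the stat() can only succeed once the mount exists, which is why this has to run after container start rather than in Start().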

@tomponline does that sound right to you?

tomponline commented 4 years ago

@stgraber the cgroup settings are returned as part of the run config by Start(). When Start() is called as part of container start, the cgroup rules are translated into liblxc settings so that they are applied when the container actually starts, see https://github.com/lxc/lxd/blob/master/lxd/container_lxc.go#L2243-L2251. This is the same technique used for the actual mounts.
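
For illustration, that translation is roughly this shape (hypothetical helper and struct names, not a copy of the linked code):

```go
package main

import "fmt"

// runConfigCGroup mirrors the idea of a cgroup rule returned by a device's
// Start(): a controller file name plus the value to write into it.
type runConfigCGroup struct {
	Key   string // e.g. "blkio.throttle.read_bps_device"
	Value string // e.g. "252:0 1000000"
}

// toLiblxcConfig turns the rules into "lxc.cgroup.<key>" config lines, which
// liblxc applies when it starts the container. In real LXD these would be set
// through the liblxc bindings rather than printed.
func toLiblxcConfig(rules []runConfigCGroup) []string {
	lines := make([]string, 0, len(rules))
	for _, r := range rules {
		lines = append(lines, fmt.Sprintf("lxc.cgroup.%s = %s", r.Key, r.Value))
	}
	return lines
}

func main() {
	rules := []runConfigCGroup{
		{"blkio.throttle.read_bps_device", "252:0 1000000"},
		{"blkio.throttle.write_bps_device", "252:0 1000000"},
	}
	for _, l := range toLiblxcConfig(rules) {
		fmt.Println(l)
	}
}
```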

I'll take a look and check whether it's working on other disk types, to confirm it's not specific to RBD.

stgraber commented 4 years ago

@tomponline the problem is that those rules cannot be generated until a mount entry exists, and that mount entry won't exist until the RunConfig is applied.

That's why I'm now adding a PostRunConfig, which can be filled in through PostHooks and applied after the container has started.

This then allows us to resolve the mounts to block devices and figure out the limits.

stgraber commented 4 years ago

I've got a branch which does this now, and it fixes the limits for the root device. There's still a problem resolving the mount for devices that are hotplugged, though, so I'm still looking into those.