clearcontainers / runtime

OCI (Open Containers Initiative) compatible runtime using Virtual Machines
Apache License 2.0

changing the 9pfs options for I/O performance #1095

Open zeigerpuppy opened 6 years ago

zeigerpuppy commented 6 years ago

Description of problem

Tuning the 9pfs msize (packet payload size) and cache mode allows a significant performance boost (around 10x) in KVM. Is there a process for setting these in cc-runtime?

Example

In KVM, I have found two options that significantly increase I/O (see below).

This is on a server with a raidz2 ZFS array of 8x Micron 9100 1.2TB NVMe SSDs. There's plenty of headroom: raw I/O on this array is around 3 GB/s per process, plateauing at about 30 GB/s for 20 processes in iozone3.

With KVM guests using the Plan 9 filesystem, it looks possible to get about 1 GB/s per CPU, but we're getting only about 130 MB/s with Clear Containers and bind-mounted storage.

Host:

As the filesystem (ZFS) is consistent on the host by design, it's safe to use passthrough mode:

<filesystem type='mount' accessmode='passthrough'>
   <source dir='/export/to/guest'/>
   <target dir='mount_tag'/>
</filesystem>

Client:

In the mount options, adjusting the msize (packet payload in bytes) and disabling the client cache have a huge effect on I/O:

msize=524288,cache=none
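
For reference, a complete client-side mount line combining these options might look like the following (the mount point is illustrative; the source must match the <target dir> tag defined on the host, i.e. mount_tag above):

mount -t 9p -o trans=virtio,version=9p2000.L,msize=524288,cache=none mount_tag /mnt/shared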

Actual result

With my standard KVM guests on the same host, I get about 1 GB/s per process (measured with iozone3). With cc-runtime-backed Docker storage (bind mounts) I get only 130 MB/s.
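
The per-process figures come from iozone3 throughput runs; an invocation along these lines reproduces the kind of numbers quoted (the exact flags are illustrative, with -I using O_DIRECT to bypass the guest page cache):

iozone -I -t 4 -s 4g -r 1m -i 0 -i 1 -F /test/f1 /test/f2 /test/f3 /test/f4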

Any help with setting these options in cc-runtime would be great, as they are critical to good performance.


Settings output

[Runtime]
  Debug = false
  [Runtime.Version]
    Semver = "3.0.23"
    Commit = "64d2226"
    OCI = "1.0.1"
  [Runtime.Config]
    Path = "/usr/share/defaults/clear-containers/configuration.toml"

[Hypervisor]
  MachineType = "pc"
  Version = "QEMU emulator version 2.7.1(2.7.1+git.d4a337fe91-11.cc), Copyright (c) 2003-2016 Fabrice Bellard and the QEMU Project developers"
  Path = "/usr/bin/qemu-lite-system-x86_64"
  Debug = false
  BlockDeviceDriver = "virtio-scsi"

[Image]
  Path = "/usr/share/clear-containers/cc-20640-agent-6f6e9e.img"

[Kernel]
  Path = "/usr/share/clear-containers/vmlinuz-4.14.22-86.container"
  Parameters = ""

[Proxy]
  Type = "ccProxy"
  Version = "Version: 3.0.23+git.3cebe5e"
  Path = "/usr/libexec/clear-containers/cc-proxy"
  Debug = false

[Shim]
  Type = "ccShim"
  Version = "shim version: 3.0.23 (commit: 205ecf7)"
  Path = "/usr/libexec/clear-containers/cc-shim"
  Debug = false

[Agent]
  Type = "hyperstart"
  Version = "<<unknown>>"

[Host]
  Kernel = "4.9.0-6-amd64"
  Architecture = "amd64"
  VMContainerCapable = true
  [Host.Distro]
    Name = "Debian GNU/Linux"
    Version = "9"
  [Host.CPU]
    Vendor = "GenuineIntel"
    Model = "Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz"

Runtime config files

Runtime default config files

/usr/share/defaults/clear-containers/configuration.toml

Runtime config file contents

Output of "cat "/etc/clear-containers/configuration.toml"":

# XXX: WARNING: this file is auto-generated.
# XXX:
# XXX: Source file: "config/configuration.toml.in"
# XXX: Project:
# XXX:   Name: Intel® Clear Containers
# XXX:   Type: cc

[hypervisor.qemu]
path = "/usr/bin/qemu-lite-system-x86_64"
kernel = "/usr/share/clear-containers/vmlinuz.container"
image = "/usr/share/clear-containers/clear-containers.img"
machine_type = "pc"

# Optional space-separated list of options to pass to the guest kernel.
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
# trouble running pre-2.15 glibc.
#
# WARNING: - any parameter specified here will take priority over the default
# parameter value of the same name used to start the virtual machine.
# Do not set values here unless you understand the impact of doing so as you
# may stop the virtual machine from booting.
# To see the list of default parameters, enable hypervisor debug, create a
# container and look for 'default-kernel-parameters' log entries.
kernel_params = ""

# Path to the firmware.
# If you want that qemu uses the default firmware leave this option empty
firmware = ""

# Machine accelerators
# comma-separated list of machine accelerators to pass to the hypervisor.
# For example, `machine_accelerators = "nosmm,nosmbus,nosata,nopit,static-prt,nofw"`
machine_accelerators=""

# Default number of vCPUs per POD/VM:
# unspecified or 0                --> will be set to 1
# < 0                             --> will be set to the actual number of physical cores
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores
default_vcpus = 1

# Bridges can be used to hot plug devices.
# Limitations:
# * Currently only pci bridges are supported
# * Until 30 devices per bridge can be hot plugged.
# * Until 5 PCI bridges can be cold plugged per VM.
#   This limitation could be a bug in qemu or in the kernel
# Default number of bridges per POD/VM:
# unspecified or 0   --> will be set to 1
# > 1 <= 5           --> will be set to the specified number
# > 5                --> will be set to 5
default_bridges = 1

# Default memory size in MiB for POD/VM.
# If unspecified then it will be set 2048 MiB.
#default_memory = 2048

# Disable block device from being used for a container's rootfs.
# In case of a storage driver like devicemapper where a container's 
# root file system is backed by a block device, the block device is passed
# directly to the hypervisor for performance reasons. 
# This flag prevents the block device from being passed to the hypervisor, 
# 9pfs is used instead to pass the rootfs.
disable_block_device_use = false

# Block storage driver to be used for the hypervisor in case the container
# rootfs is backed by a block device. This is either virtio-scsi or 
# virtio-blk.
block_device_driver = "virtio-scsi"

# Enable pre allocation of VM RAM, default false
# Enabling this will result in lower container density
# as all of the memory will be allocated and locked
# This is useful when you want to reserve all the memory
# upfront or in the cases where you want memory latencies
# to be very predictable
# Default false
#enable_mem_prealloc = true

# Enable huge pages for VM RAM, default false
# Enabling this will result in the VM memory
# being allocated using huge pages.
# This is useful when you want to use vhost-user network
# stacks within the container. This will automatically 
# result in memory pre allocation
#enable_hugepages = true

# Enable swap of vm memory. Default false.
# The behaviour is undefined if mem_prealloc is also set to true
#enable_swap = true

# This option changes the default hypervisor and kernel parameters
# to enable debug output where available. This extra output is added
# to the proxy logs, but only when proxy debug is also enabled.
# 
# Default false
#enable_debug = true

# Disable the customizations done in the runtime when it detects
# that it is running on top a VMM. This will result in the runtime
# behaving as it would when running on bare metal.
# 
#disable_nesting_checks = true

[proxy.cc]
path = "/usr/libexec/clear-containers/cc-proxy"

# If enabled, proxy messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[shim.cc]
path = "/usr/libexec/clear-containers/cc-shim"

# If enabled, shim messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[agent.cc]
# There is no field for this section. The goal is only to be able to
# specify which type of agent the user wants to use.

[runtime]
# If enabled, the runtime will log additional debug messages to the
# system log
# (default: disabled)
#enable_debug = true
#
# Internetworking model
# Determines how the VM should be connected to the
# the container network interface
# Options:
#
#   - bridged
#     Uses a linux bridge to interconnect the container interface to
#     the VM. Works for most cases except macvlan and ipvlan.
#
#   - macvtap
#     Used when the Container network interface can be bridged using
#     macvtap.
internetworking_model="bridged"

Output of "cat "/usr/share/defaults/clear-containers/configuration.toml"":

# XXX: WARNING: this file is auto-generated.
# XXX:
# XXX: Source file: "config/configuration.toml.in"
# XXX: Project:
# XXX:   Name: Intel® Clear Containers
# XXX:   Type: cc

[hypervisor.qemu]
path = "/usr/bin/qemu-lite-system-x86_64"
kernel = "/usr/share/clear-containers/vmlinuz.container"
image = "/usr/share/clear-containers/clear-containers.img"
machine_type = "pc"

# Optional space-separated list of options to pass to the guest kernel.
# For example, use `kernel_params = "vsyscall=emulate"` if you are having
# trouble running pre-2.15 glibc.
#
# WARNING: - any parameter specified here will take priority over the default
# parameter value of the same name used to start the virtual machine.
# Do not set values here unless you understand the impact of doing so as you
# may stop the virtual machine from booting.
# To see the list of default parameters, enable hypervisor debug, create a
# container and look for 'default-kernel-parameters' log entries.
kernel_params = ""

# Path to the firmware.
# If you want that qemu uses the default firmware leave this option empty
firmware = ""

# Machine accelerators
# comma-separated list of machine accelerators to pass to the hypervisor.
# For example, `machine_accelerators = "nosmm,nosmbus,nosata,nopit,static-prt,nofw"`
machine_accelerators=""

# Default number of vCPUs per POD/VM:
# unspecified or 0                --> will be set to 1
# < 0                             --> will be set to the actual number of physical cores
# > 0 <= number of physical cores --> will be set to the specified number
# > number of physical cores      --> will be set to the actual number of physical cores
default_vcpus = 1

# Bridges can be used to hot plug devices.
# Limitations:
# * Currently only pci bridges are supported
# * Until 30 devices per bridge can be hot plugged.
# * Until 5 PCI bridges can be cold plugged per VM.
#   This limitation could be a bug in qemu or in the kernel
# Default number of bridges per POD/VM:
# unspecified or 0   --> will be set to 1
# > 1 <= 5           --> will be set to the specified number
# > 5                --> will be set to 5
default_bridges = 1

# Default memory size in MiB for POD/VM.
# If unspecified then it will be set 2048 MiB.
#default_memory = 2048

# Disable block device from being used for a container's rootfs.
# In case of a storage driver like devicemapper where a container's 
# root file system is backed by a block device, the block device is passed
# directly to the hypervisor for performance reasons. 
# This flag prevents the block device from being passed to the hypervisor, 
# 9pfs is used instead to pass the rootfs.
disable_block_device_use = false

# Block storage driver to be used for the hypervisor in case the container
# rootfs is backed by a block device. This is either virtio-scsi or 
# virtio-blk.
block_device_driver = "virtio-scsi"

# Enable pre allocation of VM RAM, default false
# Enabling this will result in lower container density
# as all of the memory will be allocated and locked
# This is useful when you want to reserve all the memory
# upfront or in the cases where you want memory latencies
# to be very predictable
# Default false
#enable_mem_prealloc = true

# Enable huge pages for VM RAM, default false
# Enabling this will result in the VM memory
# being allocated using huge pages.
# This is useful when you want to use vhost-user network
# stacks within the container. This will automatically 
# result in memory pre allocation
#enable_hugepages = true

# Enable swap of vm memory. Default false.
# The behaviour is undefined if mem_prealloc is also set to true
#enable_swap = true

# This option changes the default hypervisor and kernel parameters
# to enable debug output where available. This extra output is added
# to the proxy logs, but only when proxy debug is also enabled.
# 
# Default false
#enable_debug = true

# Disable the customizations done in the runtime when it detects
# that it is running on top a VMM. This will result in the runtime
# behaving as it would when running on bare metal.
# 
#disable_nesting_checks = true

[proxy.cc]
path = "/usr/libexec/clear-containers/cc-proxy"

# If enabled, proxy messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[shim.cc]
path = "/usr/libexec/clear-containers/cc-shim"

# If enabled, shim messages will be sent to the system log
# (default: disabled)
#enable_debug = true

[agent.cc]
# There is no field for this section. The goal is only to be able to
# specify which type of agent the user wants to use.

[runtime]
# If enabled, the runtime will log additional debug messages to the
# system log
# (default: disabled)
#enable_debug = true
#
# Internetworking model
# Determines how the VM should be connected to the
# the container network interface
# Options:
#
#   - bridged
#     Uses a linux bridge to interconnect the container interface to
#     the VM. Works for most cases except macvlan and ipvlan.
#
#   - macvtap
#     Used when the Container network interface can be bridged using
#     macvtap.
internetworking_model="bridged"

Agent

version:

unknown

Logfiles

Runtime logs

/usr/bin/cc-collect-data.sh: line 242: journalctl: command not found
No recent runtime problems found in system journal.

Proxy logs

/usr/bin/cc-collect-data.sh: line 242: journalctl: command not found
No recent proxy problems found in system journal.

Shim logs

/usr/bin/cc-collect-data.sh: line 242: journalctl: command not found
No recent shim problems found in system journal.


Container manager details

Have docker

Docker

Output of "docker version":

Client:
 Version:   17.12.0-ce
 API version:   1.35
 Go version:    go1.9.2
 Git commit:    c97c6d6
 Built: Wed Dec 27 20:11:19 2017
 OS/Arch:   linux/amd64

Server:
 Engine:
  Version:  17.12.0-ce
  API version:  1.35 (minimum version 1.12)
  Go version:   go1.9.2
  Git commit:   c97c6d6
  Built:    Wed Dec 27 20:09:54 2017
  OS/Arch:  linux/amd64
  Experimental: false

Output of "docker info":

Containers: 3
 Running: 3
 Paused: 0
 Stopped: 0
Images: 4
Server Version: 17.12.0-ce
Storage Driver: zfs
 Zpool: error while getting pool information strconv.ParseUint: parsing "": invalid syntax
 Zpool Health: not available
 Parent Dataset: zpool2/docker
 Space Used By Parent: 3466528128
 Space Available: 3851731992320
 Parent Quota: no
 Compression: lz4
Logging Driver: json-file
Cgroup Driver: cgroupfs
Plugins:
 Volume: local
 Network: bridge host macvlan null overlay
 Log: awslogs fluentd gcplogs gelf journald json-file logentries splunk syslog
Swarm: inactive
Runtimes: cc-runtime runc
Default Runtime: cc-runtime
Init Binary: docker-init
containerd version: 89623f28b87a6004d4b785663257362d1658a729
runc version: 64d2226 (expected: b2567b37d7b75eb4cf325b77297b140ea686ce8f)
init version: 949e6fa
Security Options:
 seccomp
  Profile: default
Kernel Version: 4.9.0-6-amd64
Operating System: Debian GNU/Linux 9 (stretch)
OSType: linux
Architecture: x86_64
CPUs: 40
Total Memory: 376.6GiB
Name: <redacted>
ID: <redacted>
Docker Root Dir: /var/lib/docker
Debug Mode (client): false
Debug Mode (server): true
 File Descriptors: 42
 Goroutines: 54
 System Time: 2018-04-13T19:42:44.065725616+10:00
 EventsListeners: 0
Registry: https://index.docker.io/v1/
Labels:
Experimental: false
Insecure Registries:
 127.0.0.0/8
Live Restore Enabled: false

WARNING: No swap limit support

Output of "systemctl show docker":

/usr/bin/cc-collect-data.sh: line 167: systemctl: command not found

No kubectl


Packages

Have dpkg

Output of "dpkg -l|egrep "(cc-oci-runtime|cc-proxy|cc-runtime|cc-shim|kata-proxy|kata-runtime|kata-shim|clear-containers-image|linux-container|qemu-lite|qemu-system-x86)"":

ii  cc-proxy                             3.0.23+git.3cebe5e-27                   amd64        
ii  cc-runtime                           3.0.23+git.64d2226-27                   amd64        
ii  cc-runtime-bin                       3.0.23+git.64d2226-27                   amd64        
ii  cc-runtime-config                    3.0.23+git.64d2226-27                   amd64        
ii  cc-shim                              3.0.23+git.205ecf7-27                   amd64        
ii  clear-containers-image               20640-48                                amd64        Clear containers image
ii  linux-container                      4.14.22-86                              amd64        linux kernel optimised for container-like workloads.
ii  qemu-lite                            2.7.1+git.d4a337fe91-11                 amd64        linux kernel optimised for container-like workloads.
ii  qemu-system-x86                      1:2.8+dfsg-6+deb9u3                     amd64        QEMU full system emulation binaries (x86)

Have rpm

Output of "rpm -qa|egrep "(cc-oci-runtime|cc-proxy|cc-runtime|cc-shim|kata-proxy|kata-runtime|kata-shim|clear-containers-image|linux-container|qemu-lite|qemu-system-x86)"":

zeigerpuppy commented 6 years ago

P.S. Docker has been all sorts of fun to get going on this setup. I had to downgrade from v18.03 to 17.12.0 as containers were not stopping properly. Please ignore the error in the docker info output (Zpool: error while getting pool information strconv.ParseUint: parsing "": invalid syntax). It arises because we're using a ZFS dataset within a pool rather than a dedicated pool. I don't think this has implications for the Docker ZFS implementation apart from failing on zpool info commands.

Also, we chose to use Debian Stretch without systemd on the server, so some commands that call systemd specifically may fail. The install went well with a small tweak to the cc-proxy deb installer, and I don't think this has any implications for I/O.

grahamwhaley commented 6 years ago

Hi @zeigerpuppy Good question. The 9p msize has been discussed before, as has the cache mode to some extent. Have a look at https://github.com/clearcontainers/hyperstart/pull/25 for the discussion around a PR to set msize. I think that got stuck as nobody had time to run exhaustive tests across different block-size transfers etc. to get data on whether it improved all situations, what the memory footprint overhead might be, and so on.

And then I raised a closely related item earlier this week for Kata Containers: https://github.com/kata-containers/runtime/issues/201

Right now we don't have a way in either Clear or Kata Containers to adjust/tweak/add those settings to the mounts without rebuilding either the agent (Clear) or the runtime (Kata). Yes, it would be good to have at least a developer-mode option in the toml config file to allow such things to be tweaked.

Both/either of those Issues will show you where and how you could add the extra options if you wanted to do a build and experiment.
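
For anyone who does experiment: the hypervisor side of a 9p share boils down to a pair of QEMU arguments roughly like the ones below (the id, path and mount_tag are illustrative, and the exact values the runtime generates may differ); the msize/cache options are then applied wherever the agent mounts that tag inside the guest:

-fsdev local,id=extra-9p,path=/path/on/host,security_model=none
-device virtio-9p-pci,fsdev=extra-9p,mount_tag=extra-9p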

Also, IIRC, enabling caching on 9p is something that needs careful consideration. The original design of 9p basically said 'do not cache', but I think we have experimented with this before, and as long as the constraints are understood, I think we could enable some form of caching. @chao-p @rarindam for more thoughts and input. @bergwolf and @gnawux for visibility, relevance to Kata, and any input.

amshinde commented 6 years ago

@grahamwhaley Yeah, we need a config option in our toml file for the 9p msize; that way it will at least be convenient to try out different msizes before we settle on an optimal default, without having to rebuild the runtime. I'll raise a PR for that.

zeigerpuppy commented 6 years ago

Good to hear that it'll get some consideration; options in the toml file would be great. Please let me know if I can help with testing. Also, in the meantime, is there any way to manually tweak these options in a built Clear Container?

sboeuf commented 6 years ago

@zeigerpuppy take a look at @amshinde PR here: https://github.com/kata-containers/runtime/pull/207

amshinde commented 6 years ago

@zeigerpuppy https://github.com/kata-containers/runtime/pull/207 is now merged. You can now try kata-runtime with the ability to configure the 9p msize for a container. It would be great if you could help out with the testing. @grahamwhaley Can you provide details about the various parameters we need to take into consideration for testing this?
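
For reference, the new setting goes under the [hypervisor.qemu] section of the Kata configuration.toml and looks roughly like the line below (see the merged PR for the exact key name and accepted range):

msize_9p = 524288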

zeigerpuppy commented 6 years ago

@amshinde, thanks for the details. I am a little behind in bug chasing, so it may be a little while until I can do a build. In the meantime, I found an interesting way to restore performance...

Previously I was using cc-runtime with the following file stack (all on Debian Stretch without systemd):

  1. backing file system ZFS -> docker -> cc-runtime using file mapping

Unfortunately, Docker's ZFS implementation is pretty basic and seems like they've just adapted the overlay driver. This is a real shame, as ZFS is a natural fit when zvols are used. The main problem I found with this stack was poor performance, but MongoDB containers also failed to work at all, I presume because they couldn't properly memory-map the filesystem.

Performance, as stated above, was only about 130 MB/s.

The new stack is:

  2. ZFS -> sparse ZVOL -> thin-provisioned LVM -> docker devicemapper -> cc-runtime with the virtio-blk driver

This setup looks much better: there is now proper block usage and it's sparsely provisioned throughout. I can snapshot directly on the ZVOL or at the LVM level. MongoDB works again, and I/O performance is more like 1.3 GB/s.
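
For anyone wanting to reproduce a similar stack, the rough shape of the setup is sketched below (pool names, device paths and sizes are illustrative, not the exact commands I used):

zfs create -s -V 500G zpool1/docker-thin                 # sparse zvol
pvcreate /dev/zvol/zpool1/docker-thin
vgcreate docker /dev/zvol/zpool1/docker-thin
lvcreate --type thin-pool -l 90%FREE -n thinpool docker  # thin pool for devicemapper
# then point Docker at it, e.g. in /etc/docker/daemon.json:
#   "storage-driver": "devicemapper",
#   "storage-opts": ["dm.thinpooldev=/dev/mapper/docker-thinpool"]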

Now the strange bit: I mapped an external volume with the following Docker config:

docker run -it --mount type=bind,source=/zpool1/vmdata/test,target=/test --name iozone threadx/iozone

Now, I presume this is still using a 9p mapping, but performance is great (approx. 1 GB/s read/write).
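
(One way to confirm whether the bind mount is still going through 9p is to check the mount table inside the container, e.g.:)

docker exec iozone sh -c 'grep 9p /proc/mounts'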

So, for the moment, I plan to stick with this config. However, I will try to give the Kata runtime a go once I've migrated a whole lot of VMs...

P.S. If you're using LVM in Debian Stretch, watch out for this bug, which prevents re-attaching of LVM volumes at boot by default.