containers / fuse-overlayfs

FUSE implementation for overlayfs
GNU General Public License v2.0

overlay driver is very slow in podman-in-podman builds with large COPY layer #401

Closed: Jared-Sprague closed this issue 10 months ago

Jared-Sprague commented 10 months ago

Issue Description

Hello! TL;DR: the overlay driver is unusably slow in a nested podman-in-podman build when a COPY layer has to copy a large number of files and bytes.

When using the overlay storage driver to do a nested podman build inside a running container, it takes over an hour to run a COPY layer that copies over 130k files and 1.3 GB, whereas the exact same build with the vfs driver completes in about 1.5 minutes.

Doing some analysis, it appears that fuse-overlayfs transfers a large chunk of data right at the beginning, then slows way down until it is only transferring a trickle of bytes per second. It can take a very long time to complete (> 1 hour), and sometimes the build then errors out when trying to write the last layer.

In contrast, the vfs driver transfers the bytes at a consistent rate from start to finish, so the build completes in a reasonable amount of time, around 2 minutes.

I have a reproducer image for overlay, as well as the exact same image with only the storage driver set to vfs, which works as a contrast. See the reproduction steps below.
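For reference, the two reproducer images only differ in which storage driver the inner podman uses; as a rough illustration (the image tag and build context below are placeholders), the same comparison can be made by pinning the driver explicitly on the inner build command:

    # Force a specific storage driver for the inner podman via the global
    # --storage-driver flag (tag and context path are illustrative):
    podman --storage-driver=vfs build -t localhost/website .
    podman --storage-driver=overlay build -t localhost/website .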

Steps to reproduce the issue

To make reproducing this issue easy, I have uploaded two images that contain a script called nested_build.sh, which runs the nested build and demonstrates the issue. The base images are built FROM quay.io/fedora/fedora:38 and contain a src/html/ directory holding a statically generated website with > 130k files and around 1.3 GB in size. The nested image build is based on registry.redhat.io/ubi9/httpd-24, and all it does is COPY html /var/www/html.

You can see all the base image Containerfiles in the repo here: https://github.com/Jared-Sprague/overlay_bug
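For orientation, here is a minimal sketch of the kind of nested build src/nested_build.sh performs, inferred from the description above (the real script and Containerfile in the repo may differ):

    # Rough sketch of the nested build (inferred; not the exact script contents):
    printf 'FROM registry.redhat.io/ubi9/httpd-24\nCOPY html /var/www/html\n' > /tmp/Containerfile
    # src/ holds the html/ directory with >130k files (~1.3 GB)
    podman build -t localhost/website -f /tmp/Containerfile src/
    podman images   # should list localhost/website on success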

Steps to reproduce the issue

  1. Run the working control example that uses vfs; it should take about 2 minutes to complete:
    podman run --privileged --user root --rm -it quay.io/offline/overlay-bug:vfs-base src/nested_build.sh

    The run was successful if you see the localhost/website image in the output:

    REPOSITORY                                TAG         IMAGE ID      CREATED         SIZE
    localhost/website                         latest      49a856e9d904  11 seconds ago  1.84 GB
    registry.access.redhat.com/ubi9/httpd-24  latest      42aa79cce3ea  4 days ago      377 MB
  2. Now try the buggy example, which is exactly the same as step 1 but uses overlay:
    podman run --privileged --user root --rm -it quay.io/offline/overlay-bug:overlay-base src/nested_build.sh

Describe the results you received

The vfs driver completes in ~2 minutes, whereas the overlay driver takes > 1 hour or doesn't complete at all.

Describe the results you expected

The overlay driver should be at least as fast as the vfs driver in nested builds.

podman info output

Host system podman info:

host:
  arch: amd64
  buildahVersion: 1.31.2
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 96.41
    systemPercent: 1.26
    userPercent: 2.34
  cpus: 16
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: workstation
    version: "38"
  eventLogger: journald
  freeLocks: 2048
  hostname: fedora
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.4.12-200.fc38.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 31441510400
  memTotal: 67286757376
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.fc38.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-1.fc38.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20230823.ga7e4bfb-1.fc38.x86_64
    version: |
      pasta 0^20230823.ga7e4bfb-1.fc38.x86_64
      Copyright Red Hat
      GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    path: /run/user/1000/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.fc38.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 3h 6m 2.00s (Approximately 0.12 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /home/jsprague/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/jsprague/.local/share/containers/storage
  graphRootAllocated: 496256417792
  graphRootUsed: 26049777664
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 22
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /home/jsprague/.local/share/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1691705273
  BuiltTime: Thu Aug 10 18:07:53 2023
  GitCommit: ""
  GoVersion: go1.20.7
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1

Parent container podman info

host:
  arch: amd64
  buildahVersion: 1.31.2
  cgroupControllers: []
  cgroupManager: cgroupfs
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.7-2.fc38.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.7, commit: '
  cpuUtilization:
    idlePercent: 96.43
    systemPercent: 1.25
    userPercent: 2.32
  cpus: 16
  databaseBackend: boltdb
  distribution:
    distribution: fedora
    variant: container
    version: "38"
  eventLogger: file
  freeLocks: 2048
  hostname: 7d7483d45344
  idMappings:
    gidmap: null
    uidmap: null
  kernel: 6.4.12-200.fc38.x86_64
  linkmode: dynamic
  logDriver: k8s-file
  memFree: 31392792576
  memTotal: 67286757376
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.7.0-1.fc38.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.7.0
    package: netavark-1.7.0-1.fc38.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.7.0
  ociRuntime:
    name: crun
    package: crun-1.8.7-1.fc38.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.8.7
      commit: 53a9996ce82d1ee818349bdcc64797a1fa0433c4
      rundir: /run/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20230823.ga7e4bfb-1.fc38.x86_64
    version: |
      pasta 0^20230823.ga7e4bfb-1.fc38.x86_64
      Copyright Red Hat
      GNU Affero GPL version 3 or later <https://www.gnu.org/licenses/agpl-3.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    path: /run/podman/podman.sock
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: false
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: false
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.2.1-1.fc38.x86_64
    version: |-
      slirp4netns version 1.2.1
      commit: 09e31e92fa3d2a1d3ca261adaeb012c8d75a8194
      libslirp: 4.7.0
      SLIRP_CONFIG_VERSION_MAX: 4
      libseccomp: 2.5.3
  swapFree: 8589930496
  swapTotal: 8589930496
  uptime: 3h 8m 41.00s (Approximately 0.12 days)
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
  - quay.io
store:
  configFile: /usr/share/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions:
    overlay.mountopt: nodev,metacopy=on
  graphRoot: /var/lib/containers/storage
  graphRootAllocated: 496256417792
  graphRootUsed: 26050031616
  graphStatus:
    Backing Filesystem: overlayfs
    Native Overlay Diff: "false"
    Supports d_type: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 0
  runRoot: /run/containers/storage
  transientStore: false
  volumePath: /var/lib/containers/storage/volumes
version:
  APIVersion: 4.6.1
  Built: 1691705273
  BuiltTime: Thu Aug 10 22:07:53 2023
  GitCommit: ""
  GoVersion: go1.20.7
  Os: linux
  OsArch: linux/amd64
  Version: 4.6.1


### Podman in a container

Yes

### Privileged Or Rootless

Privileged

### Upstream Latest Release

Yes

### Additional environment details

Host system is running Fedora 38 & podman v4.6.1
CPU: x86_64 Intel i7
Hard drive: M.2 NVMe

### Additional information

Reproducer repo that has the reproducer base image Containerfiles, as well as the src/nested_build.sh script:
https://github.com/Jared-Sprague/overlay_bug
flouthoc commented 10 months ago

I am not sure how advisable it is to use overlay inside overlay; usually it's not, and people see all sorts of issues running in this configuration. But it seems you have already investigated this, and the issue looks to be with fuse-overlayfs itself. @giuseppe might be able to suggest something here.

giuseppe commented 10 months ago

I'll take a look, but I'd say that is somewhat expected with fuse-overlayfs. FUSE significantly slows down I/O operations, and your image is made of many small files (it takes more than a minute on XFS+LVM+LUKS). VFS, instead, once the layer is extracted, is much faster than FUSE and even faster than native overlay.

Overlay on top of overlay doesn't work, but if you use a volume for /var/lib/containers/storage, you'd be able to use native overlay instead of fuse-overlayfs.
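For example, something along these lines, reusing the run command from the reproducer (the volume name is arbitrary):

    # Keep the nested container storage on a podman volume so the inner podman
    # is not running overlay on top of overlay (volume name is arbitrary):
    podman volume create nested-storage
    podman run --privileged --user root --rm -it \
        -v nested-storage:/var/lib/containers/storage \
        quay.io/offline/overlay-bug:overlay-base src/nested_build.sh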

Have you already tried that?

Jared-Sprague commented 10 months ago

@giuseppe Yes, I did try that, but it didn't change anything; the transfer rate remained unchanged in the child container. It is very easy to try yourself, just use the reproducer I linked in the issue.

giuseppe commented 10 months ago

opened a PR to address the performance issue in fuse-overlayfs: https://github.com/containers/fuse-overlayfs/pull/402

giuseppe commented 10 months ago

v1.13 is out

Jared-Sprague commented 8 months ago

@giuseppe Thank you so much! I can't wait to test it out!