High memory usage when pulling large images with pre-existing layers.

ver4a commented 1 week ago

Issue Description

podman version:

Client:       Podman Engine
Version:      5.2.5
API Version:  5.2.5
Go Version:   go1.23.2
Built:        Fri Oct 18 02:00:00 2024
OS/Arch:      linux/amd64

When pulling a large image, podman allocates memory seemingly equal to image size. For a 13G quay.io/fedora-ostree-desktops/kinoite:41 podman allocated ~13G of memory.

Steps to reproduce the issue

Steps to reproduce the issue (with rootless podman)

podman pull quay.io/fedora-ostree-desktops/kinoite:41

I can't reproduce this with rootful podman, that only floats around 100M.

Describe the results you received

Memory usage scaling with image size.

Describe the results you expected

Memory usage being more or less constant.

podman info output

host:
  arch: amd64
  buildahVersion: 1.37.5
  cgroupControllers:
  - cpu
  - io
  - memory
  - pids
  cgroupManager: systemd
  cgroupVersion: v2
  conmon:
    package: conmon-2.1.12-3.fc41.x86_64
    path: /usr/bin/conmon
    version: 'conmon version 2.1.12, commit: '
  cpuUtilization:
    idlePercent: 85.46
    systemPercent: 7.71
    userPercent: 6.83
  cpus: 4
  databaseBackend: sqlite
  distribution:
    distribution: aureolin
    variant: kde
    version: "41"
  eventLogger: journald
  freeLocks: 2048
  hostname: fedora
  idMappings:
    gidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
    uidmap:
    - container_id: 0
      host_id: 1000
      size: 1
    - container_id: 1
      host_id: 524288
      size: 65536
  kernel: 6.11.6-300.fc41.x86_64
  linkmode: dynamic
  logDriver: journald
  memFree: 7408332800
  memTotal: 16636030976
  networkBackend: netavark
  networkBackendInfo:
    backend: netavark
    dns:
      package: aardvark-dns-1.13.0-1.fc41.x86_64
      path: /usr/libexec/podman/aardvark-dns
      version: aardvark-dns 1.13.0
    package: netavark-1.13.0-1.fc41.x86_64
    path: /usr/libexec/podman/netavark
    version: netavark 1.13.0
  ociRuntime:
    name: crun
    package: crun-1.18.1-1.fc41.x86_64
    path: /usr/bin/crun
    version: |-
      crun version 1.18.1
      commit: c41f034fdbb9742c395085fc98459c94ad1f9aae
      rundir: /run/user/1000/crun
      spec: 1.0.0
      +SYSTEMD +SELINUX +APPARMOR +CAP +SECCOMP +EBPF +CRIU +LIBKRUN +WASM:wasmedge +YAJL
  os: linux
  pasta:
    executable: /usr/bin/pasta
    package: passt-0^20241030.gee7d0b6-1.fc41.x86_64
    version: |
      pasta 0^20241030.gee7d0b6-1.fc41.x86_64
      Copyright Red Hat
      GNU General Public License, version 2 or later
        <https://www.gnu.org/licenses/old-licenses/gpl-2.0.html>
      This is free software: you are free to change and redistribute it.
      There is NO WARRANTY, to the extent permitted by law.
  remoteSocket:
    exists: false
    path: /run/user/1000/podman/podman.sock
  rootlessNetworkCmd: pasta
  security:
    apparmorEnabled: false
    capabilities: CAP_CHOWN,CAP_DAC_OVERRIDE,CAP_FOWNER,CAP_FSETID,CAP_KILL,CAP_NET_BIND_SERVICE,CAP_SETFCAP,CAP_SETGID,CAP_SETPCAP,CAP_SETUID,CAP_SYS_CHROOT
    rootless: true
    seccompEnabled: true
    seccompProfilePath: /usr/share/containers/seccomp.json
    selinuxEnabled: true
  serviceIsRemote: false
  slirp4netns:
    executable: /usr/bin/slirp4netns
    package: slirp4netns-1.3.1-1.fc41.x86_64
    version: |-
      slirp4netns version 1.3.1
      commit: e5e368c4f5db6ae75c2fce786e31eef9da6bf236
      libslirp: 4.8.0
      SLIRP_CONFIG_VERSION_MAX: 5
      libseccomp: 2.5.5
  swapFree: 31947223040
  swapTotal: 33271312384
  uptime: 1h 17m 33.00s (Approximately 0.04 days)
  variant: ""
plugins:
  authorization: null
  log:
  - k8s-file
  - none
  - passthrough
  - journald
  network:
  - bridge
  - macvlan
  - ipvlan
  volume:
  - local
registries:
  search:
  - registry.fedoraproject.org
  - registry.access.redhat.com
  - docker.io
store:
  configFile: /var/home/ver4a/.config/containers/storage.conf
  containerStore:
    number: 0
    paused: 0
    running: 0
    stopped: 0
  graphDriverName: overlay
  graphOptions: {}
  graphRoot: /home/simon/.local/share/containers/storage
  graphRootAllocated: 498387124224
  graphRootUsed: 195880693760
  graphStatus:
    Backing Filesystem: btrfs
    Native Overlay Diff: "true"
    Supports d_type: "true"
    Supports shifting: "false"
    Supports volatile: "true"
    Using metacopy: "false"
  imageCopyTmpDir: /var/tmp
  imageStore:
    number: 19
  runRoot: /run/user/1000/containers
  transientStore: false
  volumePath: /var/home/ver4a/.local/share/containers/storage/volumes
version:
  APIVersion: 5.2.5
  Built: 1729209600
  BuiltTime: Fri Oct 18 02:00:00 2024
  GitCommit: ""
  GoVersion: go1.23.2
  Os: linux
  OsArch: linux/amd64
  Version: 5.2.5

Podman in a container

No

Privileged Or Rootless

Rootless

Upstream Latest Release

Yes

Additional environment details

No response

Additional information

I'm able to reproduce this on Fedora Kinoite 41 and FCOS (stable) 40.20241019.3.0

Luap99 commented 1 week ago

Do you have TMPDIR set and does it point to a tmpfs file system? Downloaded image content will be written to the TMPDIR /var/tmp by default and removed once the pull is complete AFAIK

ver4a commented 1 week ago

I just checked and I don't set TMPDIR on any of the machines. Also the memory used is allocated for the podman process itself, I've measured it using cgroups through a systemd.scope unit, tmpfs would show as part of the page cache if I recall correctly.

ver4a commented 1 week ago

The memory starts rising only after the download completes at stage "Writing manifest to image destination" and grows progressively until it hits roughly the image size (i think), then it finishes.

Here's systemctl status of the scope (I have only 16G of memory, so it swapped):

Memory: 3G (peak: 13.1G swap: 4K swap peak: 3.1G)

I ran it like this:

systemd-run --user --collect --scope sh -c "podman pull quay.io/fedora-ostree-desktops/kinoite:41 ; sleep infinity"

Luap99 commented 1 week ago

There is https://github.com/containers/storage/issues/2055 for zstd compressed images but this image does seem to use gzip.

There is a hidden --memory-profile somefile option that could be used to capture a memory profile that we could look at to see where it is allocating

@giuseppe @mtrmac Ideas?

mtrmac commented 1 week ago

Yes, I’d like to see a profile. If we are doing anything stupid, that should very clearly show up in there at 13 GB.

(The description “page cache” doesn’t immediately tell me whether this is private memory required by Podman, or just that we have written that many files and they will eventually be written and then the page cache can be freed. I’m sure that’s trivial knowledge that can be looked up.)

ver4a commented 1 week ago

I probably should have split that into it's own paragraph. What I meant to say is that it's not tmpfs, since tmpfs isn't memory allocated to any single process, but rather is part of the page cache, which is separate from process memory.

I have it profiled (got to only 6G), where should I paste it?

Luap99 commented 1 week ago

you can upload the file here in a github comment, either drag and drop or click below the text box

ver4a commented 1 week ago

pull-profile-1.txt (this got to the full 13G, I reran the profiling with a different image before, so this is with pulling quay.io/fedora-ostree-desktops/kinoite:41, for clarity)

ver4a commented 1 week ago

So, I've figured out how to reproduce it from scratch and have a rough theory of how it happens.

If you just download a large image like this from scratch, you're not going to see the 13G memory usage, you need to already have an image stored locally that shares a significant amount of blobs with the pulled image.

(complete speculation) If I had to hazard a guess, it's got something to do with the deduplication of blobs, that they get loaded into memory for checksumming or something and then stick around a bit too long for some reason.

Here's how I reproduce it now (the first image is FROM the kinoite one, so it already contains all its layers):

podman pull git.uncontrol.me/ver4a/onc-kde:main
podman pull quay.io/fedora-ostree-desktops/kinoite:41

mtrmac commented 1 week ago

The first one is 66 zstd:chunked layers… there might be something to the deduplication theory?

The second one is 65 gzip layers. Is that really that, or some mirrored / converted version?

Any non-default c/storage options? In particular, is enable_partial_images, and convert_images, set?

ver4a commented 1 week ago

The second is literally quay.io/fedora-ostree-desktops/kinoite:41, I run the commands exactly as I wrote them.

I haven't changed anything about enable_partial_images or convert_images (or any other storage related options), if it's not default on Fedora, it's unset. (I don't even know how to list these particular options)

EDIT: I think I misunderstood you, the first one is "FROM quay.io/fedora-ostree-desktops/kinoite:41", but it is rebuilt with zstd:chunked and some other stuff. (base image + the 1 squashed layer) Dockerfile: https://git.uncontrol.me/ver4a/ostree-native-containers/src/branch/main/Dockerfile.kde Build: https://git.uncontrol.me/ver4a/ostree-native-containers/src/branch/main/.forgejo/workflows/build-image.yaml podman build . -f Dockerfile.kde --no-cache --pull=always --squash podman push --compression-format=zstd:chunked --compression-level=1

mtrmac commented 1 week ago

Note to self: In the profile,

    6.26GB 34.09% 34.09%     6.26GB 34.09%  os.ReadFile
    6.26GB 34.08% 68.17%     6.26GB 34.08%  github.com/containers/storage.copyImageBigDataOptionSlice
    1.94GB 10.58% 78.76%     1.94GB 10.58%  io.(*multiReader).WriteTo
    0.60GB  3.26% 82.01%     0.60GB  3.26%  encoding/json.(*decodeState).literalStore

That does look rather unexpected. Allocating 6 GB for Image.BigData would certainly explain a lot, but why? storageImageDestination.CommitWithOptions misconstructing dataBlobs??!

mtrmac commented 1 week ago

@ver4a Thanks!

Unconfirmed, just by reading the code:

I think you are right, this does happen when a whole layer is reused from a previously-pulled image (but reused in a way which requires making a copy); there must be a consistent reproducer, but it’s not too likely to happen just on common pulls — but with some combination of pushes, compression format changes, and and re-pulls, I can certainly imagine a situation where this happens very frequently.

In particular, the hypothesis is that since https://github.com/containers/image/commit/5567453b6bff6047669a15f37d7e93c94031e915 we have started pointing at temporary uncompressed versions of the layer data in s.filenames, but the code to compute dataBlobs doesn’t expect to have any “extra” items there, and puts them into ImageBigData; and that happens to be implemented by reading the data all into memory and writing it to a file again.

And it’s not just the memory usage, we actually unnecessarily store that 6 GB on disk (until the image is deleted).

We should just drastically simplify this, and instead of recording all unaccounted-for blobs, special-case the one blob where that can happen, the config.

ver4a commented 1 week ago

Correct me if I've missed something, but I actually think the reproducer seems pretty simple and the case when this happens also seems very easy to hit (to a smaller degree though, since more layers are different). It also doesn't seem to require any special combination of actions, although I've admittedly not tried running it now on a completely fresh system (I did remove all containers and images from both podman and buildah), I'll spin up a new CoreOS VM in a bit to verify further.

Take for example this which I've just tried:

podman pull quay.io/fedora-ostree-desktops/kinoite:41.20241111.0.ceab3e17
podman pull quay.io/fedora-ostree-desktops/kinoite:41.20241112.0.c88f8a1b

This got to 6G of memory usage.

These two are basically the same image, but they were built 1 day apart, so some layers have changed. The case here is a simple update. Now yes, most users don't upgrade daily (I do), but I hit it because I rebuild my images every day and so I pull a new upstream image to replace an ever so slightly out of date one.

Update: I've just reproduced it on a fresh CoreOS install with no customizations, just by pulling these two images in this order. Update2: I verified it with rootful podman, so I removed the extraneous bit about rootless podman from title.

ver4a commented 1 week ago

I ran the test on multiple versions of FCOS:

40.20240416.3.1 (podman 5.0.1): reproduces 39.20240407.3.0 (podman 4.9.4): doesn't reproduce

I've tried several older versions and those also don't have the issue, so it's somewhere between 4.9.4 and 5.0.1

Update: Now I see it, podman 5.0.0 bumped containers/image to v5.30.0, which was the first tag to contain that commit, I guess you were spot-on.

mtrmac commented 1 week ago

podman pull quay.io/fedora-ostree-desktops/kinoite:41.20241111.0.ceab3e17

podman pull quay.io/fedora-ostree-desktops/kinoite:41.20241112.0.c88f8a1b

The bug depends on using the same content, but not exactly in the same “layer position”, e.g. if something was a layer 5 previously and now it is a layer 6 — or if it is still a layer 5, but a parent layer 4 has changed.

… and, now that I think about it, yes, the “parent layer has changed” situation can happen very often.

Update: I've just reproduced it on a fresh CoreOS install with no customizations, just by pulling these two images in this order.

Perfect, that narrows it down, thank you.

mtrmac commented 4 days ago

https://github.com/containers/image/pull/2636 might be a fix, but I didn’t test that yet.

mtrmac commented 2 days ago

Confirming the guess at the cause, and that https://github.com/containers/image/pull/2636 fixes this.

Given the reproducer above, before|after of the second pull:

User time (seconds): 112.18               | User time (seconds): 113.42
System time (seconds): 147.68                 | System time (seconds): 112.29
Percent of CPU this job got: 64%              | Percent of CPU this job got: 70%
Elapsed (wall clock) time (h:mm:ss or m:ss): 6:42.42  | Elapsed (wall clock) time (h:mm:ss or m:ss): 5:22.25
Maximum resident set size (kbytes): 2966040       | Maximum resident set size (kbytes): 150000
Major (requiring I/O) page faults: 1436487        | Major (requiring I/O) page faults: 0
Minor (reclaiming a frame) page faults: 2003489       | Minor (reclaiming a frame) page faults: 404133
Voluntary context switches: 4004946           | Voluntary context switches: 4073314
Involuntary context switches: 2753177             | Involuntary context switches: 2809797
File system inputs: 22055440                  | File system inputs: 6103488
File system outputs: 28445336                 | File system outputs: 22406208

containers / podman