ver4a opened this issue 1 week ago
Do you have TMPDIR set, and does it point to a tmpfs file system? Downloaded image content will be written to TMPDIR (/var/tmp by default) and removed once the pull is complete, AFAIK.
I just checked and I don't set TMPDIR on any of the machines. Also, the memory used is allocated by the podman process itself; I measured it using cgroups through a systemd scope unit. tmpfs would show up as part of the page cache, if I recall correctly.
The memory starts rising only after the download completes, at the "Writing manifest to image destination" stage, and grows progressively until it hits roughly the image size (I think); then it finishes.
Here's `systemctl status` of the scope (I have only 16G of memory, so it swapped):
```
Memory: 3G (peak: 13.1G swap: 4K swap peak: 3.1G)
```
I ran it like this:
```
systemd-run --user --collect --scope sh -c "podman pull quay.io/fedora-ostree-desktops/kinoite:41 ; sleep infinity"
```
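For reference, a variant of that command with a named scope makes the peak easy to query afterwards; this is just a sketch, assuming cgroup v2 and a systemd recent enough to expose the `MemoryPeak` property (the unit name is arbitrary):
```sh
# Same pull, but in a scope with a known name
systemd-run --user --collect --scope --unit=podman-pull-test \
  sh -c "podman pull quay.io/fedora-ostree-desktops/kinoite:41 ; sleep infinity"

# In another terminal, while the scope is still running:
systemctl --user show podman-pull-test.scope -p MemoryCurrent -p MemoryPeak
```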
There is https://github.com/containers/storage/issues/2055 for zstd compressed images but this image does seem to use gzip.
There is a hidden `--memory-profile somefile` option that could be used to capture a memory profile that we could look at to see where it is allocating.
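For reference, capturing and reading such a profile could look roughly like this; the flag placement and output path are illustrative, and the resulting file is a standard Go pprof profile, so `go tool pprof` can inspect it:
```sh
# Assumes --memory-profile is accepted as a global podman option
podman --memory-profile /tmp/podman-pull.pprof pull quay.io/fedora-ostree-desktops/kinoite:41

# Show the top allocation sites recorded in the profile
go tool pprof -top /tmp/podman-pull.pprof
```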
@giuseppe @mtrmac Ideas?
Yes, I’d like to see a profile. If we are doing anything stupid, that should very clearly show up in there at 13 GB.
(The description “page cache” doesn’t immediately tell me whether this is private memory required by Podman, or just that we have written that many files and they will eventually be written and then the page cache can be freed. I’m sure that’s trivial knowledge that can be looked up.)
I probably should have split that into its own paragraph. What I meant to say is that it's not tmpfs, since tmpfs isn't memory allocated to any single process, but rather is part of the page cache, which is separate from process memory.
I have it profiled (got to only 6G), where should I paste it?
You can upload the file here in a GitHub comment, either drag and drop or click below the text box.
pull-profile-1.txt (this one got to the full 13G; I reran the profiling with a different image before, so for clarity, this is from pulling quay.io/fedora-ostree-desktops/kinoite:41)
So, I've figured out how to reproduce it from scratch and have a rough theory of how it happens.
If you just download a large image like this from scratch, you're not going to see the 13G memory usage; you need to already have an image stored locally that shares a significant number of blobs with the pulled image.
(Complete speculation) If I had to hazard a guess, it's got something to do with the deduplication of blobs: they get loaded into memory for checksumming or something and then stick around a bit too long for some reason.
Here's how I reproduce it now (the first image is FROM the kinoite one, so it already contains all its layers):
The first one is 66 zstd:chunked layers… there might be something to the deduplication theory?
The second one is 65 gzip layers. Is that really that image, or some mirrored / converted version?
Any non-default c/storage options? In particular, are `enable_partial_images` and `convert_images` set?
The second is literally `quay.io/fedora-ostree-desktops/kinoite:41`; I run the commands exactly as I wrote them.
I haven't changed anything about `enable_partial_images` or `convert_images` (or any other storage-related options); if it's not the default on Fedora, it's unset. (I don't even know how to list these particular options.)
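For reference, these are `pull_options` keys in containers-storage.conf; a rough sketch of where they would be set (the exact defaults depend on the distribution and version):
```toml
# /usr/share/containers/storage.conf, overridden by /etc/containers/storage.conf
# or, for rootless users, ~/.config/containers/storage.conf
[storage.options]
pull_options = {enable_partial_images = "true", use_hard_links = "false", ostree_repos = ""}
# convert_images would be another key inside pull_options; it is unset by default.
```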
EDIT: I think I misunderstood you; the first one is "FROM quay.io/fedora-ostree-desktops/kinoite:41", but it is rebuilt with zstd:chunked and some other stuff (base image + the 1 squashed layer).
Dockerfile: https://git.uncontrol.me/ver4a/ostree-native-containers/src/branch/main/Dockerfile.kde
Build: https://git.uncontrol.me/ver4a/ostree-native-containers/src/branch/main/.forgejo/workflows/build-image.yaml
```
podman build . -f Dockerfile.kde --no-cache --pull=always --squash
podman push --compression-format=zstd:chunked --compression-level=1
```
Note to self: In the profile:
```
6.26GB 34.09% 34.09% 6.26GB 34.09% os.ReadFile
6.26GB 34.08% 68.17% 6.26GB 34.08% github.com/containers/storage.copyImageBigDataOptionSlice
1.94GB 10.58% 78.76% 1.94GB 10.58% io.(*multiReader).WriteTo
0.60GB 3.26% 82.01% 0.60GB 3.26% encoding/json.(*decodeState).literalStore
```
That does look rather unexpected. Allocating 6 GB for `Image.BigData` would certainly explain a lot, but why? `storageImageDestination.CommitWithOptions` misconstructing `dataBlobs`??!
@ver4a Thanks!
Unconfirmed, just by reading the code:
I think you are right: this does happen when a whole layer is reused from a previously-pulled image (but reused in a way which requires making a copy). There must be a consistent reproducer, though it's not too likely to happen just on common pulls; but with some combination of pushes, compression format changes, and re-pulls, I can certainly imagine a situation where this happens very frequently.
In particular, the hypothesis is that since https://github.com/containers/image/commit/5567453b6bff6047669a15f37d7e93c94031e915 we have started pointing at temporary uncompressed versions of the layer data in `s.filenames`, but the code that computes `dataBlobs` doesn't expect to have any "extra" items there and puts them into `ImageBigData`; and that happens to be implemented by reading the data all into memory and writing it to a file again.
And it's not just the memory usage: we actually unnecessarily store that 6 GB on disk (until the image is deleted).
We should just drastically simplify this: instead of recording all unaccounted-for blobs, special-case the one blob where that can happen, the config.
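To make the hypothesis concrete, here is a rough paraphrase of the suspected shape of that code path; this is not the actual containers/image implementation, just an illustration using the identifiers mentioned above (`s.filenames`, `dataBlobs`, `ImageBigData`), with illustrative types:
```go
package sketch

import (
	"os"

	"github.com/containers/storage"
	digest "github.com/opencontainers/go-digest"
)

// collectDataBlobs is an illustrative paraphrase of the suspected logic in
// CommitWithOptions: every blob recorded in filenames that is not one of the
// image's layers is treated as image "big data".
//
// filenames maps blob digests to files written during the pull; per the
// hypothesis above, it now also contains temporary uncompressed copies of
// reused layers, which is what blows up memory (and disk) usage.
func collectDataBlobs(
	filenames map[digest.Digest]string,
	layerBlobs []digest.Digest,
) ([]storage.ImageBigDataOption, error) {
	// Start from everything written to a file, then subtract the layers;
	// whatever is left is assumed to be "extra" data such as the config.
	dataBlobs := map[digest.Digest]struct{}{}
	for blob := range filenames {
		dataBlobs[blob] = struct{}{}
	}
	for _, layer := range layerBlobs {
		delete(dataBlobs, layer)
	}

	var bigData []storage.ImageBigDataOption
	for blob := range dataBlobs {
		// The whole file is read into memory; for a reused multi-GB layer copy
		// this matches the os.ReadFile / copyImageBigDataOptionSlice entries in
		// the profile, and the bytes are then stored again as ImageBigData.
		data, err := os.ReadFile(filenames[blob])
		if err != nil {
			return nil, err
		}
		bigData = append(bigData, storage.ImageBigDataOption{
			Key:    blob.String(),
			Data:   data,
			Digest: blob,
		})
	}
	return bigData, nil
}
```
The simplification suggested above would drop the subtraction entirely and handle only the one blob that can legitimately end up here, the config, which is small.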
Correct me if I've missed something, but I actually think the reproducer is pretty simple, and the case where this happens also seems very easy to hit (to a smaller degree, though, since more layers are different). It also doesn't seem to require any special combination of actions, although admittedly I haven't tried running it on a completely fresh system (I did remove all containers and images from both podman and buildah). I'll spin up a new CoreOS VM in a bit to verify further.
Take for example this, which I've just tried:
This got to 6G of memory usage.
These two are basically the same image, but they were built 1 day apart, so some layers have changed. The case here is a simple update. Now, yes, most users don't upgrade daily (I do), but I hit it because I rebuild my images every day and so pull a new upstream image to replace an ever so slightly out-of-date one.
Update: I've just reproduced it on a fresh CoreOS install with no customizations, just by pulling these two images in this order.
Update 2: I verified it with rootful podman, so I removed the extraneous bit about rootless podman from the title.
I ran the test on multiple versions of FCOS:
- 40.20240416.3.1 (podman 5.0.1): reproduces
- 39.20240407.3.0 (podman 4.9.4): doesn't reproduce
I've tried several older versions and those also don't have the issue, so it's somewhere between 4.9.4 and 5.0.1
Update: Now I see it, podman 5.0.0 bumped containers/image to v5.30.0, which was the first tag to contain that commit; I guess you were spot-on.
- podman pull quay.io/fedora-ostree-desktops/kinoite:41.20241111.0.ceab3e17
- podman pull quay.io/fedora-ostree-desktops/kinoite:41.20241112.0.c88f8a1b
The bug depends on using the same content, but not exactly in the same "layer position", e.g. if something was layer 5 previously and now it is layer 6, or if it is still layer 5 but parent layer 4 has changed.
… and, now that I think about it, yes, the “parent layer has changed” situation can happen very often.
> Update: I've just reproduced it on a fresh CoreOS install with no customizations, just by pulling these two images in this order.
Perfect, that narrows it down, thank you.
https://github.com/containers/image/pull/2636 might be a fix, but I didn’t test that yet.
Confirming the guess at the cause, and that https://github.com/containers/image/pull/2636 fixes this.
Given the reproducer above, here is a before | after comparison of the second pull:
| Metric | Before | After |
| --- | --- | --- |
| User time (seconds) | 112.18 | 113.42 |
| System time (seconds) | 147.68 | 112.29 |
| Percent of CPU this job got | 64% | 70% |
| Elapsed (wall clock) time (h:mm:ss or m:ss) | 6:42.42 | 5:22.25 |
| Maximum resident set size (kbytes) | 2966040 | 150000 |
| Major (requiring I/O) page faults | 1436487 | 0 |
| Minor (reclaiming a frame) page faults | 2003489 | 404133 |
| Voluntary context switches | 4004946 | 4073314 |
| Involuntary context switches | 2753177 | 2809797 |
| File system inputs | 22055440 | 6103488 |
| File system outputs | 28445336 | 22406208 |
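(These fields match GNU time's verbose output; assuming that is how the runs were measured, a comparable measurement can be captured with something like `/usr/bin/time -v podman pull <image>`.)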
Issue Description
podman version:
When pulling a large image, podman allocates memory seemingly equal to the image size. For the ~13G quay.io/fedora-ostree-desktops/kinoite:41 image, podman allocated ~13G of memory.
Steps to reproduce the issue
Steps to reproduce the issue (with rootless podman)
I can't reproduce this with rootful podman; that only floats around 100M.
Describe the results you received
Memory usage scaling with image size.
Describe the results you expected
Memory usage being more or less constant.
podman info output
Podman in a container
No
Privileged Or Rootless
Rootless
Upstream Latest Release
Yes
Additional environment details
No response
Additional information
I'm able to reproduce this on Fedora Kinoite 41 and FCOS (stable) 40.20241019.3.0