Closed: markus2330 closed this issue 2 months ago
As 0x6178656c wrote in https://github.com/ElektraInitiative/libelektra/pull/4620#issuecomment-1295395453:
We are regularly running into "no space left" problems because of too many Docker images, so I tagged it as urgent and removed the "probably to be removed".
@mpranj any other suggestions other than the two Docker images above?
One thing I noticed about our images is that they are very big. Maybe we can look into making them smaller, that should help with the disk space problems.
> We are regularly running into "no space left" problems because of too many Docker images, so I tagged it as urgent and removed the "probably to be removed".
I think this will not save any space.
AFAIK removing unused images will do nothing for our disk space usage, as the images are not built by the pipeline; they are built only when needed. The real problem is that we are actively using many images.
> One thing I noticed about our images is that they are very big.
Would be great if we could do something about this.

Maybe we can add the docker build option `--squash` to avoid storing multiple layers of the filesystem. There are always pros and cons, but it's worth a shot.
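A `--squash` build could be invoked roughly like this (a sketch; the tag and Dockerfile path are illustrative, and `--squash` requires the Docker daemon to have experimental features enabled):

```sh
# Build the image as usual, but squash all layers into one at the end.
# Caveat: layer sharing between images is lost for squashed images.
docker build --squash \
    -t build-elektra-fedora-36 \
    -f scripts/docker/fedora/36/Dockerfile \
    scripts/docker/fedora/36
```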
> Maybe we can add the docker build option `--squash` to avoid storing multiple layers of the filesystem.
Wouldn't that mean different images can't share a layer and all images would have to be built entirely from scratch, if there is the tiniest difference?
> The images are built only when needed.
So do we actually build new images for every Jenkins run? Is there any kind of auto-cleanup?
Also, since I don't have access to the CI servers: Are we sure that the docker images are the problem? Could there be something else that is eating disk space too, e.g. log files with long retention periods, or artifacts of old builds?
> Wouldn't that mean different images can't share a layer and all images would have to be built entirely from scratch, if there is the tiniest difference?
Yes, but I'll test this now to see if there is any difference. Also, I know that is how it should work on one machine, but I have a feeling we're not reusing layers anyway.
> Also, since I don't have access to the CI servers: Are we sure that the docker images are the problem?
Yes, pretty sure it is at least the biggest problem. Most other things are cleaned up.
> So do we actually build new images for every Jenkins run? Is there any kind of auto-cleanup?
Not for every run, but when they are needed, so images are reused once they are built. They are rebuilt monthly so that the packages are updated periodically.
> Wouldn't that mean different images can't share a layer and all images would have to be built entirely from scratch, if there is the tiniest difference?
Unfortunately you're right.
I've tested the `--squash` option and for the `build-elektra-fedora-36` image the difference is only 2.16GB vs. 2.03GB.
Okay, how exactly is our Fedora 36 image over 2GB in size, when the base `fedora:36` image is <60MB (see Docker Hub)? There has to be something in there that we don't need...
Another thing we could do: Remove Java from all images except one, maybe even remove it completely from Jenkins and only test on Cirrus. The JVM should be the same everywhere.
> AFAIK removing unused images will do nothing for our disk space usage, as the images are not built by the pipeline. The images are built only when needed.
Yes, this is why I extended the scope of this issue: the idea was to suggest which of the used Docker images (probably the least important ones) to remove, or how to make them smaller.
> Another thing we could do: Remove Java from all images except one, maybe even remove it completely from Jenkins and only test on Cirrus. The JVM should be the same everywhere.
Actually, especially Java is very prone to problems in CMake detection and similar, so it is good to have these tests across several distributions.
Btw. the issue seems to be not as urgent as I thought. Used disk space is now 346G used, 1.5T available, i.e. 20% used, so the problem is simply that running `docker prune -af` once a month was not enough.

Further suggestions on what to reduce are nevertheless welcome. At some point we will need to do the cleanup.
> Also, since I don't have access to the CI servers: Are we sure that the docker images are the problem? Could there be something else that is eating disk space too, e.g. log files with long retention periods, or artifacts of old builds?
After running `docker prune -af` on a7 the disk space usage goes from 100% to less than 20%.
> the idea was to suggest which used Docker images (probably the least important ones) to remove or how to make them smaller.
I see we have 4 different Debian Bullseye images? Why? I get the `minimal` image to test without installing dependencies, but the rest are probably wasting space. The same goes for Debian Buster.
Also, if `docker image prune -af` (or even `docker system prune`) cleaned up >1TB of space, I would really be interested in what exactly was removed; e.g. `docker image ls` before and afterwards would be interesting.

Additionally, we can probably run `docker image prune` (without `-a`) much more often. It should not remove anything we need.
> Btw. the issue seems to be not as urgent as I thought. Used disk space is now 346G used, 1.5T available, i.e. 20% used, so the problem is simply that running `docker prune -af` once a month was not enough.
> Also if `docker image prune -af` (or even `docker system prune`) cleaned up >1TB of space
Seriously doubt this happened. Usually it cleans about 100-200GB.
Maybe we should run `prune -af` weekly? `prune -f` is run daily, `prune -af` is run monthly.

Note that deleting all images also means that the current ones need to be fetched from our Docker registry, which has a rather slow connection.
What machine are you talking about?
On `a7` we store the:
What might be a problem: the build agents keep the current images which they need (so far, everything is OK). When a Dockerfile is changed, a new version of this image is built and the build agents retrieve it. Now we have two versions of this image per build agent. The issue worsens when multiple PRs change images multiple times.
> I see we have 4 different Debian Bullseye images? Why?
To also test CMake exclusion of modules. Probably we should make these images build upon each other to use less space?
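Building upon each other could look roughly like this (a sketch with hypothetical tags and packages; the idea is that the full image starts `FROM` the minimal one, so the common base layers are stored only once per agent):

```dockerfile
# Full Bullseye image, sketched as building on the minimal image
# instead of directly on debian:bullseye.
FROM build-elektra-debian-bullseye-minimal

# Only the additional dependencies of the full image are installed here;
# everything from the minimal image is reused as shared layers.
RUN apt-get update \
    && apt-get -y install --no-install-recommends default-jdk \
    && rm -rf /var/lib/apt/lists/*
```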
> Maybe we should run `prune -af` weekly?
Yes, sounds like the easiest solution for now. Is there some way to only cleanup the images that weren't used for a week?
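A weekly `prune -af` alongside the existing daily `prune -f` could be a simple cron entry (a sketch; schedule and file location are assumptions):

```
# /etc/cron.d/docker-cleanup (sketch)
# Daily: remove only dangling images (cheap, nothing in use is lost).
0 3 * * *  root  docker image prune -f
# Weekly (Sunday): remove all images not used by a container.
0 4 * * 0  root  docker image prune -af
```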
> What machine are you talking about?
In https://github.com/ElektraInitiative/libelektra/issues/4637#issuecomment-1313475416 I was talking about a7 of the recent incident https://github.com/ElektraInitiative/libelektra/issues/160#issuecomment-1312652971.
> To also test CMake exclusion of modules. Probably we should make these images build upon each other to use less space?
Building the images on top of each other would definitely help.
There are probably a few other things we can do, like reducing the number of `RUN`s to reduce layers, or checking that we're not installing e.g. some GUIs or other unnecessary stuff.
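Reducing the number of `RUN`s could look like this (illustrative packages; since each `RUN` creates its own layer, combining the install and cleanup in one `RUN` also keeps the package cache out of the image):

```dockerfile
# One RUN instead of three: a single layer, and the apt lists removed
# at the end never get persisted in any layer.
RUN apt-get update \
    && apt-get -y install --no-install-recommends cmake gcc g++ \
    && rm -rf /var/lib/apt/lists/*
```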
> Is there some way to only cleanup the images that weren't used for a week?
Yes, the `--filter` argument can be used with a timestamp. See e.g. this page.
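For the weekly cleanup this might look like the following (a sketch; `docker image prune` supports an `until` filter, where 168h corresponds to one week, but note that `until` filters by image creation time, not by last use):

```sh
# Remove all unused images created more than a week ago.
docker image prune -af --filter "until=168h"
```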
So I did a small investigation of the `scripts/docker/fedora/32/Dockerfile` image. I analyzed its layers and most of the size comes from the installed packages. The whole image is 2.61GB and around 2.4GB are packages.
| Package | Size (MB) |
|---|---|
| golang-bin-1.14.15-3.fc32.x86_64 | 255.98 |
| java-11-openjdk-headless-11.0.11.0.9-0.fc32.x86_64 | 170.76 |
| java-1.8.0-openjdk-headless-1.8.0.292.b10-0.fc32.x86_64 | 117.47 |
| clang-libs-10.0.1-3.fc32.x86_64 | 92.07 |
| gcc-10.3.1-1.fc32.x86_64 | 81.71 |
| llvm-libs-10.0.1-4.fc32.x86_64 | 78.23 |
| glibc-debuginfo-2.31-6.fc32.x86_64 | 76.42 |
| mesa-dri-drivers-20.2.3-1.fc32.x86_64 | 65.74 |
| glibc-debuginfo-common-2.31-6.fc32.x86_64 | 57.20 |
| python27-2.7.18-8.fc32.x86_64 | 54.59 |
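As a quick sanity check, summing just these ten packages (a back-of-the-envelope sketch using the sizes from the table) shows they already account for roughly 1GB of the ~2.4GB of packages:

```python
# Sizes in MB of the ten largest packages in the Fedora 32 image,
# taken from the table above (version suffixes dropped for brevity).
package_sizes_mb = {
    "golang-bin": 255.98,
    "java-11-openjdk-headless": 170.76,
    "java-1.8.0-openjdk-headless": 117.47,
    "clang-libs": 92.07,
    "gcc": 81.71,
    "llvm-libs": 78.23,
    "glibc-debuginfo": 76.42,
    "mesa-dri-drivers": 65.74,
    "glibc-debuginfo-common": 57.20,
    "python27": 54.59,
}

# Sum the top ten and express them as a share of the ~2.4GB of packages.
top_ten_total = sum(package_sizes_mb.values())
share = top_ten_total / 2400.0
print(f"Top 10 packages: {top_ten_total:.2f} MB ({share:.0%} of the packages)")
```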
`dnf install --setopt=install_weak_deps=False`

`--setopt=install_weak_deps=False`: This flag disables the installation of weak dependencies, which can help reduce the number of unnecessary packages installed. Equivalent to `--no-install-recommends` in apt-get.

Adding this dnf option reduced the image size by ~15%.
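In the Fedora Dockerfile this could look like the following (a sketch; the package list is illustrative, and cleaning the dnf cache in the same `RUN` keeps it out of the layer):

```dockerfile
# Install without weak dependencies (dnf's equivalent of apt-get's
# --no-install-recommends) and drop the package cache in the same layer.
RUN dnf install -y --setopt=install_weak_deps=False \
        cmake gcc-c++ java-11-openjdk-headless \
    && dnf clean all
```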
Maybe it would also be interesting to use a container registry like ghcr.io to reduce duplicated code and build some base images that other Dockerfiles could build upon.
Thank you for the investigation. Yes, please add this option(s).
I mark this stale as it did not have any activity for one year. I'll close it in two weeks if no further activity occurs. If you want it to be alive again, ping by writing a message here or create a new issue with the remainder of this issue. Thank you for your contributions :sparkling_heart:
I closed this now because it has been inactive for more than one year. If I closed it by mistake, please do not hesitate to reopen it or create a new issue with the remainder of this issue. Thank you for your contributions :sparkling_heart: