GoogleContainerTools / kaniko

Build Container Images In Kubernetes

kaniko build using too much memory #909

Open jyipks opened 4 years ago

jyipks commented 4 years ago

I am building a rather large Docker image; the end size is ~8GB. It builds fine in DinD; however, we would like to use kaniko. The kaniko pod running the Dockerfile balloons in memory usage and gets killed by Kubernetes. How can I make kaniko work for me, or am I stuck with DinD?

Please help, thank you

tejal29 commented 4 years ago

/cc @priyawadhwa Can we provide users anything to measure the memory usage?

@jyipks Can you tell us if you have set resource limits in the kaniko pod spec? Also, please tell us your cluster specification.

priyawadhwa commented 4 years ago

@tejal29 @jyipks the only thing I can think of is upping the resource limits on the pod as well
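
For reference, a minimal sketch of a kaniko pod spec with explicit resource requests and limits; the name, image tag, context, destination, and sizes below are placeholders, not recommendations:

apiVersion: v1
kind: Pod
metadata:
  name: kaniko-build                   # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: kaniko
      image: gcr.io/kaniko-project/executor:latest            # pin a specific tag in practice
      args:
        - "--dockerfile=Dockerfile"
        - "--context=git://github.com/example/repo.git"        # placeholder build context
        - "--destination=registry.example.com/my-image:latest" # placeholder destination
      resources:
        requests:
          memory: "4Gi"                # placeholder sizes; tune to the build, not the final image size
          cpu: "1"
        limits:
          memory: "8Gi"
          cpu: "2"

One caveat: a memory limit only caps the pod; if the build genuinely needs more than the limit, the container is OOM-killed rather than slowed down.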

jyipks commented 4 years ago

I had no resource limits on the kaniko pods. This was on a 3-node cluster with 4 cores and 16GB each. From Grafana I believe the pod attempted to use more than 15GB. I was building a custom jupyter-notebook image that normally comes out to ~8GB when built via docker build.

jyipks commented 4 years ago

Does kaniko keep everything in memory as it's building the image, or does it write to a temp directory? If it uses a temp directory, can you please tell me where it is?

thanks

mamoit commented 4 years ago

This sounds like #862. @jyipks do you remember if you were using the --reproducible flag?

jyipks commented 4 years ago

No, I've never used that flag before.

rvaidya commented 3 years ago

This also happens when trying to do an npm install - I also have never used that flag before.

max107 commented 3 years ago

same problem

tarakanof commented 3 years ago

Same problem on a GitLab runner: latest Debian with latest Docker. Building a 12MB Docker image uses 15-35GB of memory.

fknop commented 3 years ago

We're facing the same issue on a GitLab CI custom runner. We're building a Docker image for Node; the build started hanging on webpack every time, and the machine ends up running out of memory and crashing. It used to work fine without any issue. Our Docker image is a little less than 300MB and our machine has 8GB of RAM.

meseta commented 3 years ago

Similar issue on GitLab CI on GKE. We're building a Python image based on the official python base image; it consumes about 12GB of RAM.

jamil-s commented 3 years ago

We're seeing similar issues with Gradle builds as well.

nichoio commented 3 years ago

Would also like to learn more about this. Kaniko doesn't have a feature equivalent to docker build --memory, does it?

suprememoocow commented 3 years ago

We're seeing similar issues too. For example, this job failed with OOM: https://gitlab.com/gitlab-com/gl-infra/tamland/-/jobs/1405946307

The job includes some stacktrace information, which may help in diagnosing the problem.

The parameters that we were using, including --snapshotMode=redo, are here: https://gitlab.com/gitlab-com/gl-infra/tamland/-/commit/0b399381d30655059ec78461640674af7562c708#587d266bb27a4dc3022bbed44dfa19849df3044c_116_125

mikesir87 commented 3 years ago

I'm having the same problem as well. In my case, it's a Java-based build and the Maven cache repo is included as an ignore path. The number of changes that should occur outside of that is fairly minimal, yet I'm easily seeing 5+ GB of RAM being used where the build before was using at most 1.2GB. We'd love to be able to use smaller instances for our builds.

trallnag commented 2 years ago

I rolled back to 1.3.0 from 1.6.0 and now it seems to work again

Phylu commented 2 years ago

This should be fixed in the 1.7.0 release as of #1722.

s3f4 commented 2 years ago

I rolled back to 1.3.0 from 1.6.0 and now it seems to work again

1.7 has a gcloud credentials problem, rolling back to 1.3.0 worked.

Exagone313 commented 2 years ago

Do you know when the tag gcr.io/kaniko-project/executor:debug (as well as :latest) gets updated? It still points to the v1.6.0 version: https://console.cloud.google.com/gcr/images/kaniko-project/GLOBAL/executor

Zachu commented 2 years ago

I was also experiencing memory issues in the last part of the image build with v1.7.0.

INFO[0380] Taking snapshot of full filesystem...        
Killed

I tried all kinds of combinations with --compressed-caching=false and removing the --reproducible flag, downgrading to v1.3.0 and stuff. I finally got the build to pass by using the --use-new-run flag.

--use-new-run

Use the experimental run implementation for detecting changes without requiring file system snapshots. In some cases, this may improve build performance by 75%.

So I guess you should put that into your toolbox while banging your head against the wall :)

Idok-viber commented 1 year ago

Also got this issue when building with v1.9.1.

INFO[0133] Taking snapshot of full filesystem...        
Killed

Reverted back to v1.3.0 and it works.

cforce commented 1 year ago

I am using 1.9.0 and it seems to eat quite a lot of memory. With or without --compressed-caching=false and --use-new-run, the same issue happens sporadically: "The node was low on resource: memory. Container build was using 5384444Ki, which exceeds its request of 0. Container helper was using 24720Ki, which exceeds its request of 0." and "The node was low on resource: memory. Container helper was using 9704Ki, which exceeds its request of 0. Container build was using 6871272Ki, which exceeds its request of 0."

7GB to build a simple image? The memory consumption is ridiculous. Why does the same build with standard Docker just work with 40x less memory requested?

gaatjeniksaan commented 1 year ago

Reiterating what I stated in https://github.com/GoogleContainerTools/kaniko/issues/2275 as well:

We're having this issue as well with 1.9.1-debug. The end size of the image should be ~9GB, but the kaniko build (on GKE) fails due to the memory limit. See the attached image to share in my agony.

tamer-hassan commented 1 year ago

Had this issue with kaniko v1.8.0-debug; also tried v1.3.0-debug, same issue: the pod was killed or evicted due to memory pressure on the (previously idle) node. This was the case when building an image nearly 2.5GB large, with the --cache=true flag.

Solution for me was to use v1.9.2-debug with the following options: --cache=true --compressed-caching=false --use-new-run --cleanup

Further advice (from research into other previous issues): DO NOT use the flags --single-snapshot or --cache-copy-layers.
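
For the GitLab CI users in this thread, a hedged sketch of how that flag combination could be wired into a job; the tag, job name, and destination follow GitLab's usual kaniko setup and are assumptions, not a verified fix:

build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:v1.9.2-debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
      --cache=true
      --compressed-caching=false
      --use-new-run
      --cleanup

As far as I understand, --compressed-caching=false trades larger cached layers (and some build time) for a lower peak memory footprint while layers are cached.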

codezart commented 1 year ago

I've got the same issue. In my case, I'm using a git context, and the clone itself takes 10Gi+ and gets killed before the build even starts on the latest versions. I tried a node with more than 16Gi and it worked 1 out of 3 times.

cforce commented 1 year ago

Kaniko feels dead; I propose switching to Podman.

jonaskello commented 1 year ago

We have the same problem; we get this in GitLab CI:

INFO[0172] Taking snapshot of full filesystem...        
Killed
Cleaning up project directory and file based variables
ERROR: Job failed: command terminated with exit code 137

starkmatt commented 1 year ago

Solution for me was to use v1.9.2-debug with the following options: --cache=true --compressed-caching=false --use-new-run --cleanup

This worked for me, thank you very much.

FYI for anyone else running into this.

zzzinho commented 1 year ago

I have the same problem in v1.12.1-debug

INFO[0206] Taking snapshot of full filesystem...        
Killed

ricardojdsilva87 commented 1 year ago

Hello everyone, just to give my input, here are some CPU/RAM metrics with different kaniko versions (screenshots attached).

Just to clarify, the container where the build runs uses GitHub Actions hosted runners with 2 cores and 4GB RAM.

Picture 1 - kaniko 1.9.2-debug with cache enabled --> push failed with "Killed"

Picture 2 - kaniko 1.9.2-debug with cache enabled and --compressed-caching=false --use-new-run --cleanup --> push failed with "Killed"

Picture 3 - kaniko 1.12.1-debug with cache enabled and --compressed-caching=false --use-new-run --cleanup --> push failed with "Killed"

Picture 4 - kaniko 1.3.0-debug with cache enabled (the --compressed-caching flag is not supported in this version) --> push WORKS

The resulting image is around 500MB, and the container uses around 1 core and less than its memory limit (4GB). The build works if we increase the memory limit to 16GB, which is overkill and a waste of resources. The jobs that are killed are in fact using almost half the memory (~2GB) of the job that was successful (3GB).

I would say that something broke in kaniko after version 1.3.0; even with all the flags set, the builds do not work, and the reported memory usage is way less than with v1.3.0. (Update: the builds started to fail from version v1.9.1.)

Thanks for your help

UPDATE: I also tested other, older kaniko versions, all with cache enabled (screenshots attached):

with kaniko 1.5.2-debug

with kaniko 1.6.0-debug

with kaniko 1.8.1-debug

with kaniko 1.9.0-debug

Starting with kaniko v1.9.1, the builds started to fail.

droslean commented 1 year ago

Same here. My build process takes around 1.5-1.8GB of memory, but when I run the same Dockerfile via kaniko it needs 5GB, which is absurd!

Is there any solution here?

cforce commented 1 year ago

I encourage using Podman.

droslean commented 1 year ago

@aaron-prindle Any ideas?

timwsuqld commented 1 year ago

I can confirm that 1.3.0 works for us (with --force, as we have v2 cgroups), while 1.14.0 fails. I've not tested every version in between. The final image size is 980MB, and the build machine has 4GB of RAM.

ensc commented 3 months ago

Same here with kaniko version v1.23.0.

Snapshotting itself works, but when sending the results to the registry (GitLab), the executor gets killed:

kernel: Out of memory: Killed process 9503 (executor) total-vm:53191912kB, anon-rss:31445632kB, file-rss:128kB, shmem-rss:0kB, UID:0 pgtables:63580kB oom_score_adj:0

Results on the registry are around 6GB; the extracted filesystem takes around 14GB. The last words are:

$ . /opt/sdk/environment-setup-cortexa9t2hf-neon-oe-linux-gnueabi
INFO[0342] Taking snapshot of full filesystem...        
INFO[0561] USER build-user:build-user                   
INFO[0561] Cmd: USER                                    

RSS immediately after this output is around 700MB and then increases quickly.

ajbeach2 commented 1 month ago

I am running into this issue as well. kaniko is unusable for GCP Cloud Build in my use case due to OOM. A 5-year-old issue for this is crazy.

For context, I used E2_HIGHCPU_32, which has 32GB of memory, and STILL get OOM. Granted, my image is large (12GB), but I currently don't have control over the image size.
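
In case it helps, a sketch of a cloudbuild.yaml step that applies the mitigation flags mentioned earlier in this thread; the image name and executor tag are placeholders, and there is no guarantee it is enough for a 12GB image:

steps:
  - name: 'gcr.io/kaniko-project/executor:v1.9.2-debug'
    args:
      - '--destination=gcr.io/$PROJECT_ID/my-image:latest'   # placeholder image name
      - '--cache=true'
      - '--compressed-caching=false'
      - '--use-new-run'
      - '--cleanup'
options:
  machineType: 'E2_HIGHCPU_32'

If I remember correctly, kaniko's default --context is /workspace, which is where Cloud Build checks out the source, so no explicit context flag is needed here.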

apinter commented 2 weeks ago

Although I don't hit an OOM, I do notice an increase in memory usage when building multiple images in the same pod. For context, I have a repo with 18 images that sometimes change at the same time. They are built with a for loop, one image after the other. The more I build, the more memory the pod uses. If I spawn a new pod, I don't have access to the cache anymore, so the previous job's content is gone and the build fails. Is there a way to clean up the pod after a build? And yes, I'm using the --cleanup flag.
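
One untested suggestion for the "new pod loses the cache" part: point --cache-repo at a registry so the layer cache lives outside the pod, which would let each of the 18 images build in its own short-lived pod without losing cached layers. The paths and repository names below are placeholders:

# hypothetical executor args for one image; run one pod per image instead of a for loop in one pod
args:
  - "--context=dir:///workspace/images/my-image"             # placeholder path to this image's context
  - "--destination=registry.example.com/my-image:latest"     # placeholder destination
  - "--cache=true"
  - "--cache-repo=registry.example.com/kaniko/cache"         # layer cache pushed to a registry, shared across pods
  - "--cleanup"

Since --cleanup is documented as cleaning the filesystem at the end of the build, it would not necessarily bring a long-lived pod's memory back down, so isolating each image in its own pod may be the more reliable way to keep usage flat.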