GoogleContainerTools / kaniko

Build Container Images In Kubernetes

Image build process Freezes on `Taking snapshot of full filesystem...` #1333

Open abhi1git opened 4 years ago

abhi1git commented 4 years ago

Actual behavior While building an image using gcr.io/kaniko-project/executor:debug in a GitLab CI runner hosted on Kubernetes (deployed with the Helm chart), the image build process freezes on Taking snapshot of full filesystem... until the runner times out (1 hr). This behaviour is intermittent, as the image build stage sometimes works for the same project.

Issue arises in multistage as well as single stage Dockerfile.

Expected behavior The image build should not freeze at Taking snapshot of full filesystem... and should succeed every time.

To Reproduce As the behaviour is intermittent, I'm not sure how it can be reproduced.

Description Yes/No
  • - [ ] Please check if this is a new feature you are proposing
  • - [x] Please check if the build works in docker but not in kaniko
  • - [ ] Please check if this error is seen when you use --cache flag
  • - [ ] Please check if your dockerfile is a multistage dockerfile

@tejal29

abhi1git commented 4 years ago

Can you please elaborate on which filesystem the snapshot is taken of while building the image, so that we can see if filesystem size is causing this issue? We are using kaniko to build images in GitLab CI/CD and the runner is deployed on Kubernetes using the Helm chart. Previously this issue used to arise randomly, but now all of our kaniko image build jobs freeze on Taking snapshot of full filesystem... @tejal29

tejal29 commented 4 years ago

@abhi1git can you try the newer snapshot mode --snapshotMode=redo?

tejal29 commented 4 years ago

@abhi1git please switch to using --snapshotMode=redo. See comments here https://github.com/GoogleContainerTools/kaniko/issues/1305#issuecomment-672752902
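
For anyone unfamiliar with the flag, a minimal invocation sketch (the registry, image name, and context path below are placeholders, not taken from this issue):

/kaniko/executor \
  --context dir:///workspace \
  --dockerfile /workspace/Dockerfile \
  --destination registry.example.com/my-image:latest \
  --snapshotMode=redo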

Kiddinglife commented 3 years ago

@abhi1git please switch to using --snapshotMode=redo. See comments here #1305 (comment)

I suffered from the same issue and --snapshotMode=redo did not resolve it. @abhi1git did you get it to work?

tejal29 commented 3 years ago

@Kiddinglife Can you provide your Dockerfile or some stats on the number of files in your repo?

leoschet commented 3 years ago

I am experiencing this problem while building an image of less than a GB. Interestingly, it fails silently: the GitLab CI job is marked as successful but no image is actually pushed.

We are using kaniko for several other projects, but this error only happens on two of them. Both are monorepos and use lerna to extend yarn commands to sub-packages.

I must say it was working at some point, and it does work normally when using docker to build the image.

Here is a snippet of the build logs:

INFO[0363] RUN yarn install --network-timeout 100000    
INFO[0363] cmd: /bin/sh                                 
INFO[0363] args: [-c yarn install --network-timeout 100000] 
INFO[0363] util.Lookup returned: &{Uid:1000 Gid:1000 Username:node Name: HomeDir:/home/node} 
INFO[0363] performing slow lookup of group ids for node 
INFO[0363] Running: [/bin/sh -c yarn install --network-timeout 100000] 
yarn install v1.22.5
info No lockfile found.
[1/4] Resolving packages...
INFO[0368] Pushed image to 1 destinations               
... A bunch of yarn logs ...
[4/4] Building fresh packages...
success Saved lockfile.
$ lerna bootstrap
lerna notice cli v3.22.1
lerna info bootstrap root only
yarn install v1.22.5
[1/4] Resolving packages...
success Already up-to-date.
$ lerna bootstrap
lerna notice cli v3.22.1
lerna WARN bootstrap Skipping recursive execution
Done in 20.00s.
Done in 616.92s.
INFO[0982] Taking snapshot of full filesystem...   

It is interesting to note that RUN yarn install --network-timeout 100000 is not the last step in the Dockerfile.

Neither --snapshotMode=redo nor --use-new-run solved the problem.

sneerin commented 3 years ago

Same issue; nothing changed except the kaniko version.

Bobgy commented 3 years ago

I'm hitting the same problem. I tried --snapshotMode=redo, but it does not always help. What would help us resolve the issue here? Would a reproducible Dockerfile + the number of files help with debugging? I'm trying --use-new-run now.

Bobgy commented 3 years ago

Adding a data point: I was initially observing the build-process freeze when I did not set any memory/cpu requests/limits. Then I added memory/cpu requests & limits, and the process started to OOM. I increased the memory limit to 6GB, but it still gets OOM killed. Looking at the memory usage, it skyrockets at the end -- when the log reaches taking snapshot of full filesystem. EDIT: I tried building the same image in local docker, and the maximum memory usage is less than 1GB.


logs

+ dockerfile=v2/container/driver/Dockerfile
+ context_uri=
+ context_artifact_path=/tmp/inputs/context_artifact/data
+ context_sub_path=
+ destination=gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver
+ digest_output_path=/tmp/outputs/digest/data
+ cache=true
+ cache_ttl=24h
+ context=
+ '[['  '!='  ]]
+ context=dir:///tmp/inputs/context_artifact/data
+ dirname /tmp/outputs/digest/data
+ mkdir -p /tmp/outputs/digest
+ /kaniko/executor --dockerfile v2/container/driver/Dockerfile --context dir:///tmp/inputs/context_artifact/data --destination gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver --snapshotMode redo --image-name-with-digest-file /tmp/outputs/digest/data '--cache=true' '--cache-ttl=24h'
E0730 12:20:40.314406      21 aws_credentials.go:77] while getting AWS credentials NoCredentialProviders: no valid providers in chain. Deprecated.
    For verbose messaging see aws.Config.CredentialsChainVerboseErrors
INFO[0000] Resolved base name golang:1.15-alpine to builder 
INFO[0000] Using dockerignore file: /tmp/inputs/context_artifact/data/.dockerignore 
INFO[0000] Retrieving image manifest golang:1.15-alpine 
INFO[0000] Retrieving image golang:1.15-alpine from registry index.docker.io 
E0730 12:20:40.518068      21 metadata.go:166] while reading 'google-dockercfg-url' metadata: http status code: 404 while fetching url 
http://metadata.google.internal./computeMetadata/v1/instance/attributes/google-dockercfg-url
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] No base image, nothing to extract            
INFO[0001] Built cross stage deps: map[0:[/build/v2/build/driver]] 
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] Retrieving image manifest golang:1.15-alpine 
INFO[0001] Returning cached image manifest              
INFO[0001] Executing 0 build triggers                   
INFO[0001] Checking for cached layer gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e... 
INFO[0001] No cached layer found for cmd RUN apk add --no-cache make bash 
INFO[0001] Unpacking rootfs as cmd RUN apk add --no-cache make bash requires it. 
INFO[0009] RUN apk add --no-cache make bash             
INFO[0009] Taking snapshot of full filesystem...        
INFO[0016] cmd: /bin/sh                                 
INFO[0016] args: [-c apk add --no-cache make bash]      
INFO[0016] Running: [/bin/sh -c apk add --no-cache make bash] 
fetch 
https://dl-cdn.alpinelinux.org/alpine/v3.14/main/x86_64/APKINDEX.tar.gz
fetch 
https://dl-cdn.alpinelinux.org/alpine/v3.14/community/x86_64/APKINDEX.tar.gz
(1/5) Installing ncurses-terminfo-base (6.2_p20210612-r0)
(2/5) Installing ncurses-libs (6.2_p20210612-r0)
(3/5) Installing readline (8.1.0-r0)
(4/5) Installing bash (5.1.4-r0)
Executing bash-5.1.4-r0.post-install
(5/5) Installing make (4.3-r0)
Executing busybox-1.33.1-r2.trigger
OK: 9 MiB in 20 packages
INFO[0016] Taking snapshot of full filesystem...        
INFO[0017] Pushing layer gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e to cache now 
INFO[0017] WORKDIR /build                               
INFO[0017] cmd: workdir                                 
INFO[0017] Changed working directory to /build          
INFO[0017] Creating directory /build                    
INFO[0017] Taking snapshot of files...                  
INFO[0017] Pushing image to gcr.io/kfp-ci/4674c4982ab8fcf476e610f372fc0e4a38686805/v2-sample-test/test/kfp-driver/cache:9164be18ba887abd9388518d533d79a6e2fda9f81f33e57e0c71319d7a6da78e 
INFO[0017] COPY api/go.mod api/go.sum api/              
INFO[0017] Taking snapshot of files...                  
INFO[0017] COPY v2/go.mod v2/go.sum v2/                 
INFO[0017] Taking snapshot of files...                  
INFO[0017] RUN cd v2 && go mod download                 
INFO[0017] cmd: /bin/sh                                 
INFO[0017] args: [-c cd v2 && go mod download]          
INFO[0017] Running: [/bin/sh -c cd v2 && go mod download] 
INFO[0018] Pushed image to 1 destinations               
INFO[0140] Taking snapshot of full filesystem...        
Killed

version: gcr.io/kaniko-project/executor:v1.6.0-debug
args: I added snapshotMode=redo, cache=true
env: GKE 1.19, using Kubeflow Pipelines to run kaniko containers
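
For reference, this is roughly the shape of the memory/cpu requests & limits mentioned above on the kaniko container (values are illustrative, not a recommendation):

resources:
  requests:
    cpu: "1"
    memory: 4Gi
  limits:
    cpu: "2"
    memory: 6Gi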

Bobgy commented 3 years ago

I guess the root cause is actually insufficient memory, but when we do not allocate enough memory it will freeze on taking snapshot of full filesystem... as a symptom.

Bobgy commented 3 years ago

Edit: my guess was wrong. I reverted to kaniko:1.3.0-debug and added enough memory requests & limits, but I'm still observing the image build freeze from time to time.

oussemos commented 3 years ago

Hi @abhi1git, did you find a solution for your issue? I am facing the same.

sph3rex commented 3 years ago

The issue is still present for me too. Any updates?

pY4x3g commented 2 years ago

Same issue here. The system has enough memory (it is not hitting any memory limits), --snapshotMode=redo and --use-new-run do not change the behavior at all, and I do not see any problems when using trace verbosity. I am currently using 1.17.0-debug.

oussemos commented 2 years ago

Hi @abhi1git, did you find a solution for your issue? I am facing the same.

For us, after investigation, we found that the WAF in front of our GitLab was blocking the requests. After whitelisting it, everything works fine.

jsravn commented 2 years ago

Still an issue, can you reopen @tejal29? Building an image like this shouldn't be getting OOMKilled or using GBs of RAM; this seems like a clear-cut bug to me.

gtskaushik commented 2 years ago

Hi @abhi1git, did you find a solution for your issue? I am facing the same.

For us, after investigation, we found that the WAF in front of our GitLab was blocking the requests. After whitelisting it, everything works fine.

What kind of whitelisting was required for this? Can you help me clarify how to set it up?

oussemos commented 2 years ago

Hi @abhi1git, did you find a solution for your issue? I am facing the same.

For us, after investigation, we found that the WAF in front of our GitLab was blocking the requests. After whitelisting it, everything works fine.

What kind of whitelisting was required for this? Can you help me clarify how to set it up?

If you have a WAF in front of GitLab, it would be good to check your logs first and confirm what kind of requests it is blocking.

irizzant commented 2 years ago

Anyone tried with version 1.7.0?

imjasonh commented 2 years ago

Anyone tried with version 1.7.0?

v1.7.0 is about 4 months old, and had some showstopper auth issues, and :latest currently points to :v1.6.0, so I would guess that not many folks are using :v1.7.0

Instead, while we wait for v1.8.0 (#1871) you can try a commit-tagged image, the latest of which is currently :09e70e44d9e9a3fecfcf70cb809a654445837631

irizzant commented 2 years ago

Thanks @imjasonh I'm going to try gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631-debug

irizzant commented 2 years ago

I've tried gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631-debug with --snapshotMode=redo --use-new-run, my pipeline is still stuck in

INFO[0009] Taking snapshot of full filesystem...        

Guess the only solution is waiting for another commit-tagged image or 1.8.0 to be released

imjasonh commented 2 years ago

Guess the only solution is waiting for another commit-tagged image or 1.8.0 to be released

It sounds like whatever bug is causing that is still present, so it won't be fixed by releasing the latest image as v1.8.0. We just need someone to figure out why it gets stuck and fix it.

Unfortunately Kaniko is not really actively staffed at the moment, so it's probably going to fall to you or me or some other kind soul reading this to investigate and get us back on the track to solving this. Any takers?

irizzant commented 2 years ago

It sounds like whatever bug is causing that is still present, so it won't be fixed by releasing the latest image as v1.8.0. We just need someone to figure out why it gets stuck and fix it.

Hold on a second, maybe I spoke too early!

My pipeline currently builds multiple images in parallel. I hadn't realized that one of them, which was previously stuck in taking snapshot, now goes on smoothly with --snapshotMode=redo --use-new-run and gcr.io/kaniko-project/executor:09e70e44d9e9a3fecfcf70cb809a654445837631-debug.

The images that are actually stuck are basically the same Postgres image built with different build-arg values, which ends up running (and caching) the same layers in parallel.

I consequently removed this parallelism and built these Postgres images in sequence. I then ended up with a Postgres image stuck in taking snapshot in parallel with a totally different NodeJS image, which was also stuck in taking snapshot.

So from my tests it looks like, when images are built in parallel against the same registry mirror used as cache, if one image is taking a snapshot in parallel with another it gets stuck.

It may be a coincidence, maybe not. I repeat: this is from my tests, and it could be totally unrelated to the problem.
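
One way to test this hypothesis (just an idea, not a verified fix) would be to give each parallel build its own cache repository via kaniko's --cache-repo flag, so the builds no longer push the same cached layers to the same registry path. A sketch with placeholder names and a hypothetical PG_FLAVOR build-arg:

# job A (placeholder names)
/kaniko/executor --context dir:///workspace --dockerfile Dockerfile \
  --build-arg PG_FLAVOR=a --destination registry.example.com/postgres:a \
  --cache=true --cache-repo=registry.example.com/cache/postgres-a

# job B (placeholder names)
/kaniko/executor --context dir:///workspace --dockerfile Dockerfile \
  --build-arg PG_FLAVOR=b --destination registry.example.com/postgres:b \
  --cache=true --cache-repo=registry.example.com/cache/postgres-b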

chenlein commented 2 years ago

Same issue:

  containers:
  - args:
    - --dockerfile=/workspace/Dockerfile
    - --context=dir:///workspace/
    - --destination=xxxx/xxx/xxx:1.0.0
    - --skip-tls-verify
    - --verbosity=debug
    - --build-arg="http_proxy='http://xxxx'"
    - --build-arg="https_proxy='http://xxxx'"
    - --build-arg="HTTP_PROXY='http://xxxx'"
    - --build-arg="HTTPS_PROXY='http://xxxx'"
    image: gcr.io/kaniko-project/executor:v1.7.0
    imagePullPolicy: IfNotPresent
    name: kaniko
    volumeMounts:
    - mountPath: /kaniko/.docker
      name: secret
    - mountPath: /workspace
      name: code

Here are some logs, maybe useful:

......
DEBU[0021] Whiting out /usr/share/doc/linux-libc-dev/.wh..wh..opq 
DEBU[0021] not including whiteout files                 
DEBU[0021] Whiting out /usr/share/doc/make/.wh..wh..opq 
DEBU[0021] not including whiteout files                 
DEBU[0021] Whiting out /usr/share/doc/pkg-config/.wh..wh..opq 
DEBU[0021] not including whiteout files                 
DEBU[0021] Whiting out /usr/share/gdb/auto-load/lib/.wh..wh..opq 
DEBU[0021] not including whiteout files                 
DEBU[0021] Whiting out /usr/share/glib-2.0/.wh..wh..opq 
DEBU[0021] not including whiteout files                 
DEBU[0021] Whiting out /usr/share/perl5/Dpkg/.wh..wh..opq 
DEBU[0021] not including whiteout files                 
DEBU[0021] Whiting out /usr/share/pkgconfig/.wh..wh..opq 
DEBU[0021] not including whiteout files                 
DEBU[0021] Whiting out /usr/local/go/.wh..wh..opq       
DEBU[0021] not including whiteout files                 
DEBU[0030] Whiting out /go/.wh..wh..opq                 
DEBU[0030] not including whiteout files                 
INFO[0030] ENV GOPRIVATE "gitee.com/dmcca/*"            
DEBU[0030] build: skipping snapshot for [ENV GOPRIVATE "gitee.com/dmcca/*"] 
INFO[0030] ENV GOPROXY "https://goproxy.cn,direct"      
DEBU[0030] build: skipping snapshot for [ENV GOPROXY "https://goproxy.cn,direct"] 
DEBU[0030] Resolved ./.netrc to .netrc                  
DEBU[0030] Resolved /root/.netrc to /root/.netrc        
DEBU[0030] Getting files and contents at root /workspace/ for /workspace/.netrc 
DEBU[0030] Using files from context: [/workspace/.netrc] 
INFO[0030] COPY ./.netrc /root/.netrc                   
DEBU[0030] Resolved ./.netrc to .netrc                  
DEBU[0030] Resolved /root/.netrc to /root/.netrc        
DEBU[0030] Getting files and contents at root /workspace/ for /workspace/.netrc 
DEBU[0030] Copying file /workspace/.netrc to /root/.netrc 
INFO[0030] Taking snapshot of files...                  
DEBU[0030] Taking snapshot of files [/root/.netrc / /root] 
INFO[0030] RUN chmod 600 /root/.netrc                   
INFO[0030] Taking snapshot of full filesystem...        
max-au commented 2 years ago

Same issue here:

INFO[0163] Taking snapshot of full filesystem...        
fatal error: runtime: out of memory
runtime stack:
runtime.throw({0x12f3614, 0x16})
    /usr/local/go/src/runtime/panic.go:1198 +0x54
runtime.sysMap(0x4041c00000, 0x20000000, 0x220fdd0)
    /usr/local/go/src/runtime/mem_linux.go:169 +0xbc

<...>
github.com/google/go-containerregistry/pkg/v1/tarball.WithCompressedCaching.func1()
    /src/vendor/github.com/google/go-containerregistry/pkg/v1/tarball/layer.go:119 +0x6c fp=0x40005d3b10 sp=0x40005d3a80 pc=0xa6134c
github.com/google/go-containerregistry/pkg/v1/tarball.computeDigest(0x40008a5d70)
    /src/vendor/github.com/google/go-containerregistry/pkg/v1/tarball/layer.go:278 +0x44 fp=0x40005d3b80 sp=0x40005d3b10 pc=0xa624e4
github.com/google/go-containerregistry/pkg/v1/tarball.LayerFromOpener(0x400000d2c0, {0x40005d3cf8, 0x1, 0x1})
    /src/vendor/github.com/google/go-containerregistry/pkg/v1/tarball/layer.go:247 +0x3f4 fp=0x40005d3c20 sp=0x40005d3b80 pc=0xa62174
github.com/google/go-containerregistry/pkg/v1/tarball.LayerFromFile({0x4000a22018, 0x12}, {0x40005d3cf8, 0x1, 0x1})
    /src/vendor/github.com/google/go-containerregistry/pkg/v1/tarball/layer.go:188 +0x8c fp=0x40005d3c70 sp=0x40005d3c20 pc=0xa61cbc
github.com/GoogleContainerTools/kaniko/pkg/executor.pushLayerToCache(0x21d93a0, {0x40008b75c0, 0x40}, {0x4000a22018, 0x12}, {0x400016d940, 0x3a})
    /src/pkg/executor/push.go:295 +0x68 fp=0x40005d3ee0 sp=0x40005d3c70 pc=0xf1d4a8
github.com/GoogleContainerTools/kaniko/pkg/executor.(*stageBuilder).build.func3()
    /src/pkg/executor/build.go:425 +0xa4 fp=0x40005d3f60 sp=0x40005d3ee0 pc=0xf16474
<...>
compress/gzip.(*Writer).Write(0x40006780b0, {0x40014f6000, 0x8000, 0x8000})
    /usr/local/go/src/compress/gzip/gzip.go:196 +0x388
io.copyBuffer({0x1678960, 0x40006780b0}, {0x167dfe0, 0x40006163e8}, {0x0, 0x0, 0x0})
    /usr/local/go/src/io/io.go:425 +0x224
io.Copy(...)
    /usr/local/go/src/io/io.go:382
github.com/google/go-containerregistry/internal/gzip.ReadCloserLevel.func1(0x400064be80, 0x1, 0x40006163f8, {0x16902e0, 0x40006163e8})
    /src/vendor/github.com/google/go-containerregistry/internal/gzip/zip.go:60 +0xb4
created by github.com/google/go-containerregistry/internal/gzip.ReadCloserLevel
    /src/vendor/github.com/google/go-containerregistry/internal/gzip/zip.go:52 +0x230

Docker works fine (yet requires privileged mode).

irizzant commented 2 years ago

Version 1.8 has been released, can you try with that?

max-au commented 2 years ago

The stack traces I pasted are from 1.8.0.

irizzant commented 2 years ago

@max-au yours looks like a different problem though

INFO[0163] Taking snapshot of full filesystem... fatal error: runtime: out of memory

This is an out-of-memory error, while the problem reported here is that the build just freezes and doesn't show any error or progress.

xpacm commented 2 years ago

Maybe setting this to false will help:

https://github.com/GoogleContainerTools/kaniko#--compressed-caching

baslr commented 2 years ago

We could fix the GitLab CI/CD pipeline error

Taking snapshot of full filesystem....
Killed

with --compressed-caching=false and v1.8.0-debug. The image is around 2 GB; Alpine reported around 4 GB across roughly 100 packages.
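
A minimal sketch of the executor call with that flag, assuming the standard GitLab CI variables (adjust the destination to your registry):

/kaniko/executor \
  --context "${CI_PROJECT_DIR}" \
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
  --destination "${CI_REGISTRY_IMAGE}:${CI_COMMIT_SHORT_SHA}" \
  --compressed-caching=false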

gerrnot commented 2 years ago

I had the same issue when running on a small demo environment.

kubectl top pods showed 6633Mi memory consumption.

The issue went away after running the build on a "real" cluster. I did not fiddle with compression params, but I do use caching.

I'm just curious why it failed with exit code 1 and did not show the usual OOMKilled. This makes it really hard to find the root cause.

b0nete commented 2 years ago

We could fix the GitLab CI/CD pipeline error

Taking snapshot of full filesystem....
Killed

with --compressed-caching=false and v1.8.0-debug. The image is around 2 GB; Alpine reported around 4 GB across roughly 100 packages.

Thanks @baslr, this worked for me.

stranljip commented 2 years ago

We started to have this problem in the last few days within our GitLab CI. The workarounds did not work for us. After discovering the version-tag syntax in the GitLab documentation (https://docs.gitlab.com/ee/ci/docker/using_kaniko.html) we switched to gcr.io/kaniko-project/executor:v1.8.0-debug and the problem effectively disappeared.

stranljip commented 2 years ago

We started to have this problem in the last few days within our GitLab CI. The workarounds did not work for us. After discovering the version-tag syntax in the GitLab documentation (https://docs.gitlab.com/ee/ci/docker/using_kaniko.html) we switched to gcr.io/kaniko-project/executor:v1.8.0-debug and the problem effectively disappeared.

Seems I was too fast. The problem persists (at least in some jobs).

kremenevskiy commented 2 years ago

We started to have this problem in the last few days within our GitLab CI. The workarounds did not work for us. After discovering the version-tag syntax in the GitLab documentation (https://docs.gitlab.com/ee/ci/docker/using_kaniko.html) we switched to gcr.io/kaniko-project/executor:v1.8.0-debug and the problem effectively disappeared.

Seems I was too fast. The problem persists (at least in some jobs).

Same problem: INFO[0170] Taking snapshot of full filesystem... ERROR: Job failed: pod "runner-xxxxxxx" status is "Failed"

But the problem disappears when I drop some of the poetry packages I need; when I add them back, the problem comes back. I tried deleting different poetry packages and there is no clear pattern as to which one makes it crash. If you manage to solve it, please let us know, thanks.

kremenevskiy commented 2 years ago

We started to have this problem in the last few days within our GitLab CI. The workarounds did not work for us. After discovering the version-tag syntax in the GitLab documentation (https://docs.gitlab.com/ee/ci/docker/using_kaniko.html) we switched to gcr.io/kaniko-project/executor:v1.8.0-debug and the problem effectively disappeared.

Seems I was too fast. The problem persists (at least in some jobs) @stranljip @max-au @irizzant @chenlein @pY4x3g

SOLVED THE PROBLEM!!! This error is caused by the CI runner not having sufficient ephemeral storage (space on the runner pod) to save its caches and other files while building the image. The space can be adjusted with the ephemeral-storage parameter. You can read more about it here: https://docs.openshift.com/container-platform/4.7/storage/understanding-ephemeral-storage.html

I just increased it from 4GB to 6GB and all issues are gone. All pipelines succeeded!
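
On a Kubernetes-based runner, this is expressed with the standard ephemeral-storage resource on the build pod; a sketch with illustrative values (where exactly this goes depends on how your runner or pipeline pods are configured):

resources:
  requests:
    ephemeral-storage: 4Gi
  limits:
    ephemeral-storage: 6Gi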

ghost commented 1 year ago

I am having the same problem when using kaniko in GitLab CI. The workarounds didn't work; it times out (stuck at INFO[0744] Taking snapshot of full filesystem...).

This is the stage/job I'm using. As you can see, it uses version 1.8.0-debug and --compressed-caching=false.

The .gitlab-ci.yml

stages:
  - delivery

container_registry:
  stage: delivery
  image:
    name: gcr.io/kaniko-project/executor:v1.8.0-debug
    entrypoint: [""]
  before_script:
    - IMAGE_TAG=$CI_COMMIT_SHORT_SHA
    - |-
      cat << EOF > $CI_PROJECT_DIR/.dockerignore
      bin/
      obj/
      EOF
    - cat $CI_PROJECT_DIR/.dockerignore
    - |-
      cat << EOF > /kaniko/.docker/config.json
      {
        "auths": {
          "$CI_REGISTRY": {
            "username": "$CI_REGISTRY_USER",
            "password": "$CI_REGISTRY_PASSWORD"
          }
        }
      }
      EOF
    - cat /kaniko/.docker/config.json
  script:
    - /kaniko/executor 
      --context $CI_PROJECT_DIR 
      --dockerfile $CI_PROJECT_DIR/Dockerfile 
      --destination $CI_REGISTRY_IMAGE:latest 
      --destination $CI_REGISTRY_IMAGE:$IMAGE_TAG
      --compressed-caching=false
      --verbosity=debug
  rules:
    - if: $CI_COMMIT_TAG
      when: never
    - when: on_success

The specific Dockerfile I'm testing with is a large one (I don't know how to make it smaller!): it has an Ubuntu base OS with the .NET SDK, Android SDK, and JDK, plus some tools (to help me build .NET MAUI apps targeting Android).

ARG REPO=mcr.microsoft.com/dotnet/aspnet
FROM $REPO:7.0.1-jammy-amd64 AS platform

ENV \
    # Unset ASPNETCORE_URLS from aspnet base image
    ASPNETCORE_URLS= \
    # Do not generate certificate
    DOTNET_GENERATE_ASPNET_CERTIFICATE=false \
    # Do not show first run text
    DOTNET_NOLOGO=true \
    # SDK version
    DOTNET_SDK_VERSION=7.0.101 \
    # Enable correct mode for dotnet watch (only mode supported in a container)
    DOTNET_USE_POLLING_FILE_WATCHER=true \
    # Skip extraction of XML docs - generally not useful within an image/container - helps performance
    NUGET_XMLDOC_MODE=skip \
    # PowerShell telemetry for docker image usage
    POWERSHELL_DISTRIBUTION_CHANNEL=PSDocker-DotnetSDK-Ubuntu-22.04

RUN apt-get update \
    && apt-get install -y --no-install-recommends \
        curl \
        git \
        wget \
    && rm -rf /var/lib/apt/lists/*

# Install .NET SDK
RUN curl -fSL --output dotnet.tar.gz https://dotnetcli.azureedge.net/dotnet/Sdk/$DOTNET_SDK_VERSION/dotnet-sdk-$DOTNET_SDK_VERSION-linux-x64.tar.gz \
    && dotnet_sha512='cf289ad0e661c38dcda7f415b3078a224e8347528448429d62c0f354ee951f4e7bef9cceaf3db02fb52b5dd7be987b7a4327ca33fb9239b667dc1c41c678095c' \
    && echo "$dotnet_sha512  dotnet.tar.gz" | sha512sum -c - \
    && mkdir -p /usr/share/dotnet \
    && tar -oxzf dotnet.tar.gz -C /usr/share/dotnet ./packs ./sdk ./sdk-manifests ./templates ./LICENSE.txt ./ThirdPartyNotices.txt \
    && rm dotnet.tar.gz \
    # Trigger first run experience by running arbitrary cmd
    && dotnet help

# Install PowerShell global tool
RUN powershell_version=7.3.0 \
    && curl -fSL --output PowerShell.Linux.x64.$powershell_version.nupkg https://pwshtool.blob.core.windows.net/tool/$powershell_version/PowerShell.Linux.x64.$powershell_version.nupkg \
    && powershell_sha512='c4a72142e2bfae0c2a64a662f1baa27940f1db8a09384c90843163e339581d8d41824145fb9f79c680f9b7906043365e870d48d751ab8809c15a590f47562ae6' \
    && echo "$powershell_sha512  PowerShell.Linux.x64.$powershell_version.nupkg" | sha512sum -c - \
    && mkdir -p /usr/share/powershell \
    && dotnet tool install --add-source / --tool-path /usr/share/powershell --version $powershell_version PowerShell.Linux.x64 \
    && dotnet nuget locals all --clear \
    && rm PowerShell.Linux.x64.$powershell_version.nupkg \
    && ln -s /usr/share/powershell/pwsh /usr/bin/pwsh \
    && chmod 755 /usr/share/powershell/pwsh \
    # To reduce image size, remove the copy nupkg that nuget keeps.
    && find /usr/share/powershell -print | grep -i '.*[.]nupkg$' | xargs rm

# JAVA
RUN apt-get update && \
    apt-get install -y openjdk-11-jdk && \
    rm -rf /var/lib/apt/lists/*

ENV JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64/

# Install workload maui
RUN dotnet workload install maui-android --ignore-failed-sources

# Utils
RUN apt-get update && apt-get install -y \
    unzip \
    jq \
    bzip2 \
    libzip4 \
    libzip-dev && \
    rm -rf /var/lib/apt/lists/*

# Install Android SDK
RUN mkdir -p /usr/lib/android-sdk/cmdline-tools/latest && \
    curl -k "https://dl.google.com/android/repository/commandlinetools-linux-9123335_latest.zip" -o commandlinetools-linux.zip && \
    unzip -q commandlinetools-linux.zip -d /usr/lib/android-sdk/tmp && \
    mv  /usr/lib/android-sdk/tmp/cmdline-tools/* /usr/lib/android-sdk/cmdline-tools/latest && \
    rm -rf /usr/lib/android-sdk/tmp/ && \
    rm commandlinetools-linux.zip 

ENV ANDROID_SDK_ROOT=/usr/lib/android-sdk
ENV PATH=$ANDROID_SDK_ROOT/cmdline-tools/latest/bin:$PATH

RUN yes | sdkmanager --licenses && \
    sdkmanager "platform-tools" && \
    sdkmanager "ndk-bundle" && \
    sdkmanager "build-tools;33.0.0" "platforms;android-33"

Is there anything else I could do?

It eventually fails (after 33 minutes or so) with a space-related error:

INFO[0744] Taking snapshot of full filesystem...        
error building image: error building stage: failed to take snapshot: write /kaniko/323538799: no space left on device
Cleaning up project directory and file based variables 00:01
ERROR: Job failed: exit code 1
braykov commented 1 year ago

@diego-roundev, the 60 min timeout is the default time GitLab waits for a job to finish; you don't need to play with that. If you have access to the GitLab runner settings, increase the available memory. To find out by how much, build your image locally with docker and check the image size. Make sure the GitLab runner can use more MB/GB than your image size.
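
If the runner uses the Kubernetes executor, one place to raise the memory is the runner's config.toml; a sketch with illustrative values (key names as documented for the GitLab Runner Kubernetes executor, assuming that executor is in use):

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # memory given to the build container (illustrative values)
    memory_request = "4Gi"
    memory_limit = "8Gi"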

chrisbakr commented 1 year ago

I have the same issue on GitLab CI/CD, but only when cache is set to true.

RuslanAbdullaev commented 1 year ago

I have this problem too in Gitlab CI/CD

aleksey-masl commented 1 year ago

Hello everyone! I found a solution here (https://stackoverflow.com/questions/67748472/can-kaniko-take-snapshots-by-each-stage-not-each-run-or-copy-operation): add the --single-snapshot option to kaniko.

/kaniko/executor --context "${CI_PROJECT_DIR}" --dockerfile "${CI_PROJECT_DIR}/Dockerfile" --destination "${YC_CI_REGISTRY}/${YC_CI_REGISTRY_ID}/${CI_PROJECT_PATH}:${CI_COMMIT_SHA}" --single-snapshot

iamkhalidbashir commented 1 year ago

I have this problem too in Gitlab CI/CD

Same for me too

aleksey-masl commented 1 year ago

If it doesn't work, you may try adding --use-new-run and --snapshot-mode=redo. All flags: https://github.com/GoogleContainerTools/kaniko/blob/main/README.md For me it is working!

- mkdir -p /kaniko/.docker
- echo "{\"auths\":{\"${YC_CI_REGISTRY}\":{\"auth\":\"$(printf "%s:%s" "${YC_CI_REGISTRY_USER}" "${YC_CI_REGISTRY_PASSWORD}" | base64 | tr -d '\n')\"}}}" > /kaniko/.docker/config.json
- >-
  /kaniko/executor
  --context "${CI_PROJECT_DIR}"
  --use-new-run
  --snapshot-mode=redo
  --dockerfile "${CI_PROJECT_DIR}/Dockerfile"
  --destination "${YC_CI_REGISTRY}/${YC_CI_REGISTRY_ID}/${CI_PROJECT_PATH}:${CI_COMMIT_REF_SLUG}-${CI_COMMIT_SHA}"
bhack commented 11 months ago

I have the same issue. Is it a disk size issue?

aminya commented 11 months ago

We could fix the GitLab CI/CD pipeline error

Taking snapshot of full filesystem....
Killed

with --compressed-caching=false and v1.8.0-debug. The image is around 2 GB; Alpine reported around 4 GB across roughly 100 packages.

I see this answer getting lost in this thread, but it fixed the issue for me. Just pass this flag to kaniko:

--compressed-caching=false
bhack commented 11 months ago

--compressed-caching=false

It is not available in the Skaffold schema for kaniko, so I am trying to understand the root cause of this issue.

neighbour-oldhuang commented 3 months ago

Same here.

kirin-13 commented 3 months ago

Same here.

Try adding the --single-snapshot parameter.

hottehead commented 2 months ago

I had this problem when trying to install Terraform in an Alpine Linux image following the recommendations from this page: https://www.hashicorp.com/blog/installing-hashicorp-tools-in-alpine-linux-containers

However, the apk del .deps command in the very last line triggered the issue. Presumably this changes a lot of files?
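
For reference, the pattern from that blog post installs the download tools as a named virtual package and removes them at the end of the same RUN; a sketch of that kind of step (version and package list are illustrative), where the step adds and then removes many files, which the following full-filesystem snapshot has to scan:

FROM alpine:3.18
ARG TERRAFORM_VERSION=1.5.7
# install download tools as a virtual package, fetch terraform, then remove the tools
RUN apk add --no-cache --virtual .deps curl unzip \
    && curl -fsSL -o terraform.zip "https://releases.hashicorp.com/terraform/${TERRAFORM_VERSION}/terraform_${TERRAFORM_VERSION}_linux_amd64.zip" \
    && unzip terraform.zip -d /usr/local/bin \
    && rm terraform.zip \
    && apk del .deps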