Open RoSk0 opened 3 years ago
I had the same problem; you need to run --cleanup if you wish to reuse the same kaniko container.
Thanks for the suggestion @alanhughes . I've tested with the image from the original report (repo digest gcr.io/kaniko-project/executor@sha256:0f27b0674797b56db08010dff799c8926c4e9816454ca56cc7844df228c53485) by adding --cleanup to the call, like:
executor --cleanup --target project-php --destination kanico-test-image:php-latest --no-push
executor --cleanup --target project-nginx --destination kanico-test-image:php-latest --no-push
but the result is the same - error building image: could not save file: symlink ../chi-teck/drupal-code-generator/bin/dcg /kaniko/0/app/vendor/bin/dcg: file exists.
Then I updated the Kaniko image:
$ docker inspect gcr.io/kaniko-project/executor:debug
[
    {
        "Id": "sha256:ffca8c9f01a23d0886106b46f9bdd68dc5ca29d3377434bb69020df0cb2982a8",
        "RepoTags": [
            "gcr.io/kaniko-project/executor:debug"
        ],
        "RepoDigests": [
            "gcr.io/kaniko-project/executor@sha256:473d6dfb011c69f32192e668d86a47c0235791e7e857c870ad70c5e86ec07e8c"
        ],
        "Parent": "",
        "Comment": "",
        "Created": "2020-10-29T17:27:40.548045213Z",
        "DockerVersion": "19.03.8",
    }
]
which is Kaniko version: v1.3.0, and ran the steps to reproduce again - same result, despite the fact that there is an additional INFO[0153] Deleting filesystem... entry in the output when building the first container.
@RoSk0 Have you found a solution? I'm running out of ideas.
Unfortunately no. I've split my build job into three :(
I just ran into the same problem ;( I'm using npm for Node.js package management; it creates a node_modules/.bin directory which contains a lot of symlinks to various Node module scripts, and the build fails on this ;(
INFO[0076] Saving file code/node_modules for later use
error building image: could not save file: symlink ../google-p12-pem/build/src/bin/gp12-pem.js /kaniko/0/code/node_modules/.bin/gp12-pem: file exists
I also have encountered this issue and have split my job into three separate jobs. The --cleanup, --no-push combo did not resolve this for me.
Unfortunately it is still an issue with the latest release :(
executor version
Kaniko version : v1.6.0
This is basically a duplicate of my (currently closed) issue #1217 and I provided a minimal reproduction there, which I just updated to v1.6.0:
I am also seeing this issue with the latest release. It only occurs when building multiple images in the same container with the --cleanup flag.
gcr.io/kaniko-project/executor:debug 7053f62a27a8
I can replicate this with npm and a two-stage build, but it happens in the first stage:
INFO[0048] Saving file app for later use
INFO[0050] Saving file app/dist for later use
INFO[0050] Saving file app/node_modules for later use
error building image: could not save file: symlink ../acorn/bin/acorn /kaniko/0/app/node_modules/.bin/acorn: file exists
Command exited with non-zero status 1
Running on AKS with Gitlab CI.
Using gcr.io/kaniko-project/executor:v1.6.0-debug
time /kaniko/executor \
--context "${CI_PROJECT_DIR}" \
--dockerfile "${CI_PROJECT_DIR}/Dockerfile" \
--cache=true \
--destination "${IMAGE_TAG}" \
--build-arg NODE_IMAGE="${NODE_IMAGE}" \
--build-arg VERSION="${VERSION}" \
--build-arg VERSION_SEMVER="${VERSION_SEMVER}"
ARG NODE_IMAGE
ARG VERSION="not_set"
FROM $NODE_IMAGE as build
ARG VERSION
ENV APP_VERSION=${VERSION}
ARG VERSION_SEMVER="not_set"
WORKDIR /app
COPY package.json package-lock.json .npmrc ./
RUN npm version "${VERSION_SEMVER}" \
&& npm ci
COPY . .
RUN npm run build:ci
# -----------------------------------------------------
FROM $NODE_IMAGE
# ...
The following workaround works for me. After each execution I add:
rm -rf /kaniko/0
For example:
execute() {
    /kaniko/executor --context . --build-arg=MYARG=$1 --cleanup --destination myregistry.com/repo:tag-$1
    rm -rf /kaniko/0
}

while read -r line; do
    execute "$line"
done < my_file
I had the same issue, and on top of that, if you have more than two stages, Kaniko will also create /kaniko/1 and so on.
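Building on the workaround above, a variant that removes every numbered directory (and the purely numeric snapshot files) rather than just /kaniko/0 might look like this - a sketch only, with the registry and tag names as placeholders:

execute() {
    /kaniko/executor --context . --cleanup --destination "myregistry.com/repo:tag-$1"
    # Remove every per-stage directory kaniko leaves behind (/kaniko/0, /kaniko/1, ...)
    # as well as the numeric snapshot files it writes under /kaniko.
    rm -rf /kaniko/[0-9]*
}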
Having the same problem here, up to and including 1.9.1.
Is there any plan to fix this?
I can still reproduce it with version gcr.io/kaniko-project/executor:v1.16.0-debug
Using the Dockerfile that @AndreKR mentioned in https://github.com/GoogleContainerTools/kaniko/issues/1406#issuecomment-886083696
Also, I don't know exactly what's going on under the hood in kaniko, but is it intended that, while building an image, we are working on the / filesystem of the kaniko container itself? With a Dockerfile like this:
FROM alpine:3.17.1
RUN rm -rf /kaniko
I am able to remove the kaniko executor from inside the kaniko container.
➜ ~ docker run -it --rm --name kaniko-test --entrypoint="/busybox/sh" gcr.io/kaniko-project/executor:v1.16.0-debug
/workspace #
/workspace # vi Dockerfile
/workspace #
/workspace # cat Dockerfile
FROM alpine:3.17.1
RUN rm -rf /kaniko
/workspace #
/workspace # /kaniko/executor \
> --no-push \
> --log-format=text \
> --cleanup
time="2023-09-27T13:26:10Z" level=info msg="Retrieving image manifest alpine:3.17.1"
time="2023-09-27T13:26:10Z" level=info msg="Retrieving image alpine:3.17.1 from registry index.docker.io"
time="2023-09-27T13:26:12Z" level=info msg="Built cross stage deps: map[]"
time="2023-09-27T13:26:12Z" level=info msg="Retrieving image manifest alpine:3.17.1"
time="2023-09-27T13:26:12Z" level=info msg="Returning cached image manifest"
time="2023-09-27T13:26:12Z" level=info msg="Executing 0 build triggers"
time="2023-09-27T13:26:12Z" level=info msg="Building stage 'alpine:3.17.1' [idx: '0', base-idx: '-1']"
time="2023-09-27T13:26:12Z" level=info msg="Unpacking rootfs as cmd RUN rm -rf /kaniko requires it."
time="2023-09-27T13:26:17Z" level=info msg="RUN rm -rf /kaniko"
time="2023-09-27T13:26:17Z" level=info msg="Initializing snapshotter ..."
time="2023-09-27T13:26:17Z" level=info msg="Taking snapshot of full filesystem..."
time="2023-09-27T13:26:17Z" level=info msg="Cmd: /bin/sh"
time="2023-09-27T13:26:17Z" level=info msg="Args: [-c rm -rf /kaniko]"
time="2023-09-27T13:26:17Z" level=info msg="Running: [/bin/sh -c rm -rf /kaniko]"
error building image: error building stage: failed to take snapshot: open /kaniko/2366305773: no such file or directory
/workspace # ls -la /kaniko
ls: /kaniko: No such file or directory
/workspace #
So I dug in for a while and found that /kaniko/0 won't be deleted because the path /kaniko and all children of that directory are on defaultIgnoreList https://github.com/GoogleContainerTools/kaniko/blob/main/pkg/util/fs_util.go#L63 - this list contains the paths to exclude from deletion in the DeleteFilesystem() function https://github.com/GoogleContainerTools/kaniko/blob/main/pkg/util/fs_util.go#L245
A path like /workspace is cleaned up, so I suggest that kaniko should not store data in the /kaniko/0 directory but instead in something like /layers/0 (so it would be config.RootDir/layers/...). The same happens for snapshot files (files whose names contain only numbers, like /kaniko/0123123) - they are also created in the /kaniko directory, which means they will never be deleted. So snapshot files should be kept in another location, like config.RootDir/snapshots/...
https://github.com/GoogleContainerTools/kaniko/blob/main/pkg/snapshot/snapshot.go#L66
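To make that concrete, here is roughly what you can observe inside the debug container after a build with --cleanup (an illustrative sketch rather than captured output; the exact numeric names differ per build):

# DeleteFilesystem() wipes ordinary paths such as /workspace, but skips
# everything under /kaniko because the whole directory is on defaultIgnoreList,
# so the per-stage directories and snapshot files accumulate there.
ls /kaniko
# Expect the per-stage directory 0/ (plus 1/, 2/, ... for more stages),
# numeric snapshot files such as 2366305773, and the executor, warmer,
# credential helpers and ssl directory.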
After reconsidering the topic, I think the best solution for this issue is to distinguish between cleaning up the filesystem after a stage and cleaning up the filesystem for the --cleanup flag, as the same function is currently used for both cases. So there should be:
I created a simple PR with changes that remove all leftovers using regex. I'm open to discussion :)
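For reference, the leftovers in question are the purely numeric entries under /kaniko, so a manual equivalent of a regex-based cleanup might look like the snippet below (my own sketch, not necessarily what the PR does):

# Delete the numbered stage directories and snapshot files while keeping the
# executor, warmer, credential helpers and ssl directory intact.
ls /kaniko | grep -E '^[0-9]+$' | xargs -I {} rm -rf /kaniko/{}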
I hope this gets merged. Thanks. However, I wanted to post this in case someone else runs into the problem that I managed to solve:
Hello, after finding this issue and trying a lot of things in search of a generic fix for my case, I found what I think solves most of my error cases. It fixed all the failing builds across the Dockerfiles of around 15 projects (not all were failing, but the ones whose Dockerfiles had more stages were more prone to fail).
My use case for kaniko is inside a Jenkins pipeline that uses the Kubernetes plugin to run jobs in Kubernetes agent pods. Those agents define a single kaniko container, and I needed to build the image twice with that one container: once as a tar to scan it with Trivy (a container scanning tool), and then, after some quality checks pass, use the same kaniko container to build the image again and push it to ECR.
My solution was appending this to the first call, the one that builds the image as a tar: && rm -rf /kaniko/*[0-9]* && rm -rf /kaniko/Dockerfile && mkdir -p /workspace
The call ends up like this:
/kaniko/executor -f $(pwd)/docker/Dockerfile -c $(pwd) --tar-path=$(pwd)/image.tar --single-snapshot --no-push --destination=image --cleanup && rm -rf /kaniko/*[0-9]* && rm -rf /kaniko/Dockerfile && mkdir -p /workspace
I'm not a heavy kaniko user myself, but I found that the /kaniko directory was filled with files after the first execution, as others in this thread mentioned, and those files were breaking the next execution. The commands appended after the first build remove those problematic files, and the second execution works like a charm.
Hope this helps other people that find this issue. Thanks.
@ricardllop you can also use crane to upload a container tar. No need to rebuild the image. I don't know your use case in detail, but it sounds like it would fit.
https://github.com/google/go-containerregistry/blob/main/cmd/crane/doc/crane.md
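For completeness, pushing a previously built tarball with crane looks roughly like this (the tar path and registry reference are placeholders taken from the earlier examples in this thread):

# Push the image tarball produced with kaniko's --tar-path without rebuilding it.
crane push image.tar myregistry.com/repo:tag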
Is it possible for files in these locations to be overwritten rather than rewritten? Are all of these files used in the next stage of the image build, or is there a "list" that details the files stashed for that particular run that can be updated so only the relevant files need to be overwritten or pulled into the next stage?
@bdols @RoSk0
Unfortunately no. I've split my build job into three :(
It's concerning that the issue persists despite its age. I faced a similar problem using GitLab CI to build images in a Kubernetes cluster.
To avoid splitting stages into different jobs, consider creating a cleanup script and aliasing it as follows:
before_script:
- alias kaniko-cleanup='ls /kaniko | grep -v "docker-credential-acr-env\|docker-credential-gcr\|docker-credential-ecr-login\|executor\|warmer\|ssl" | xargs -I {} rm -rf /kaniko/{}'
Then, in your scripts:
script:
- /kaniko/executor ...
- kaniko-cleanup
- /kaniko/executor ...
- kaniko-cleanup
- /kaniko/executor ...
- kaniko-cleanup
This approach resolved the issue for me.
But ideally, kaniko should handle symlinks identically to Docker and do a proper cleanup after each job, especially given its role of producing images from Dockerfiles.
Hope this helps.
Actual behavior
I want to set up image building for our project as part of the CI pipeline using GitLab CI capabilities. Following https://docs.gitlab.com/13.2/ee/ci/docker/using_kaniko.html#building-a-docker-image-with-kaniko I set up the CI configuration, and it works perfectly if you build one image per job. It is not a GitLab issue, just bear with me.
We have a multi-stage Dockerfile to build our images. So if you try to build multiple targets inside the same (and this is crucial) container, it will fail with:
Expected behavior
Two (in my case) images built.
To Reproduce
Output of commands that ran successfully is omitted:
I've tried raising the verbosity level to debug - nothing useful. With trace it shows way too much to digest.
The directory content of vendor/bin is: