GoogleContainerTools / kaniko

Build Container Images In Kubernetes

Cache not found from package installs in previous multistage stages #3246

Open joeauty opened 4 months ago

joeauty commented 4 months ago

Actual behavior: apt-get install commands are not being cached when running Kaniko as a Kubernetes job, making image builds very slow

Expected behavior: a cache hit on those layers, since, as far as I can tell, there are no filesystem differences between builds

To Reproduce Steps to reproduce the behavior:

  1. Run the Kubernetes job:
    image: gcr.io/kaniko-project/executor:v1.23.1-debug
    command: ["/kaniko/executor"]
    args:
    - "--context=/workspace/build/buildkite/${BUILDKITE_ORGANIZATION_SLUG}/${BUILDKITE_PIPELINE_SLUG}"
    - "--destination=[repo URL]"
    - "--cache-repo=[cache repo URL]"
    - "--cache=true"
    - "--cleanup=true"
    - "--ignore-path=.git"

Additional Information

# ---- BASE IMAGE ----
FROM ruby:3.3.3-slim-bullseye as base-image

ENV INSTALL_PATH /data/go
ENV GETTEXT_LOCALES_PATH  $INSTALL_PATH/config/gettext_locales
ENV GETTEXT_CLIENT_LOCALES_PATH $INSTALL_PATH/client/locales
WORKDIR $INSTALL_PATH

RUN apt-get update && apt-get install -y libicu-dev libpq-dev python3-pip python-dev build-essential --no-install-recommends && apt-get clean \
  && pip install --upgrade setuptools pip \
  && pip install awscli \
  && gem update --system 3.5.13 \
  && gem install bundler:2.5.13

# ---- BUILD DEPENDENCIES ----
FROM base-image as build-dependencies

ENV INSTALL_PATH /data/go
ENV NODE_MAJOR 18
WORKDIR $INSTALL_PATH

SHELL ["/bin/bash", "-lc"]

RUN apt-get update && apt-get install -y curl gnupg ca-certificates --no-install-recommends && apt-get clean
COPY ./.tool-versions $INSTALL_PATH

The apt-get update && apt-get install -y curl gnupg ca-certificates --no-install-recommends && apt-get clean command is not loaded from cache; all other commands up to this point are. The logs show:

No cached layer found for cmd RUN apt-get update && apt-get install -y curl gnupg ca-certificates --no-install-recommends && apt-get clean

This same Dockerfile built in Docker makes use of the cache, making build times much faster.

Triage Notes for the Maintainers

- [ ] Please check if this is a new feature you are proposing
- [x] Please check if the build works in docker but not in kaniko
- [x] Please check if this error is seen when you use the --cache flag
- [x] Please check if your dockerfile is a multistage dockerfile
joeauty commented 4 months ago

Perhaps what would help me here is a more fundamental understanding of how caching works. How does the cache algorithm know what the output of a Dockerfile command is going to be before it runs, in order to decide whether the cached layer is valid for that command?

I'm wondering if the issue here has something to do with apt, pip, or gem package lists or the like.
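For what it's worth, my understanding (from kaniko's logs and docs rather than from reading the source closely, so treat this as an assumption) is that the cache key for a RUN step is a composite hash of the command text plus everything that precedes it in the stage: the base image, earlier commands, ENV/ARG values, WORKDIR, and the contents of any files brought in by COPY/ADD. The command's output never enters the key, so apt, pip, or gem package lists should not matter; what does matter is any drift in that prefix. Running the executor with --verbosity=debug should also log more detail about the key it computes for each command. A rough shell illustration of the idea (not kaniko's actual hashing code):

# Illustrative only: it shows why an identical RUN line can still miss the
# cache when anything earlier in the stage (e.g. WORKDIR or an ENV value)
# differs between builds.
key() { printf '%s' "$1" | sha256sum | awk '{print $1}'; }

key "FROM ruby:3.3.3-slim-bullseye|WORKDIR /data/go|RUN apt-get update && apt-get install -y curl"
key "FROM ruby:3.3.3-slim-bullseye|WORKDIR /|RUN apt-get update && apt-get install -y curl"
# two different digests, so the second build looks up a cache tag that was never pushed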

rcaillon-Iliad commented 4 months ago

Over the past few days, I've also encountered this cache problem in multi-stage builds. It worked fine for months before and I haven't changed anything.

ukd1 commented 3 months ago

I'm also getting this, and I can see that the cache hash it's looking for does exist, but it's not pulled:

...
INFO[2024-08-12T02:44:10Z] Checking for cached layer 
registry.gitlab.xxxx.xxxx/xxxx/cache:68130d05ac234eaae199bd7052a9898bb4df73c3517b830fb2e98923e488fcc3... 
INFO[2024-08-12T02:44:10Z] No cached layer found for cmd RUN apt-get update -qq &&     apt-get install --no-install-recommends -y build-essential curl git libpq-dev libvips pkg-config unzip 
...

68130d05ac234eaae199bd7052a9898bb4df73c3517b830fb2e98923e488fcc3 exists, but isn't used:

[screenshot: registry listing showing the cache tag is present]
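If it helps with debugging, one way to confirm from the command line that the cache entry really is present, independent of what kaniko reports, is to ask the registry for the tag's manifest. This assumes the crane CLI (from go-containerregistry) is installed and authenticated against the registry; substitute your own cache repo and the key from the "Checking for cached layer" log line:

CACHE_REPO=registry.gitlab.xxxx.xxxx/xxxx/cache   # redacted value from the log above
KEY=68130d05ac234eaae199bd7052a9898bb4df73c3517b830fb2e98923e488fcc3

# prints the manifest and exits 0 if the tag exists, fails otherwise
crane manifest "${CACHE_REPO}:${KEY}" > /dev/null \
  && echo "cache entry exists" \
  || echo "no cache entry for ${KEY}"
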
leeeunsang-tmobi commented 2 months ago

The tag keeps changing even though there are no changes. When I run the docker build command with the same Dockerfile, the cache layer works as expected.

INFO[0017] Executing 0 build triggers
INFO[0018] Building stage 'base' [idx: '1', base-idx: '0'] 
INFO[0018] Checking for cached layer xxxxxxxxxxxxxxxxxxxxxxxxxxxx:39478ea256ca812a762b7e6c93725c317e9f646dd50a3d105f91bc87cc690958... 
INFO[0018] No cached layer found for cmd RUN xxxxxxxxxxxxxxxxxxxxxxxxxxxx

The analysis is wrong. Ignore the following.

[screenshot: 2024-08-21 11:31 AM]
hleal18 commented 2 months ago

Seems related to #3254

Our tests revealed that using WORKDIR in a multi-stage build causes this issue, especially with RUN instructions for apt-get update/install commands like:

...
WORKDIR /app

FROM base as build

RUN apt-get update -qq && \
    apt-get install --no-install-recommends -y build-essential curl git libpq-dev node-gyp pkg-config python-is-python3

...

A workaround that worked for us was to either remove the WORKDIR directive or duplicate it across all stages. After that, the RUN instruction started using the cache correctly; previously, a different hash was being generated even when nothing had changed.

This was definitely not an issue before.

joeauty commented 2 months ago

> Seems related to #3254 [...] A workaround that worked for us was to either remove the WORKDIR directive or duplicate it across all stages.

Unfortunately this did not fix my issue, unless there is some problem with assigning WORKDIR from a variable?

# ---- BASE IMAGE ----
FROM ruby:3.3.4-slim-bullseye as base-image

ENV INSTALL_PATH /data/go
ENV GETTEXT_LOCALES_PATH  $INSTALL_PATH/config/gettext_locales
ENV GETTEXT_CLIENT_LOCALES_PATH $INSTALL_PATH/client/locales
WORKDIR $INSTALL_PATH

RUN apt-get update && apt-get install -y libicu-dev libpq-dev python3-pip python-dev build-essential --no-install-recommends && apt-get clean \
  && rm -rf /var/lib/apt/lists/* \
  && pip install --upgrade setuptools pip \
  && pip install awscli \
  && pip cache purge \
  && gem update --system 3.5.13 \
  && gem install bundler:2.5.13

# ---- BUILD DEPENDENCIES ----
FROM base-image as build-dependencies

ENV INSTALL_PATH /data/go
ENV NODE_MAJOR 20
WORKDIR $INSTALL_PATH

SHELL ["/bin/bash", "-lc"]

RUN apt-get update && apt-get install -y curl gnupg ca-certificates --no-install-recommends && apt-get clean
nielsavonds commented 1 month ago

We ended up moving the WORKDIR directive to after the RUN directives wherever possible, and that resolved it for us.

joeauty commented 1 month ago

> We ended up moving the WORKDIR directive to after the RUN directives wherever possible, and that resolved it for us.

Unfortunately, that does not work for me. Of course, I'm stating the obvious, but being able to drop in Kaniko without touching the Dockerfile at all would be ideal.

mzihlmann commented 1 month ago

Make sure that the directory exists before calling WORKDIR. If the directory does not exist, Kaniko is kind enough to create it for you, but not kind enough to also put that layer into the cache (come to think of it, I should probably open a bug ticket for that). That means a new layer is emitted every time the WORKDIR instruction is executed.

Inside the same build this is not immediately obvious, since you still get a 100% cache hit rate; however, all the layers are new, so the push is slower and you pull a completely new image afterwards. In multi-stage builds, or in builds that run on top of other images created with Kaniko, this causes huge problems, because the cache gets invalidated for them.

The workaround is simple enough:

RUN mkdir -p $INSTALL_PATH
WORKDIR $INSTALL_PATH
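
To see whether this applies to a given base image, you can check if the WORKDIR target already exists in it. A quick check for the ruby image used earlier in this thread, assuming Docker is available locally (the /data/go path is the INSTALL_PATH value from the Dockerfile above):

# almost certainly prints "missing": the slim ruby image does not ship /data/go,
# so WORKDIR has to create it, and kaniko will not cache that created layer
docker run --rm ruby:3.3.4-slim-bullseye sh -c 'test -d /data/go && echo exists || echo missing'

Note the -p on the mkdir above: /data does not exist in the base image either, so a plain mkdir of the nested path would fail.
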
mzihlmann commented 1 month ago

There you go: https://github.com/GoogleContainerTools/kaniko/issues/3340