GoogleContainerTools / kaniko

Build Container Images In Kubernetes
Apache License 2.0

Build with kaniko is randomly crashing after Taking snapshot of full filesystem #2275

Open slamer59 opened 2 years ago

slamer59 commented 2 years ago

Actual behavior

The kaniko build silently crashes after taking the full filesystem snapshot, with no useful error. Works fine with dind. Disabling the kaniko cache doesn't help.

Might be related to

Expected behavior

The build should complete successfully. Instead, running GitLab CI with kaniko leads to this error:

INFO[0195] Taking snapshot of full filesystem...        
Cleaning up project directory and file based variables 00:00
ERROR: Job failed: pod "runner-75bjfbsg-project-35867263-concurrent-1nv49n" status is "Failed"

I don't know how to keep the failed Job around in k8s, so I cannot see what happened (I will look into how to keep failing pods).

To Reproduce

Steps to reproduce the behavior:

  1. Running this CI job in GitLab (a hypothetical full job sketch follows below):

docker-build:
  stage: docker-build
  rules:

leads to a failure.
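For reference, a complete kaniko job in .gitlab-ci.yml usually has roughly the shape below. This is a hypothetical sketch: only the job name and stage come from the snippet above; the executor image, rule, and script lines are illustrative assumptions, not the reporter's actual configuration.

# Hypothetical .gitlab-ci.yml sketch; not the reporter's real job.
docker-build:
  stage: docker-build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"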

I can re-run the same configuration and it sometimes works. Here is an example from the docker-build [production] job (screenshot).

Additional Information

# Install dependencies only when needed
FROM node:16-alpine AS deps
# Check https://github.com/nodejs/docker-node/tree/b4117f9333da4138b03a546ec926ef50a31506c3#nodealpine to understand why libc6-compat might be needed.
RUN apk add --no-cache libc6-compat chromium
WORKDIR /app
# COPY package.json yarn.lock ./
# RUN yarn install --frozen-lockfile

# If using npm with a `package-lock.json` comment out above and use below instead
COPY package.json package-lock.json ./ 

RUN npm ci --legacy-peer-deps

# Rebuild the source code only when needed
FROM node:16-alpine AS builder
WORKDIR /app
COPY --from=deps /app/node_modules ./node_modules
COPY . .

RUN npm run build

# Production image, copy all the files and run next
FROM node:16-alpine AS runner
WORKDIR /app

ENV NODE_ENV production
# Uncomment the following line in case you want to disable telemetry during runtime.
# ENV NEXT_TELEMETRY_DISABLED 1

RUN addgroup --system --gid 1001 nodejs
RUN adduser --system --uid 1001 nextjs

# You only need to copy next.config.js if you are NOT using the default configuration
COPY --from=builder /app/next.config.js ./
COPY --chown=nextjs:nodejs --from=builder /app/public ./public
COPY --from=builder /app/package.json ./package.json

# Automatically leverage output traces to reduce image size 
# https://nextjs.org/docs/advanced-features/output-file-tracing
COPY --from=builder --chown=nextjs:nodejs /app/.next/standalone ./
COPY --from=builder --chown=nextjs:nodejs /app/.next/static ./.next/static

USER nextjs

EXPOSE 3000

ENV PORT 3000

CMD ["node", "server.js"]
Wells-Li commented 1 year ago

Maybe the memory limit is the main problem. I ran into this before too.

Verhaeg commented 1 year ago

I'm having the same issue: the pipeline is constantly failing. I'm not sure, but I think the snapshot is built in memory, and on Kubernetes that doesn't seem like the best idea, since memory is a limited resource, especially in this context.

In my case, I need to install some packages that amount to ~1.1 GB on top of the already existing data (Alpine-based image). Considering it bundles several Node applications, 1 GB unfortunately is not a lot for a final image.

Running on GKE with Autopilot and the GitLab Helm chart, memory is capped at 2Gi right now, which should be plenty except for this pipeline. Is it possible to disable the snapshot, or to make it use disk instead?

This seems related to: https://github.com/GoogleContainerTools/kaniko/issues/909
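For reference, the kaniko README documents a few executor flags that are commonly suggested when snapshotting runs out of memory. The sketch below shows how they could be added to a job like the one above; the flag names come from the README, but whether they resolve this particular crash is untested, and availability depends on the kaniko version.

# Hypothetical sketch; flags documented in the kaniko README:
#   --snapshot-mode=redo        use file metadata instead of full content hashing
#   --single-snapshot           take only one snapshot at the end of the build
#   --use-new-run               experimental run implementation that avoids some snapshots
#   --compressed-caching=false  do not compress cached layers in memory (slower, lower memory use)
docker-build:
  stage: docker-build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
      --snapshot-mode=redo
      --single-snapshot
      --use-new-run
      --compressed-caching=false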

gaatjeniksaan commented 1 year ago

We're having this issue as well with 1.9.1-debug. The final image size should be ~9 GB, but the kaniko build (on GKE) fails due to the memory limit. See the attached screenshot to share in my agony.

pimperator commented 1 year ago

I am having the same issues repeatedly (running on GitLab CI pipelines with EKS, memory limits in place)... the thing is, no matter how much memory I give the job, it uses everything it gets.

Here are some screenshots on the same job with different reservations/limits:

(Screenshots of the same job on 2023-10-04 at 09:23, 12:31, and 12:39, with different reservations/limits.)

At least the last one did not fail, but the other two failed while taking snapshots. The resulting container size is approximately 280 MB.

After running into some further issues I noticed this flag: https://github.com/GoogleContainerTools/kaniko#flag---compressed-caching, which needs to be set to 'false' on OOM errors. Setting that flag on my side has resulted in no OOM termination (yet), but the job still scratches the maximum allocatable memory. (Screenshot from 2023-10-04 at 14:58.)
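For completeness, a hedged sketch of how that flag might be combined with a per-job memory bump when the GitLab Kubernetes executor is in use. The KUBERNETES_MEMORY_REQUEST/KUBERNETES_MEMORY_LIMIT overwrite variables only take effect if the runner configuration allows such overwrites, and the values shown are placeholders, not recommendations.

# Hypothetical sketch, assuming the GitLab Kubernetes executor and a runner
# that permits memory request/limit overwrites per job.
docker-build:
  stage: docker-build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  variables:
    KUBERNETES_MEMORY_REQUEST: "2Gi"   # placeholder values
    KUBERNETES_MEMORY_LIMIT: "4Gi"
  script:
    - /kaniko/executor
      --context "$CI_PROJECT_DIR"
      --dockerfile "$CI_PROJECT_DIR/Dockerfile"
      --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA"
      --compressed-caching=false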