elastic / kibana

Your window into the Elastic Stack
https://www.elastic.co/products/kibana

Feature request: leverage Docker cached layers for faster build/pull times #158390

Open afharo opened 1 year ago

afharo commented 1 year ago

Describe the feature: AFAIK, our CI workers currently don't share any Docker cache layers. As a result, every CI-built Docker image runs all the steps in our Dockerfile from scratch.

Apart from lengthening the build process, it produces completely different Docker images, which means larger images and longer pull times.

Describe a specific use case for the feature: Any Docker-based deployment takes longer to upgrade (I've seen logs from a k8s cluster claiming that downloading a Kibana image takes around 30s).

elasticmachine commented 1 year ago

Pinging @elastic/kibana-operations (Team:Operations)

watson commented 1 year ago

@afharo Thanks for opening the issue. I've just been digging into this with help from @mistic and @jbudz, and as far as I understand it, once the Kibana Docker image is built for a particular CI run we never invalidate or rebuild it.

The only thing that we can see that can be optimised currently is adding a shared download cache between the CI workers to cache Docker images that need to be downloaded (mainly the Ubuntu base image).

I'm not ruling out that I might have misunderstood the problem, or that you're seeing issues we're not aware of. If so, please clarify.

jonathan-buttner commented 1 year ago

@watson did you mean to ping @jbudz?

watson commented 1 year ago

@jonathan-buttner oops, you're right. Thanks for clearing that up!

afharo commented 1 year ago

is adding a shared download cache between the CI workers to cache Docker images

Yeah! That'd help. AFAIK, it doesn't only apply to the FROM step: the shared cache also applies to any of the following steps (as long as their inputs don't change), like https://github.com/elastic/kibana/blob/f28b694e5aafb2322b984f79a0455e82b2b73ad6/src/dev/build/tasks/os_packages/docker_generator/templates/base/Dockerfile#L14-L19

The thing is, IIUC, sharing the cache also improves the docker pull experience: when two images share the same first N steps in the Dockerfile, all of those layers are reused from a previously pulled image, and upgrading is faster because the client skips downloading and extracting them.
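One way the CI workers could share those layers is BuildKit's registry-backed cache; a minimal sketch, assuming a hypothetical cache image that every worker can push to and pull from:

```
# Hypothetical cache location; any registry reachable by all CI workers works.
docker buildx build \
  --cache-from=type=registry,ref=my-registry.example/kibana/build-cache \
  --cache-to=type=registry,ref=my-registry.example/kibana/build-cache,mode=max \
  -t kibana:ci .
```

With mode=max the intermediate layers are exported too, so a later build on a different worker can reuse every step whose inputs haven't changed.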

Refer to this guide on the best practices for dockerizing a Node.js app: https://nodejs.org/en/docs/guides/nodejs-docker-webapp. Note that they first copy the package*.json files and run npm install. Only then do they copy the app's source (typically the part of the app that changes most often). This means subsequent docker pulls only need to download the diff.
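For illustration, a minimal sketch of that ordering (a generic Node.js app, not Kibana's actual Dockerfile; the base image and paths are just examples):

```
FROM node:18-slim
WORKDIR /app

# Dependency layers: only rebuilt (and re-pulled) when package*.json changes.
COPY package*.json ./
RUN npm ci --omit=dev

# Application source changes far more often, so it comes last; the layers
# above stay cached, and a docker pull only downloads this diff.
COPY . .
CMD ["node", "server.js"]
```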

Of course, the largest layer in our Dockerfile is still related to handling the .tar.gz file, but we can optimize that later in the process. And we could even discuss moving steps around if we think they can be reused.

WDYT?

watson commented 1 year ago

Thanks for the details ☺️

Refer to this guide on the best practices for dockerizing a Node.js app: https://nodejs.org/en/docs/guides/nodejs-docker-webapp. Note that they first copy the package*.json files and run npm install. Only then do they copy the app's source (typically the part of the app that changes most often). This means subsequent docker pulls only need to download the diff.

In our case, the Kibana Node.js app isn't built as a step in the Dockerfile, but in a separate process before Docker is even invoked. It's built and stored in a tarball, which is then copied into the container during a step in the Dockerfile:

https://github.com/elastic/kibana/blob/1ba8be4b8a247665986bb894632fcae07dee0e1d/src/dev/build/tasks/os_packages/docker_generator/templates/base/Dockerfile#L30

Since almost all CI runs will run with a slightly different version of the Kibana source code, this step will almost always invalidate everything below it. So maybe there's something here that can be improved by rearranging the lines. And let's definitely look into caching the Docker and OS package-manager pulls 👍
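To make the idea concrete, a rough sketch of that kind of layer ordering; the base image, package names, and tarball path below are illustrative, not the real template:

```
FROM ubuntu:22.04

# OS-level setup first: these inputs rarely change, so the layers stay cached
# and are shared across CI builds and docker pulls.
RUN apt-get update && \
    apt-get install -y --no-install-recommends curl fontconfig libnss3 && \
    rm -rf /var/lib/apt/lists/* && \
    groupadd --gid 1000 kibana && \
    useradd --uid 1000 --gid 1000 --home-dir /usr/share/kibana kibana

# The pre-built Kibana tarball changes on nearly every CI run, so the COPY and
# everything that depends on it come as late as possible.
COPY kibana.tar.gz /tmp/kibana.tar.gz
RUN mkdir -p /usr/share/kibana && \
    tar -xzf /tmp/kibana.tar.gz -C /usr/share/kibana --strip-components=1 && \
    rm /tmp/kibana.tar.gz && \
    chown -R 1000:0 /usr/share/kibana
```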

Ikuni17 commented 1 year ago

@watson A few ideas I have for this; I'm not entirely sure about the viability or the best approach.

  1. Defer extraction of the Kibana tarball until the image is run for the first time. This would reduce image size by ~60% and decrease pull times significantly. Essentially, we would move some of the permission commands that follow the COPY line below into an ENTRYPOINT script and extract the tarball into a mounted volume (see the sketch after this list).

https://github.com/elastic/kibana/blob/1ba8be4b8a247665986bb894632fcae07dee0e1d/src/dev/build/tasks/os_packages/docker_generator/templates/base/Dockerfile#L109-L110

  2. Use a single OS for the builder step and cache the artifact download and extraction. This allows the result to be reused for each OS instead of doing this work per OS.

  3. Extend the base OS images with the package-manager setup we require and host them in our repo, possibly updating them on a set interval?
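A rough sketch of what idea 1 might look like; the entrypoint script name, paths, and user are assumptions for illustration, not the current template:

```
# Dockerfile fragment (sketch): ship the tarball unextracted and extract on first boot.
COPY kibana.tar.gz /opt/kibana.tar.gz
COPY docker-entrypoint.sh /usr/local/bin/docker-entrypoint.sh
RUN chmod 0755 /usr/local/bin/docker-entrypoint.sh
VOLUME /usr/share/kibana
ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
```

```
#!/usr/bin/env bash
# docker-entrypoint.sh (sketch): extract Kibana into the mounted volume on the
# first start only, fix permissions, then hand off to the Kibana binary.
set -euo pipefail
if [ ! -x /usr/share/kibana/bin/kibana ]; then
  tar -xzf /opt/kibana.tar.gz -C /usr/share/kibana --strip-components=1
  chown -R kibana:root /usr/share/kibana
fi
exec /usr/share/kibana/bin/kibana "$@"
```

The trade-off raised below still applies: extraction now happens at container start, so the first boot is slower.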

watson commented 1 year ago

Defer extraction of the Kibana tarball until the image is run for the first time.

Have you measured how much extra boot-time this would add?

Use a single OS for the builder step and cache the artifact download and extraction. This allows the result to be reused for each OS instead of doing this work per OS.

Really good idea! 💯

Extend the base OS images with the package-manager setup we require and host them in our repo, possibly updating them on a set interval?

Have we measured how long it takes to build those layers? I'm worried about adding maintenance overhead, so knowing how much time could be shaved off would be interesting.