SwissDataScienceCenter / renku

Renku provides a platform and tools for reproducible and collaborative data analysis.
https://renkulab.io
Apache License 2.0

Use registry as external cache for image builds #788

Open ableuler opened 4 years ago

ableuler commented 4 years ago

Is your feature request related to a problem? Please describe.
Builds (for example after forking a project) can take a long time when they are assigned to a runner that does not have the necessary layers in its cache.

Describe the solution you'd like
According to this reference it should be possible to point to a registry and use it as a build cache. Could this solve our problem?
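For illustration, with BuildKit the registry can serve as an external layer cache roughly like this in a `.gitlab-ci.yml` job (a sketch under assumptions: the job name, cache ref, and availability of `buildx` are illustrative, not from the thread):

```yaml
# Hypothetical job using BuildKit's registry cache export/import.
# The cache ref ($CI_REGISTRY_IMAGE/cache) is an assumption for illustration.
build:
  image: docker:stable
  services:
    - docker:dind
  script:
    # Pull cache layers from the registry, push updated cache back after the build.
    - docker buildx build
        --cache-from type=registry,ref=$CI_REGISTRY_IMAGE/cache
        --cache-to type=registry,ref=$CI_REGISTRY_IMAGE/cache,mode=max
        -t $CI_REGISTRY_IMAGE:$CI_COMMIT_SHA
        --push .
```

With this, any runner can warm its cache from the registry, so builds are no longer tied to whichever node happens to hold the layers locally.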

rokroskar commented 4 years ago

how is that different from doing a docker pull before the build? I guess it automates the process.

fgeorgatos commented 4 years ago

@ableuler : thanks for proposing this.

I understand that this approach would avoid reliance on a node's local cache, which in turn would make all gitlab-runners eligible for fast builds, irrespective of the layer cache on any given node.

If that statement is correct, it implies that gitlab-runner becomes yet another service that can run on, and move between, arbitrary k8s nodes without a speed penalty.

It would also allow us to deliver a Renku instance faster, since operations become simpler.

rokroskar commented 4 years ago

If that statement is correct, it implies that gitlab-runner becomes yet another service that can run on, and move between, arbitrary k8s nodes without a speed penalty.

Not really - if you use docker for builds, you need to allow privileged containers to execute user code, which is not ideal. There are other solutions out there like kaniko, which we have running in the dev cluster - it works, but we would need to devote some time to figuring out where it breaks before giving it to users as the default.

The builds will never be as fast as with a local cache because you need to pull the layers from the registry. This is (afaik) the equivalent of pulling the previously built image from the registry and allowing the local docker build to reuse those layers.

fgeorgatos commented 4 years ago

well, let's consider using this:

variables:
  DOCKER_TLS_CERTDIR: ''
services:
  - docker:dind

In that setup the layer cache would not work out of the box, so it needs some help.

Of course, dind has its own issues. BUT, the above allows deploying, in one go, a Renku setup that can build images.

rokroskar commented 4 years ago

Yes, dind is not ideal: https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/

fgeorgatos commented 4 years ago

we came across this topic also when examining Renku over Openshift, last spring: https://swiss-data-science.slack.com/archives/GGH1M8675/p1558470359007200

Even if dind comes with its baggage, at least it circumvents the need for intervention with

So, what I'd favour is an approach that works for all cases - and then can be improved upon.

rokroskar commented 4 years ago

Why is that an operational pain? Sounds like we need to improve the doc, but the setup is pretty straightforward.

rokroskar commented 4 years ago

See https://docs.gitlab.com/ee/ci/docker/using_docker_build.html and the warnings about dind there

fgeorgatos commented 4 years ago

it is an operational pain because we have to step outside the k8s/helm design and rely on external resources, typically a VM. That has an impact on the minimum Renku footprint.

btw, in the dind link above, we read: "one of the big design decisions was to gather all the container operations under a single daemon and be done with all that concurrent access nonsense."

Following the same principle, it makes sense to centralise layer management at the registry, without relying on any particular node's state - unlike the current design, which makes the runner a "special service" that is not under k8s. I believe we could do better than that, no?

The ultimate goal would be to rely on this chart, at least for a subset of Renku users: https://docs.gitlab.com/runner/install/kubernetes.html#installing-gitlab-runner-using-the-helm-chart For that to work optimally, I understand what @ableuler has proposed to be a prerequisite - please correct me if that is not the case.

rokroskar commented 4 years ago

yes, that's the chart that is running on our dev deployment - but not in privileged mode.

fgeorgatos commented 4 years ago

OK, just to be clear, I'm not implying that using dind+cache is the last word on this topic. But it would be a better trade-off between features and operational convenience for now.

Addendum: as regards gitlab-runner over kaniko, I think this is the "state of the art", i.e. ongoing: https://gitlab.com/gitlab-org/charts/gitlab/issues/1871#note_281842986 Also: https://github.com/GoogleContainerTools/kaniko#comparison-with-other-tools

rokroskar commented 4 years ago

You need more work on the deployment aspect to do this - you need to isolate those containers, preferably to a different namespace. You can't just run privileged containers that users have access to anywhere in the cluster.

fgeorgatos commented 4 years ago

@rokroskar : I think you would agree that that is more of a docker problem than a renku problem...

rokroskar commented 4 years ago

Sure, but if we are using docker to deploy the platform, it's our problem. Kaniko (and other similar projects) gives us image builds in unprivileged containers for free; we just have to spend a bit more time evaluating them. The biggest reason I paused on Kaniko was that there were some issues last year around caching - maybe this has been resolved, I'm not sure. I have an example project here in case you're interested: https://dev.renku.ch/gitlab/rokroskar/test/-/jobs/2012
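For reference, a kaniko build job in `.gitlab-ci.yml` typically looks something like the sketch below; the actual job in the linked example project may differ, and the cache repo name is an assumption:

```yaml
# Sketch of a kaniko build job: unprivileged, no docker daemon required.
# Registry credentials would additionally need to be provided to kaniko
# via /kaniko/.docker/config.json (omitted here for brevity).
build:
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    - /kaniko/executor
        --context "$CI_PROJECT_DIR"
        --dockerfile "$CI_PROJECT_DIR/Dockerfile"
        --destination "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
        --cache=true
        --cache-repo "$CI_REGISTRY_IMAGE/cache"
```

The `--cache` / `--cache-repo` flags are kaniko's own registry-backed layer caching, which is exactly the part that had open issues at the time.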

fgeorgatos commented 4 years ago

I believe that having the gitlab-runner(s) use the registry as a cache, in their own namespace and pinned via taints to specific node(s), gives us more flexibility and a more forward-looking design than having them in a VM. I am under the impression that we hijacked @ableuler's original caching intent for a topic that may be worthy of its own place, so feel free to diverge to slack etc.

ableuler commented 4 years ago

how is that different from doing a docker pull before the build? I guess it automates the process.

Without having tried it, I would expect the external cache to be a bit smarter:

I agree that in the case of forking, simply pulling the image in advance should be as good.

rokroskar commented 4 years ago

I tried this on a project for Mark Robinson's UZH class: https://renkulab.io/gitlab/rok.roskar/bio334_spring2020/-/jobs/37373

Seems to work. The annoying thing is that you have to specify the image and tag that you want to use the cached layers from (the cache in the builder is a bit smarter and just finds the layers if they exist). I got around this by always pushing latest and then using latest as the cache in the next build. I guess this is ok but not ideal. cc @emmjab and @pameladelgado
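The always-push-latest pattern described above could look roughly like this (a sketch under assumptions; this is not the actual job from the linked project):

```yaml
# Sketch of the "always push latest, cache from latest" workaround:
# --cache-from needs an explicit image ref, so latest serves as the
# well-known cache source for every build.
build:
  image: docker:stable
  services:
    - docker:dind
  script:
    - docker login -u "$CI_REGISTRY_USER" -p "$CI_REGISTRY_PASSWORD" "$CI_REGISTRY"
    # First build of a project has nothing to pull; ignore the failure.
    - docker pull "$CI_REGISTRY_IMAGE:latest" || true
    - docker build --cache-from "$CI_REGISTRY_IMAGE:latest"
        -t "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
        -t "$CI_REGISTRY_IMAGE:latest" .
    - docker push "$CI_REGISTRY_IMAGE:$CI_COMMIT_SHA"
    - docker push "$CI_REGISTRY_IMAGE:latest"
```

Pushing both tags keeps a stable cache source (`latest`) while still producing an immutable per-commit tag.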