ableuler opened this issue 4 years ago
how is that different from doing a docker pull before the build? I guess it automates the process.
@ableuler: thanks for proposing this.
As I understand it, this approach would avoid reliance on a node-local cache, which in turn would make all gitlab-runners eligible for fast builds, irrespective of which layers happen to be cached on a given node.
If that is correct, it implies that gitlab-runner becomes yet another service that can run on, and move between, arbitrary k8s nodes without a speed penalty.
It would also allow us to deliver a Renku instance faster, since operations become simpler.
> If that is correct, it implies that gitlab-runner becomes yet another service that can run on, and move between, arbitrary k8s nodes without a speed penalty.
Not really - if you use docker for builds, you need to allow privileged containers to execute user code, which is not ideal. There are other solutions out there, like kaniko, which we have running in the dev cluster - it works, but we would need to devote some time to figuring out where it breaks before giving it to users as the default.
The builds will never be as fast as with a local cache because you need to pull the layers from the registry. This is (afaik) the equivalent of pulling the previously built image from the registry and allowing the local docker build to reuse those layers.
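To make that equivalence concrete, here is a minimal sketch of the relevant script lines (a fragment only; it uses GitLab's predefined CI_REGISTRY_IMAGE variable, and `latest` is just a placeholder for whichever tag holds the previous build):

```yaml
script:
  # Pull the image from the previous build; tolerate failure on the very first run.
  - docker pull $CI_REGISTRY_IMAGE:latest || true
  # Let the local docker build reuse layers from the pulled image.
  - docker build --cache-from $CI_REGISTRY_IMAGE:latest -t $CI_REGISTRY_IMAGE:latest .
```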
Well, let's consider using this:

```yaml
variables:
  DOCKER_TLS_CERTDIR: ''

services:
  - docker:dind
```
With this, the layer cache would not play along on its own, so it needs to be helped. Of course, dind has its own issues. BUT, the above allows deploying, in one go, a Renku setup that is capable of building images.
Yes, dind is not ideal: https://jpetazzo.github.io/2015/09/03/do-not-use-docker-in-docker-for-ci/
We also came across this topic when examining Renku over OpenShift last spring:
https://swiss-data-science.slack.com/archives/GGH1M8675/p1558470359007200
Even if dind comes with its baggage, at least it circumvents the need for intervening with `-v /var/run/docker.sock:/var/run/docker.sock`.
Although it would eventually work, it has become an operational pain time and again. So, what I'd favour is an approach that works for all cases - and that can then be improved upon.
Why is that an operational pain? Sounds like we need to improve the doc, but the setup is pretty straightforward.
See https://docs.gitlab.com/ee/ci/docker/using_docker_build.html and the warnings about dind there
It is an operational pain because we have to step outside the k8s/helm design and rely on external resources, typically via a VM. It also has an impact on the minimum Renku footprint.
Btw, on the dind link above, we read:
> one of the big design decisions was to gather all the container operations under a single daemon and be done with all that concurrent access nonsense.
Following the same principle, it makes sense to centralise layer management at the registry, without needing to rely on any particular node's state - the current design makes the runner appear as a "special service" that sits outside k8s. I believe we could do better than that, no?
The ultimate goal would be to be able to rely on this chart, at least for a subset of Renku users: https://docs.gitlab.com/runner/install/kubernetes.html#installing-gitlab-runner-using-the-helm-chart
For that to work optimally, I understand what @ableuler has proposed as a prerequisite - please correct me if that is not the case.
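As a rough sketch, a values file for that chart could look something like the following; the URL and token are placeholders, and the key names should be checked against the chart version actually deployed:

```yaml
# Hypothetical values.yaml for the gitlab-runner Helm chart.
gitlabUrl: https://gitlab.example.com/           # placeholder
runnerRegistrationToken: "<registration-token>"  # placeholder

rbac:
  create: true

runners:
  # Default image for build jobs.
  image: ubuntu:18.04
  # Can stay false for kaniko-style builds; docker/dind builds would require true.
  privileged: false
```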
yes, that's the chart that is running on our dev deployment - but not in privileged mode.
OK, just to be clear, I'm not implying that dind + cache is the last word on this topic. But it would be a better trade-off between features and operational convenience, for now.
Addendum: as regards gitlab-runner over kaniko, I think this is the "state of the art", i.e. ongoing:
https://gitlab.com/gitlab-org/charts/gitlab/issues/1871#note_281842986
Also: https://github.com/GoogleContainerTools/kaniko#comparison-with-other-tools
This needs more work on the deployment side - you need to isolate those containers, preferably into a different namespace. You can't just run privileged containers that users have access to anywhere in the cluster.
@rokroskar: I think you would agree that that is more of a docker problem than a renku problem...
Sure, but if we are using docker to deploy the platform it's our problem. Kaniko (and other similar projects) give us image builds in unprivileged containers for free; we just have to spend a bit more time evaluating them. The biggest reason why I paused on Kaniko was that there were some caching issues last year - maybe this has been resolved, I'm not sure. I have an example project here in case you're interested: https://dev.renku.ch/gitlab/rokroskar/test/-/jobs/2012
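For reference, a kaniko build job in .gitlab-ci.yml typically looks roughly like the sketch below; it uses standard GitLab CI variables, and the `--cache=true` flag is the part whose behaviour was still in question:

```yaml
build:
  stage: build
  image:
    name: gcr.io/kaniko-project/executor:debug
    entrypoint: [""]
  script:
    # Registry credentials for kaniko (it pushes the image itself; no docker daemon, no privileged mode).
    - mkdir -p /kaniko/.docker
    - echo "{\"auths\":{\"$CI_REGISTRY\":{\"username\":\"$CI_REGISTRY_USER\",\"password\":\"$CI_REGISTRY_PASSWORD\"}}}" > /kaniko/.docker/config.json
    # Build and push in an unprivileged container, caching layers in the registry.
    - /kaniko/executor
        --context $CI_PROJECT_DIR
        --dockerfile $CI_PROJECT_DIR/Dockerfile
        --destination $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
        --cache=true
```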
I believe that having the gitlab-runner(s) use the registry as a cache, run in their own namespace, and be tainted to specific node(s) gives us more flexibility and a more forward-looking design than having that in a VM (a sketch of the namespace/taint part follows below). I have the impression that we hijacked @ableuler's original intention (caching) into a topic which may be worthy of its own place, so feel free to diverge to slack etc.
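A sketch of what "own namespace + tainted node" could translate to in practice; the node, label, and namespace names are invented, and how exactly these constraints are wired into the runner chart's values depends on the chart version:

```yaml
# The runner would live in a dedicated namespace, e.g.:
#   helm install --namespace ci-builds ... gitlab/gitlab-runner
# A build node would first be tainted and labelled, e.g.:
#   kubectl taint nodes build-node-1 dedicated=ci:NoSchedule
#   kubectl label nodes build-node-1 dedicated=ci
# The runner (and/or its build pods) then needs matching scheduling constraints:
nodeSelector:
  dedicated: ci
tolerations:
  - key: dedicated
    operator: Equal
    value: ci
    effect: NoSchedule
```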
> how is that different from doing a docker pull before the build? I guess it automates the process.
Without having tried it, I would expect the external cache to be a bit smarter. I agree that in the case of forking, simply pulling the image in advance should be as good.
I tried this on a project for Mark Robinson's UZH class: https://renkulab.io/gitlab/rok.roskar/bio334_spring2020/-/jobs/37373
Seems to work. The annoying thing is that you have to specify the image and tag that you want to use the cached layers from (the cache in the local builder is a bit smarter and just finds the layers if they exist). I got around this by always pushing `latest` and then using `latest` as the cache in the next build. I guess this is ok but not ideal. cc @emmjab and @pameladelgado
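Reconstructed as a sketch (not the exact job from the linked project, and the SHA_IMAGE/CACHE_IMAGE variable names are arbitrary), that approach would look roughly like this:

```yaml
image: docker:19.03
services:
  - docker:dind

variables:
  DOCKER_TLS_CERTDIR: ''
  SHA_IMAGE: $CI_REGISTRY_IMAGE:$CI_COMMIT_SHORT_SHA
  CACHE_IMAGE: $CI_REGISTRY_IMAGE:latest

build:
  stage: build
  script:
    - docker login -u $CI_REGISTRY_USER -p $CI_REGISTRY_PASSWORD $CI_REGISTRY
    # Seed the layer cache from the last published image; the very first build has nothing to pull.
    - docker pull $CACHE_IMAGE || true
    - docker build --cache-from $CACHE_IMAGE -t $SHA_IMAGE -t $CACHE_IMAGE .
    - docker push $SHA_IMAGE
    # Always push latest so the next build, on any node, can use it as the cache source.
    - docker push $CACHE_IMAGE
```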
**Is your feature request related to a problem? Please describe.**
Builds (for example after forking a project) can take a long time when they are assigned to a runner which does not have the necessary layers in its cache.

**Describe the solution you'd like**
According to this reference it should be possible to point to a registry and use it as a build cache. Could this solve our problem?