TheRacetrack / racetrack

An opinionated framework for deploying, managing, and serving application workloads
https://theracetrack.github.io/racetrack/
Apache License 2.0

Make a job from pre-built image #503

Closed: iszulcdeepsense closed this issue 2 months ago

iszulcdeepsense commented 2 months ago

We've got the Dockerfile job type to give better control over the Docker image of a job. In theory, this makes it possible to build the same job image on your own, but after submitting the job to Racetrack, Racetrack will build the image again anyway. In a scenario with a CI pipeline, building the same image multiple times seems unnecessary. The scenario is as follows:

  1. To make sure the job is correct, it is tested in a CI pipeline: the job image is built from a Dockerfile and verified by a test suite.
  2. After successful verification, it may go to a dev/test/prod environment. However, on job submission, Racetrack will build it again, triggering image-builder, pushing to the registry, etc.; thus running a different image as a result.

Instead, Racetrack could make use of the pre-built image already present in the Docker registry. This would make sure that the same, verified image gets deployed, and that the same image runs in all environments. Building it from scratch doesn't really guarantee the resulting image will be identical (for instance, if the Racetrack versions differ, or someone pushed another commit in the meantime).

A new job type

Most likely, this would be another job type plugin that allows deploying a pre-built job, given the URL of the Docker image in the registry. Some changes in the Racetrack core would be needed to skip the image-builder step. An example manifest would be:

name: my-pre-built-job
version: 0.1.2
jobtype: docker-registry:latest
jobtype_extra:
    image_url: ghcr.io/theracetrack/racetrack/my-pre-built-job:0.1.2

This job type would be a step beyond what the Dockerfile job type does.

100% Reproducible builds

Unfortunately, a job image built locally is not 100% the same as the one built by Racetrack's image-builder. There is a slight difference: image-builder creates an additional outcome manifest job.yaml in the working directory of the job: https://github.com/TheRacetrack/racetrack/blob/e55a1e79f36cfc5db86c7ce3e2f78e9082b9dddf/image_builder/image_builder/build.py#L72

Currently, this causes an issue with job-runner-python-lib, which expects a combined job.yaml. As a result, when multiple manifest layers are used, the job only works when the image is built by Racetrack, not locally.
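
For illustration, combining multiple manifest layers into one job.yaml amounts to a recursive deep merge, roughly like the sketch below (plain dicts stand in for parsed YAML; the helper name and the actual merging logic in Racetrack are assumptions, not its real API):

```python
# Sketch: deep-merge layered manifests into one combined manifest,
# approximating the combined job.yaml that image-builder writes out.
# merge_manifests is a hypothetical helper; Racetrack's real logic may differ.

def merge_manifests(base: dict, overlay: dict) -> dict:
    """Recursively overlay one manifest layer on top of another."""
    merged = dict(base)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_manifests(merged[key], value)
        else:
            merged[key] = value
    return merged

# A base manifest checked into the repo, plus an environment-specific layer:
base = {
    'name': 'my-job',
    'jobtype': 'python3:latest',
    'jobtype_extra': {'requirements_path': 'requirements.txt'},
}
overlay = {
    'version': '0.1.2',
    'jobtype_extra': {'entrypoint_path': 'main.py'},
}

combined = merge_manifests(base, overlay)
print(combined)
```

A locally built image that never went through this merge step would carry only the base manifest, which is exactly the mismatch the library stumbles on.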

(idea originally conceived by @ahnsn )

JosefAssadERST commented 2 months ago

Interesting idea. What is the specific need or problem from which this idea emerged?

iszulcdeepsense commented 2 months ago

> Interesting idea. What is the specific need or problem from which this idea emerged?

They started testing jobs in a CI pipeline and noticed that the "same" image didn't work in CI but was deployed successfully in Racetrack. They'd like to make sure they're working on the same images.

JosefAssadERST commented 2 months ago

> Interesting idea. What is the specific need or problem from which this idea emerged?
>
> They started testing jobs in CI pipeline and noticed that the "same" image didn't work on CI but was deployed successfully in Racetrack. They'd like to make sure they're working on the same images.

Hm, alright. Not to be pedantic, but it seems to me like the pipeline's problem, not RT's.

I'm tempted to push back a bit. This change does increase the functional footprint of RT. It goes from "and then RT builds the deployable payload" to "and then RT either builds the deployable payload, or gets it from a docker registry, or some other repository for build artifacts, or does something else entirely to obtain the built artifact". It's cognitively simpler if the only thing RT ever does is build.

Maybe the CI pipeline can run it in a KinD RT? Or maybe RT can have a "build but don't deploy just yet" flag, and then the CI pipeline can wget the image RT built?

I won't insist on any of this, but I fear the size of the proposed change is bigger in the long run than it seems when just seen from the perspective of this specific need.

iszulcdeepsense commented 2 months ago

Good point about breaking the standard deployment procedure. But perhaps this problem disappears if we put it another way. In fact, what job types and the image builder actually do is not necessarily build the image, but make sure the proper image is in the Docker registry. Sometimes that just means finding it's already in the cache and doing nothing. So if we say that job types are supposed to return the URL of the image, we could have a dummy job type that does nothing but return the pre-built image's URL. And it would fit into the standard workflow. Yes, it bends the rules a bit, but maybe it makes for a more flexible system.
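
The "dummy job type that returns a pre-built image" idea could be sketched roughly like this (every class and method name below is hypothetical, not the actual Racetrack plugin API):

```python
# Sketch of a dummy job type that skips building entirely and just
# returns the URL of a pre-built image taken from the manifest.
# PrebuiltImageJobType and provide_image_url are made-up names,
# not the real Racetrack plugin interface.

class PrebuiltImageJobType:
    def provide_image_url(self, manifest: dict) -> str:
        """Return the image the deployer should run, without building anything."""
        image_url = manifest.get('jobtype_extra', {}).get('image_url')
        if not image_url:
            raise ValueError('jobtype_extra.image_url is required for this job type')
        return image_url

manifest = {
    'name': 'my-pre-built-job',
    'version': '0.1.2',
    'jobtype': 'docker-registry:latest',
    'jobtype_extra': {
        'image_url': 'ghcr.io/theracetrack/racetrack/my-pre-built-job:0.1.2',
    },
}
print(PrebuiltImageJobType().provide_image_url(manifest))
```

Under this framing, a building job type and a pre-built job type share the same contract: both hand the deployer an image URL, and only one of them runs a build to produce it.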

I find this feature useful in terms of reproducibility. Imagine being able to take exactly the same image that someone else used, in order to reproduce the unexpected behaviour. Fun fact: I was looking for this kind of improvement when I was debugging something recently. It would have saved me time back then.

JosefAssadERST commented 2 months ago

Alright that's good input. First thing that strikes me:

> In fact, what job types and image builder actually do is not necessarily building the image, but to make sure the proper image is in the docker registry

Is "image builder" the right name for this component, then? In theory, Anna could build her Docker image by hand, upload it to the Docker registry, and thereby bypass RT building the image entirely. Is my understanding correct? I'm not going to criticise the functional aspect of this, but two things bother me about it:

  1. image builder has the wrong name. It only builds sometimes, conditionally.
  2. It's implicit behavior: it's implicit that you can bypass RT's build.

iszulcdeepsense commented 2 months ago

Yeah, you got me on that "image builder" point. Something's clearly odd about that idea. I'm not insisting on it; I was just trying to look at it from different points of view. Now I think that if we're going to do it at all, it would be better to bypass building in an explicit way.

JosefAssadERST commented 2 months ago

And if the user hand-builds an image and puts it in the registry, and then deploys the job, then RT finds this image and thinks "OK I'll just use that". Is that secure?

In other terms which might encourage your imagination to create a few less pleasant scenarios: a user comes to you and says a job isn't doing what it's supposed to. Do you want to wonder where the image came from and how it was built?

iszulcdeepsense commented 2 months ago

That's another really good point. Administration might turn into a nightmare because of this. Now I wonder about the Dockerfile job types that are already in use. Hasn't this risk already materialized?

> Is that secure?

I guess it's no less secure than using the Dockerfile job type and making Racetrack build and run an arbitrary image. At least with Dockerfile job types it looks a bit better: if something goes wrong, Racetrack keeps the build logs to peek into, if needed.

JosefAssadERST commented 2 months ago

> Administering might turn into a nightmare because of this

Provenance and administration are a lot cleaner when what goes into RT is source and a manifest, and what comes out the other end is 100% of the time produced by RT.

That's why my first thought for the users testing their images was to work on the CI to make it match RT.

iszulcdeepsense commented 2 months ago

One more thought: on the other hand, this "feature" may reduce the number of times someone complains that a job doesn't work. Building jobs on their own tightens the feedback loop and gives users a way to find and fix bugs themselves. (Plus, every build done by Racetrack is one more place where Racetrack has a chance to break.)

iszulcdeepsense commented 2 months ago

To sum up: I think we're not convinced by this idea at the moment. Our recommendation would be to use the full setup of Racetrack in CI.

JosefAssadERST commented 2 months ago

> full setup of Racetrack in CI

Not KinD?

iszulcdeepsense commented 2 months ago

> Not KinD?

Yes, I meant KinD actually.

iszulcdeepsense commented 2 months ago

The original issue about not-quite-reproducible builds caused by multi-layer manifests will be solved in https://github.com/TheRacetrack/job-runner-python-lib/issues/5, as this is a problem in the library. I'm closing this issue.

Fun fact: here's how to deploy a pre-built image today. Push your image to the registry and use the Dockerfile job type with the following Dockerfile:

FROM ghcr.io/theracetrack/racetrack/my-pre-built-job:0.1.2