Use real-life jobs in testing pipelines

anders314159 commented 1 month ago

Following feedback from a project using Racetrack, concerns were raised:

1. Racetrack has to make it easy to implement end-to-end tests for projects with real-life jobs. Questions: Is this not already the case? What is Racetrack missing in order for projects to test real-life jobs on their own?

2. Racetrack can implement test cases that uses (or closely imitates) real-life jobs. Questions: Where to get real-life jobs that can be hosted in a public repo? How would these be kept up-to-date?

3. Alternatively to 2., Racetrack makes a job- and infrastructure agnostic testing template that lets projects provide jobs and infrastructure targets, input/output data, vibes, etc., such that the projects themselves can validate new Racetrack versions in their own environments. Questions: Will anyone actually use this? What do projects actually want/need? How do we strike the right balance between making it easy to implement for projects, but not making it so rigid that projects have no freedom to be creative?

@JosefAssad @iszulcdeepsense @ahnsn @LookACastle @MartinEnevig - thoughts?

JosefAssadERST commented 1 month ago

We're used to runtime config, could something interesting be done here with buildtime config? Specifically, as an optional step during build you can point at N job git remotes and the build tests against them?

LookACastle commented 1 month ago

My understanding is that we want to be able to fetch jobs in some capacity, and then run their tests. That is, a manifest can provide a path to tests, and Racetrack should be able to run them. I'm under the impression this probably means extending what a jobtype can include, in order to enable testing frameworks - or making new plugins for testing frameworks.

Once we can do this, we unlock a couple of things. We can say that a job doesn't go live on prod environments if it fails tests. We can support having prod enforce testing to some extent - maybe even coverage. And most importantly for our own developer ergonomics, we can have a better testing pipeline for racetrack to spot breaking changes.

That is, with good testing that is run Racetrack side, we can steal those from our users to run ourselves to ensure we aren't breaking things.

JosefAssadERST commented 1 month ago

want to be able to fetch jobs

If you're referring to the approach I described, yes.

and then run their tests

If you're referring to the approach I described, then not exactly. Rather:

Grab the job
Deploy it and check if that broke
(maybe?) watch that it doesn't keel over 10 seconds later
Using the sample inputs baked into the endpoints, try hitting those endpoints with that sample data and check it didn't keel over

And so on. The objective isn't to test the job, it's to test RT.

LookACastle commented 1 month ago

I think that by running a swathe of tests from differing jobs, we can offer value to the users, and test RT better. If all we were gunning for were testing RT, I think Anders's suggestion number 2 is the better approach.

JosefAssadERST commented 1 month ago

Hm, let me take a step back and rephrase what I think the objective is here, we might not be on the same wavelength.

We can improve the speed at which a new release hits our 4 deployment environments if we have greater confidence and we can improve confidence if we know that a bunch of things won't break.

One way to do this is to try out real-world jobs in the CI. See if they deploy, if they're catatonic, what have you. The objective isn't to run the test suite of a specific job, that is unlikely to tell you anything about whether RT itself is handling that job fine. The objective is to see if what the previous version could run, this version also can.

Hence, providing the build with links to real jobs, and giving them a whack on that the CI just built. Sure we can't test if a ML job gives a correct prediction, but that's not the objective.

There's a chance we're on the same page but I'm misunderstanding you, just when you say "running a swathe of tests" I get a picture in my mind og a test suite and pytest and so on. That's now what I'm aiming for. I want to catch all the cases (I know it's zero, @iszulcdeepsense you're incapable of creating bugs) where Irek tries to build a new version, it pulls in one of spandexdrengens models, and it breaks because Irek's new version forgot to account for some factor in spandexdrengens job (or more likely, the jobtype it uses).

Does that make sense anywhere that isn't inside my head?

LookACastle commented 1 month ago

The thing I've understood differently is that spandexdrengen, glory be upon him, might not have jobs we can just send dummy requests to. The job might inherently touch sensitive data, change database things, have real world effects (spandexdrenges might run his home automation via racetrack or something).

So, therefore, we make use of the fact that of course, spandexdrengen in his infinite wisdom has comprehensive testing. If he just provides those in a manifest to racetrack, we can catch bugs as you say.

There's a couple of wins here.

We don't have to write tests, they already exist
Racetrack starts having the foundation to be able to implement some sort of automatic detection of bad jobs (test coverage, tests failing, stuff like that)
Users can test things that don't make sense to test locally (chain calls come to mind), making their tests more comprehensive.

I have understood this as us aiming for pytest and so on, in other words. I won't insist on doing it that way, but I'm not sure how we'd do it another way without us having to maintain a test suite for something we have comparatively little control over.

anders314159 commented 1 month ago

What I tried to get at, was two different, somewhat orthogonal, approaches: 2. Our CI test real-life jobs (with all the caveats and wins mentioned above), to catch bugs that that occur IRL. 3. We make it easier for projects to test Racetrack on their own (with all the caveats and wins mentioned above).

Based on vibes, I like 3. best, but it requires some organizational buy-in from the projects - they have to do the work of ~~adding~~ having Racetrack testing to in their tests, so they might prefer it if we, as in approach 2., do most of the work.

I'm collecting feedback from a project on their current testing setup, so this discussion is still very preliminary.

LookACastle commented 1 month ago

I think "adding" racetrack testing is an unusual phrasing, otherwise I agree. After all, end-to-end testing is fairly standard and should touch the infrastructure, be it Racetrack or otherwise.

JosefAssadERST commented 1 month ago

The thing I've understood differently is that spandexdrengen, glory be upon him, might not have jobs we can just send dummy requests to. The job might inherently touch sensitive data, change database things, have real world effects (spandexdrenges might run his home automation via racetrack or something).

Keep in mind, RT has two contexts. This open source one where the code is shared, and the private ERST one where it is built. I'm talking about spandexdrengen hooking his jobs into the build in the latter. None of his top secret collection of cat pictures is shared.

Imagine in the quickstart, it says:

Download RT

If you have jobs you'd like to validate, list their remotes in the file my-test-jobs.yaml, then the build process will try to deploy them and then remove them again as part of the build process. <--- this is new

Click the big red build button on your Razer keyboard

run "deploy this software on my kubernotis cluster please" in your terminal

Grab a coffee then you're ready

So you see nothing is made public. And each individual site - be it us or someone else - can validate new versions as part of the build process to upgrade their RT with a bit more confidence.

So, therefore, we make use of the fact that of course, spandexdrengen in his infinite wisdom has comprehensive testing. If he just provides those in a manifest to racetrack, we can catch bugs as you say.

You mean if I write a job which adds two integers, I ship it with a python test which checks that given 2 and 3 we get 5? Sure that's nice, but it's orthogonal to the need which prompted this issue. When @iszulcdeepsense is sitting with a new RT version and he wants to deploy, he doesn't care if my job's tests pass. He cares if my job keels over and dies in the new version.

MartinEnevig commented 1 month ago

The main thing for us, is testing the ERST version of RT. Our jobs are quite heterogenous and do a lot of different stuff. Therefore, we have experienced that a job might fail after an RT upgrade, even if everything seems to run normally at first glance. I think what we need to test is something like this:

can jobs deploy after an update (this is relatively easy, and a dummy job can be used here)
do they stay up?
are deploy times affected by the update? We have a few very big jobs, and if deploy time is affected they might start timing out.
do our external requests still work?
do the jobs still return what we expect (not the predictions, just if they return the expected data)

There are probably other relevant stuff to test as well, but it's hard to think, when my clothes are so tight

iszulcdeepsense commented 4 weeks ago

Let's also keep in mind that Racetrack lives very close to Kubernetes in most of our environments, and its correctness may also depend on the current state of the Kubernetes at the specific point of time. That's why I keep asking myself a question: Would you rather run tests prior to the release or afterwards? I'd like to hear your opinion, whether you'd like to run the tests on a brand new CI environment (before releasing it to real clusters) or on the existing ones (to verify its integrity and make sure everything works)? I know running tests on the live environment sounds a bit nasty. On the other hand, some tricky bugs might be missed, if tested on the too artificial environment.

JosefAssadERST commented 4 weeks ago

The way the idea looks in my head, when you have a version you want to release and deploy in dev/test/etc., with the capablity described in this issue you can on your development ireneusz cluster plug in a list of 5 or 10 or 20 jobs that someone like @MartinEnevig supplies you with, then you can build and test there. That way, you can say to the team "Can I upgrade the real clusters? It passed all tests with your 20 models."

anders314159 commented 2 weeks ago

After a meeting with ML Lab, consensus seems to be that the ML Lab real-life jobs suitable for end-to-end tests require secrets, environment variables, and special incantations that Everyone™ prefers stay in the hands of ML Lab. This issue is therefore mostly an ML Lab issue.

ML Labs initial idea looks something like:

An E2E pipeline which deploy real-life jobs to a Racetrack, verifies that the jobs do jobby things - some of these things mentioned by Martin above.
Which Racetrack and where it lives should be variable, e.g. deploy to the already existing Racetrack on the X cluster, deploy to a fresh testing-Racetrack on the Y cluster.
Racetrack development team should also be able to trigger this pipeline when a new version of Racetrack is released, but the pipeline is wholly maintained by ML Lab.

I am still not against implementing some sort of "your-jobs-here-and-some-test-will-be-run.yaml"-template in the Racetrack code base that Racetrack users then fill in with their secrets and use in their own way, but I don't think ML Lab needs that, as they can 'just' make a zsh script in the pipeline that does more or less the same, but with more control over the specifics.

JosefAssadERST commented 2 weeks ago

The current consensus is:

ERST Racetrack users will develop their own CI which they will populate with their private jobs, and which deploys them and tests their declared endpoints (/perform most notably). They'll give RT developers a big red button to run this pipeline and check the results.

Is this correct? I'll note, that this deprives RT from offering to other non-ERST users a facility where they just need to plug in their jobs; anyone wanting to check their jobs will need to develop their own code to check endpoints.

It's fine with me, my role is mainly to have ideas bounced off of me. I just want us to be aware of all design consequences.

If it's correct then this issue should also be closed.

anders314159 commented 2 weeks ago

One doesn't preclude the other, but ML Lab's needs are best served by them having their own CI, which we in turn can check new versions against.

I think this issue could stay open. I'm just not gonna do it right now - unless someone tells me I can't get lunch until it is done.

JosefAssadERST commented 2 weeks ago

One doesn't preclude the other

Developing Microsoft Office so it's a userspace program doesn't preclude it from being rolled into the Windows kernel either, but that doesn't mean Microsoft should keep an issue open in their bug tracker for this.

Once Lab's needs are fulfilled there will be very little motivation to reimplement the same functionality RT-side. It's hard for me to see the value of keeping this issue open assuming the consensus I summarised is correct.

anders314159 commented 2 weeks ago

yea, "We COULD do it!" is not a good enough reason to keep the issue open.

TheRacetrack / racetrack

Use real-life jobs in testing pipelines #467