Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces
ci: add automated and on demand testing of fluence #49

vsoch commented 7 months ago

Problem: we cannot tell if/when fluence builds will break against upstream.

Solution: have a weekly run that will build and test images, and deploy on successful results.

For testing, I have added a complete example that uses a Job for both fluence and the default-scheduler. The reason is that we can run a container that generates output and have it complete, with no crash loop backoff or similar. The testing setup uses kind, and it lives in one GitHub job so we can build both containers, load them into kind, and then run the tests. Note that MiniKube does NOT appear to work for custom schedulers - I suspect there are extensions/plugins that need to be added. Finally, I was able to figure out how to programmatically check both the pod metadata for the scheduler and the events, and that combined with the output should be sufficient (for now) to test that fluence is working.
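
As a rough sketch of the kind of check described above (not the exact script in this PR), a helper could wait for a Job to complete, confirm which scheduler its pod requested, and print the container output. The function name `check_job`, its arguments, and the 120s timeout are hypothetical choices for illustration. It assumes the Job's pod carries the `job-name` label.

```shell
# Hypothetical helper: wait for a Job, then confirm which scheduler its pod requested.
# Usage: check_job <job-name> <expected-scheduler>
check_job() {
  local job="$1" expected="$2" pod scheduler

  # Block until the Job reports the Complete condition (the timeout is an arbitrary choice).
  kubectl wait --for=condition=complete "job/${job}" --timeout=120s >/dev/null || return 1

  # Look up the first pod created for this Job via its job-name label.
  pod=$(kubectl get pods --selector="job-name=${job}" -o jsonpath='{.items[0].metadata.name}') || return 1

  # Read the scheduler requested in the pod spec.
  scheduler=$(kubectl get pod "${pod}" -o jsonpath='{.spec.schedulerName}') || return 1

  if [ "${scheduler}" != "${expected}" ]; then
    echo "pod ${pod}: expected scheduler ${expected}, got ${scheduler}" >&2
    return 1
  fi

  # Print the container output so the caller can compare it as well.
  kubectl logs "${pod}"
}
```

It would be called once per job, for example `check_job fluence-job fluence` and `check_job default-job default-scheduler`, with a non-zero return failing the CI step.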


In summary this PR:

Interesting Things I Learned

I found these commands useful for checking scheduler assignment. The first is the schedulerName (generated from the job):

default_scheduled_by=$(kubectl get pod "${pod}" -o json | jq -r '.spec.schedulerName')
# either "fluence" or "default-scheduler"

That worked for both. But it might be the case that the schedulerName we provide is not actually the one assigned (or maybe the pod doesn't run if the request can't be satisfied, I'm not sure). Either way, it makes sense to also check via the event. Getting the event was trickier: in both cases I was interested in the event with "reason" set to "Scheduled". For fluence, I found the name under .reportingComponent; for the default-scheduler that field was blank, and I found it under .source.component. For those interested, here are two events to compare.

Default Scheduler "Scheduled" Event

```json
{
  "kind": "Event",
  "apiVersion": "v1",
  "metadata": {
    "name": "default-job-vrkwd.17a1c42e4317ec63",
    "namespace": "default",
    "uid": "bc1c4fa7-f8c8-41d1-8dce-7e055d4f2eac",
    "resourceVersion": "845",
    "creationTimestamp": "2023-12-18T00:03:57Z"
  },
  "involvedObject": {
    "kind": "Pod",
    "namespace": "default",
    "name": "default-job-vrkwd",
    "uid": "09b44c86-4cce-4d5a-986a-ef062a8715a8",
    "apiVersion": "v1",
    "resourceVersion": "841"
  },
  "reason": "Scheduled",
  "message": "Successfully assigned default/default-job-vrkwd to kind-control-plane",
  "source": {
    "component": "default-scheduler"
  },
  "firstTimestamp": "2023-12-18T00:03:57Z",
  "lastTimestamp": "2023-12-18T00:03:57Z",
  "count": 1,
  "type": "Normal",
  "eventTime": null,
  "reportingComponent": "",
  "reportingInstance": ""
}
```

And for fluence we actually see the opposite, where source is empty:

Fluence "Scheduled" Event

```json
{
  "kind": "Event",
  "apiVersion": "v1",
  "metadata": {
    "name": "fluence-job-wmjqj.17a1c42e39ffd2c6",
    "namespace": "default",
    "uid": "51ab5daf-28d2-4b0f-9708-fb89870c89e6",
    "resourceVersion": "838",
    "creationTimestamp": "2023-12-18T00:03:56Z"
  },
  "involvedObject": {
    "kind": "Pod",
    "namespace": "default",
    "name": "fluence-job-wmjqj",
    "uid": "32c1771d-bc75-4368-bb8c-b9761ba34aef",
    "apiVersion": "v1",
    "resourceVersion": "834"
  },
  "reason": "Scheduled",
  "message": "Successfully assigned default/fluence-job-wmjqj to kind-control-plane",
  "source": {},
  "firstTimestamp": null,
  "lastTimestamp": null,
  "type": "Normal",
  "eventTime": "2023-12-18T00:03:56.943351Z",
  "action": "Binding",
  "reportingComponent": "fluence",
  "reportingInstance": "fluence-fluence-7d6c87f5cf-6cplb"
}
```

I thought that was interesting. It looks like, by design, the default-scheduler is not considered an extra reporting component (and fluence is), while fluence is not considered a core kubernetes source. I have no idea why, so I'll probably Google around / ask people about that subtle difference. So here is the jq fu (jq is the best tool!) to get the exact output for each:

kubectl events --for "pod/${fluence_job_pod}" -o json | jq -c '[ .items[] | select( .reason | contains("Scheduled")) ]' | jq -r '.[0].reportingComponent'
kubectl events --for "pod/${default_job_pod}" -o json | jq -c '[ .items[] | select( .reason | contains("Scheduled")) ]' | jq -r '.[0].source.component'
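
Since the two commands differ only in which field they read, one possible simplification (a sketch, not what this PR ships) is a single jq filter that tries .reportingComponent first and falls back to .source.component. The function name `scheduled_by` is hypothetical. Note that jq's `//` operator alone would not be enough for the fallback, because the default-scheduler event carries an empty string rather than null, so the filter compares against "" explicitly. It also matches the reason exactly instead of using contains.

```shell
# Hypothetical helper: print the component that emitted the first "Scheduled" event.
# Reads the JSON produced by `kubectl events ... -o json` on stdin.
scheduled_by() {
  jq -r '
    # Keep only the events whose reason is exactly "Scheduled", then take the first.
    [ .items[] | select(.reason == "Scheduled") ]
    | .[0]
    # Prefer reportingComponent; treat null and "" as missing and fall back to source.component.
    | if (.reportingComponent // "") != ""
      then .reportingComponent
      else (.source.component // "")
      end
  '
}
```

With that in place, the same call would work for either pod, for example `kubectl events --for "pod/${fluence_job_pod}" -o json | scheduled_by`. It prints an empty line when no "Scheduled" event is found.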

This might take a few iterations to get working in CI (I haven't used this setup kind action before) and I can ping folks when it is done.

Ok, everything is set. Ping @cmisale and @milroy for review, and of course no rush, it's ready when we need it!

milroy commented 6 months ago

Adds a testing workflow, with triggers for on demand, weekly testing, and pull requests (deploy on all but pull request) [...] Updates the README to reflect the above, and removes "Under Construction" because (after these) we will be pretty good to not be in that state.

This will be extremely helpful to reduce drift from upstream.

vsoch commented 6 months ago

Huge agree! It will be much easier to fix tiny issues that pop up along the way, and I volunteer to take charge of monitoring that (and opening PRs with any fixes that are needed). This setup is also useful for making sure the containers we deploy at the same frequency (fluence and the sidecar) come from the latest working build against kubernetes-sigs/scheduler-plugins.