flux-framework / flux-k8s

Project to manage Flux tasks needed to standardize kubernetes HPC scheduling interfaces
Apache License 2.0

Update fluence to use flux-sched 0.31.0 and build from kubernetes-sigs/scheduler-plugins #47

Closed. vsoch closed this 6 months ago.

vsoch commented 7 months ago

This is the first part of a two part refactor that will update the current main branch with enough changes to be in the "same" state, but with updated underlying libraries and building from kubernetes-sigs/scheduler-plugins.

Summary

Specifically:

Notes and Questions

My one question for the last commit is about the error code we pass here:

- dr := &pb.CancelResponse{JobID: in.JobID, Error: int32(err)}
+ dr := &pb.CancelResponse{JobID: in.JobID}

I didn't understand it because we would return from the function if the error is less than 0 (indicating an error). So wouldn't the error value passed reflect not the job itself, but the error from cancelling it? If we want the error code there to describe the job (why the job errored), we would need to find that somewhere else. If it's for the cancel, then it's always going to be 0 (and we can explicitly set it or leave it out). I just wanted to double check that I understand the logic.
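To make the control flow concrete, here is a minimal sketch of the handler shape I mean (hypothetical: pb is the generated protobuf package, Fluxion is a stand-in for the service type, and s.cli wraps the new flux-sched bindings; the real code may differ in detail):

// Hypothetical sketch, not the exact fluence code.
func (s *Fluxion) Cancel(ctx context.Context, in *pb.CancelRequest) (*pb.CancelResponse, error) {
    err := s.cli.Cancel(int64(in.JobID), true)
    if err != nil {
        // Early return on failure: the response below is only built on success.
        return nil, err
    }
    // By this point the cancel succeeded, so any Error we set could only be 0;
    // leaving the field unset relies on the int32 default of zero.
    dr := &pb.CancelResponse{JobID: in.JobID}
    return dr, nil
}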

Note that I did the above while carefully pinning flux-sched 0.31.0 (the most recent release), both for the build and the module import. I think we will likely want to update this when we can get SHARED instead of STATIC (and remove the sed statements in the Dockerfile).

Finally, the next step (after this work, and aside from the SHARED library issue above) is to add automated builds and testing that run regularly, along with pushes of the corresponding tags. I wanted to do this in two pieces because this PR is already too big!

Ping @cmisale @milroy

milroy commented 6 months ago

I looked at the PR but haven't officially reviewed it because the item @vsoch identified in the last commit requires some thought.

I didn't understand it because we would return from the function if the error is less than 0 (indicating an error).

That is my interpretation, too. Note that the CancelResponse struct defined here and the message defined here both include an error data member. Perhaps a partially-defined CancelResponse implies default values for unspecified member data? I'm not familiar with gRPC Protocol Buffers.

So wouldn't the error value passed reflect not the job itself, but the error from cancelling it?

There's one more layer of complexity, I think. s.cli.Cancel can return an error, but the actual cancel graph traversal can return a different error (evidenced by this call: https://github.com/flux-framework/flux-k8s/blob/93b17961473d695e0f0f48e5734fd53113267437/scheduler-plugin/fluence/fluxion/fluxion.go#L62)

On a related note, the other issue I see is that Cancel is also invoking Info. I think that was originally for debugging, but it's not expected behavior in Fluxion.

@cmisale can you weigh in here?

vsoch commented 6 months ago

@milroy thanks for the preliminary review! I will also await @cmisale's feedback, but for the work here I would say we should separate these questions/issues from the updates that need to go in soon, which are predominantly getting automated builds and testing in and getting fluence back into a working state. My suggested order of steps is:

  1. After review, merge this PR and #49 (this gives us automated testing and updates fluence to work again)
  2. Then discuss the new issues we found here related to cancel (and fix them, I can offer to do work that is needed)
  3. Then address these issues for our experiments so we can sanity check that fluence works (as it did before) in Google Cloud (with Pod Group)
  4. When that is working, come back to the issues we discussed today about a pod group strategy

The other thing I noticed (which I didn't mention today, but did put in the Slack chat) is that the experiments we ran for canopie used a pod group with the same name, "lammps", which would mean that all the (different) lammps groups coming in would be treated as a single group. Depending on how the experiments were run (e.g., with only one group of lammps in at a time) that might be OK, but I wanted to double check! Here is the original experiment file. The metadata -> name of the pod group is always lammps, so all experiments that were running lammps at the same time would be considered part of the same group.

Let me know your thoughts!

milroy commented 6 months ago

but I would say for the work here, we should try to separate these questions / issues from the updates that need to go in soon

That's fine with me. In that case, could you revert the changes to the Cancel function? We can continue the discussion about the implementation in an issue and address it in a future PR.

vsoch commented 6 months ago

That's fine with me. In that case, could you revert the changes to the Cancel function?

How do you want me to do that? The updated bindings (not served here, but from flux-sched) return a Go error (so nil on success) rather than the integer code. The logic is the same either way; it's just using the updated bindings and handling a return value of nil/not nil instead of 0/nonzero. Here is a side-by-side comparison of the lines, if it helps:

In the old function "err" was an integer, and in the new one it is nil or not nil. We return early on an error in either case (nonzero or non-nil), so the two are functionally equivalent.

- err := fluxcli.ReapiCliCancel(s.fctx, int64(in.JobID), true)
+ err := s.cli.Cancel(int64(in.JobID), true)
# Then if err is not nil or err is not zero, return early

This call is basically the same, just using the updated client (this is where I put my comment)

- dr := &pb.CancelResponse{JobID: in.JobID, Error: int32(err)}
+ dr := &pb.CancelResponse{JobID: in.JobID}

Ditto

- reserved, at, overhead, mode, fluxerr := fluxcli.ReapiCliInfo(s.fctx, int64(in.JobID))
+ reserved, at, overhead, mode, fluxerr := s.cli.Info(int64(in.JobID))

And those are the main changes. So I think the functionality is the same either way, but if we revert, it won't work with the new bindings. Let me know if I'm missing a detail!
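If it helps, here is the same equivalence written out as a hypothetical before/after sketch (names as in the diffs above; the error construction on the old path is made up for illustration):

// Old bindings (illustrative): integer return code, negative indicates an error.
errCode := fluxcli.ReapiCliCancel(s.fctx, int64(in.JobID), true)
if errCode < 0 {
    return nil, fmt.Errorf("cancel of job %d failed with code %d", in.JobID, errCode)
}

// New bindings (illustrative): idiomatic Go error, nil on success.
err := s.cli.Cancel(int64(in.JobID), true)
if err != nil {
    return nil, err
}
// Either way, execution continues past this point only on success.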

milroy commented 6 months ago

The updated bindings (not served here, but from flux-sched) return a Go error (so nil)

Got it, I missed that detail. Is there any advantage to dr := &pb.CancelResponse{JobID: in.JobID, Error: 0} instead of dr := &pb.CancelResponse{JobID: in.JobID}?

vsoch commented 6 months ago

Got it, I missed that detail. Is there any advantage to dr := &pb.CancelResponse{JobID: in.JobID, Error: 0} instead of dr := &pb.CancelResponse{JobID: in.JobID}?

pb references the protocol buffer that (I think) gets built from here, and it looks like it takes the jobID as the first field and the error as the second:

// The Cancel response message
message CancelResponse {
    int64 jobID = 1;
    int32 error = 2;
}

I think what that line is doing is taking the result we get from flux-sched and serializing it into the CancelResponse proto that we define. Note that the function we call for the error is here - it returns either nil or not nil based on the underlying integer return value.

We could theoretically put a code 0 there, but since we check for nil above (which is what we get when the error code is 0) and return early otherwise, the value is always going to be zero, which is the default for an unset int32, so it's implicitly set. To answer your question, I think the explicit form would just be clearer for the developer to see, but it is functionally the same. If we are going to refactor this, we probably don't need to worry about that detail.
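As a quick self-contained illustration of that default (a stand-in struct, not the generated code):

package main

import "fmt"

// Stand-in for the generated CancelResponse struct (illustration only).
type CancelResponse struct {
    JobID int64
    Error int32
}

func main() {
    // Error is omitted, so it takes Go's zero value for int32.
    dr := &CancelResponse{JobID: 42}
    fmt.Println(dr.Error)      // prints 0
    fmt.Println(dr.Error == 0) // prints true: same as writing Error: 0 explicitly
}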

milroy commented 6 months ago

Edit: the check for != nil means that the CancelResponse error code should be 0 in the lines of code that follow it.

I was mainly concerned with 1) what happens when you don't specify a struct data member for CancelResponse, and 2) what the relationship is between the struct and the message CancelResponse.

the value is always going to be zero, which is the default for an unset int32, so it's implicitly set

Some of my worries were related to that comment, but they were assuaged by the Go spec on "the zero value": https://go.dev/ref/spec#The_zero_value.

We can continue the rest of this discussion in an issue. PR approved.

Excellent work, and thank you @vsoch!