Closed vsoch closed 6 months ago
I looked at the PR but haven't officially reviewed it because the item @vsoch identified in the last commit requires some thought.
I didn't understand it because we would return from the function is the error is less than 0 (indicating an error).
That is my interpretation, too. Note that the CancelResponse
struct defined here and message defined here require an error data member. Perhaps a partially-defined CancelResponse
implies default values for unspecified member data? I'm not familiar with gRPC Protocol Buffers.
So then wouldn't the error value passed not be reflective of the job itself, but of the error of cancelling it?
There's one more layer of complexity, I think. s.cli.Cancel
can return an error, but the actual cancel graph traversal can return a different error (evidenced by this call: https://github.com/flux-framework/flux-k8s/blob/93b17961473d695e0f0f48e5734fd53113267437/scheduler-plugin/fluence/fluxion/fluxion.go#L62)
On a related note, the other issue I see is that Cancel
is also invoking Info
. I think that was originally for debugging, but it's not expected behavior in Fluxion.
@cmisale can you weigh in here?
@milroy thanks for the preliminary review! I will also await @cmisale feedback, but I would say for the work here, we should try to separate these questions / issues from the updates that need to go in soon, which predominantly are getting automated builds and testing in, and fluence back into a working state. My suggested order of steps are:
The other thing I noticed (that I didn't mention today, but I did put in slack chat) is that the experiments we ran for canopie used a pod group with the same name - "lammps" - which would mean that all the (different) lammps groups coming in would be treated in the same group. Depending on how it was run (e.g., with only one group of lammps in at a time) that might be OK, but I wanted to double check! Here is the original experiment file. The metadata-> name of the pod group is always lammps, so all experiments that were running lammps at the same time would be considered part of the same group.
Let me know your thoughts!
but I would say for the work here, we should try to separate these questions / issues from the updates that need to go in soon
That's fine with me. In that case, could you revert the changes to the Cancel
function? We can continue the discussion about the implementation in an issue and address it in a future PR.
That's fine with me. In that case, could you revert the changes to the Cancel function?
How do you want me to do that? The updated bindings (not served here, but from flux-sched) return a Go error (so nil) and not the integer code. The logic is the same either way, it's just using updated bindings and handling a return value of nil/not nil vs 0/not zero. Here is a side by side comparison of lines if it helps:
The old function "err" was an integer, and the new one is nil or not nil. We return an error in either case (not zero or not nil) so that is functionally equivalent.
- err := fluxcli.ReapiCliCancel(s.fctx, int64(in.JobID), true)
+ err := s.cli.Cancel(int64(in.JobID), true)
# Then if err is not nil or err is not zero, return early
This call is basically the same, just using the updated client (this is where I put my comment)
- dr := &pb.CancelResponse{JobID: in.JobID, Error: int32(err)}
+ dr := &pb.CancelResponse{JobID: in.JobID}
Ditto
- reserved, at, overhead, mode, fluxerr := fluxcli.ReapiCliInfo(s.fctx, int64(in.JobID))
+ reserved, at, overhead, mode, fluxerr := s.cli.Info(int64(in.JobID))
And those are the main changes. So I think the functionality is the same either way, but if we revert back it won't work with the new bindings. Let me know if I'm missing a detail!
The updated bindings (not served here, but from flux-sched) return a Go error (so nil)
Got it, I missed that detail. Is there any advantage to dr := &pb.CancelResponse{JobID: in.JobID, Error: 0}
instead of dr := &pb.CancelResponse{JobID: in.JobID}
?
Got it, I missed that detail. Is there any advantage to dr := &pb.CancelResponse{JobID: in.JobID, Error: 0} instead of dr := &pb.CancelResponse{JobID: in.JobID}?
pb is referencing the protocol buffer that (I think) gets built from here and it looks like it takes the jobid as first argument, error as second:
// The Match response message
message CancelResponse {
int64 jobID = 1;
int32 error = 2;
}
I think what that command is doing is taking the result we get from flux-sched, and then just serializing it into the CancelResponse proto that we define. Note that the function we call for the error is here - it's either nil or not nil based on the actual integer return value.
We could theoretically put a code 0
there, but since we check for nil
above (which would happen if the error code is 0) and return if not, the value is always going to be zero, which is the default for an int32 like that, so it's (unstated) but set. To answer your question, I think the latter would just be more explicit for the developer user to see but functionally the same. If we are going to refactor this we probably don't need to worry about that detail.
Edit: the check for != nil
means that that the CancelResponse
error code should be 0
in the following lines of code.
I was mainly concerned with 1) what happens when you don't specify a struct
data member for CancelResponse
, and 2) what is the relationship between the struct
and message
CancelResponse
.
the value is always going to be zero, which is the default for an int32 like that, so it's (unstated) but set
Some of my worries were related to that comment, but assuaged by the Golang spec on "the zero value": https://go.dev/ref/spec#The_zero_value.
We can continue the rest of this discussion in an issue. PR approved.
Excellent work, and thank you @vsoch!
This is the first part of a two part refactor that will update the current main branch with enough changes to be in the "same" state, but with updated underlying libraries and building from kubernetes-sigs/scheduler-plugins.
Summary
Specifically:
Notes and Questions
My one question for the last commit is about the error code we pass here:
I didn't understand it because we would return from the function is the error is less than 0 (indicating an error). So then wouldn't the error value passed not be reflective of the job itself, but of the error of cancelling it? If we want the error code there to be for the job (why the job errored) we would need to find that somewhere. if it's for cancel, then it's always going to be 0 (and we can explicitly set it or leave it out). I just wanted to double check that I understand the logic.
Note that I did the above carefully pinning flux-sched 0.31.0 (most recent release) both for the build and module import. I think we will likely want to update this when we can get SHARED instead of STATIC (and remove the sed statements in the Dockerfile.
Finally, the next step (after this work) and aside from the issue with SHARED libraries above is to add the automated builds/testing that happen regularly, and push of tags that correspond to that. I wanted to do it in two pieces because this PR is already too big!
Ping @cmisale @milroy