Closed garlick closed 1 month ago
easy reproducer:
$ flux start -s2 sh -c 'flux run -n $(($(flux resource list -no {ncores})-1)) true'
Oct 03 14:40:01.444840 PDT sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 12733906944: Success
Oct 03 14:40:01.444937 PDT sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=12733906944): Protocol error
I'm seeing this on my cluster as well. In my case I'm allocating full nodes, but my nodes have different numbers of cores. The trigger for this bug seems to be when there are multiple entries in the R_lite
array, i.e. there are at least two ranks that have different core idsets assigned. E.g.
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      },
      {
        "rank": "4",
        "children": {
          "core": "0-7"
        }
      }
    ],
    "nodelist": [
      "pi[1-2,4]"
    ],
    "properties": {
      "cm4": "2-3",
      "rk1": "4"
    },
    "starttime": 1727999001,
    "expiration": 1728002601
  }
}
shows the error, while
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      }
    ],
    "nodelist": [
      "pi[1-2]"
    ],
    "properties": {
      "cm4": "2-3"
    },
    "starttime": 1727998990,
    "expiration": 1728002590
  }
}
does not.
This seems to fix the error for me:
diff --git a/resource/readers/resource_reader_rv1exec.cpp b/resource/readers/resource_reader_rv1exec.cpp
index d630d239..9ed64626 100644
--- a/resource/readers/resource_reader_rv1exec.cpp
+++ b/resource/readers/resource_reader_rv1exec.cpp
@@ -961,15 +961,15 @@ int resource_reader_rv1exec_t::partial_cancel_internal (resource_graph_t &g,
errno = EINVAL;
goto error;
}
+ if (!(r_ids = idset_decode (ranks)))
+ goto error;
+ rank = idset_first (r_ids);
+ while (rank != IDSET_INVALID_ID) {
+ mod_data.ranks_removed.insert (rank);
+ rank = idset_next (r_ids, rank);
+ }
+ idset_destroy (r_ids);
}
- if (!(r_ids = idset_decode (ranks)))
- goto error;
- rank = idset_first (r_ids);
- while (rank != IDSET_INVALID_ID) {
- mod_data.ranks_removed.insert (rank);
- rank = idset_next (r_ids, rank);
- }
- idset_destroy (r_ids);
Though there's probably a much more efficient way to do this (e.g. build the idset, then insert into ranks_removed).
Problem: spurious error?
I'm getting this error on
flux-sched-0.38.0-7-g2bd4253d
whenever I run something that uses cores on more than one node, but not whole nodes. For example, on a 2-node allocation with 4 cores per node, I can run on 1, 2, 3, 4, or 8 cores with no error, but 5, 6, or 7 cores generate that error.
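That pattern is consistent with the multiple-entry trigger described above: with 4 cores per node, a 6-core job has to use different core idsets on the two ranks, which encode as two R_lite entries, roughly like this (abridged; ranks and hostnames are hypothetical):

```
{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "0",
        "children": { "core": "0-3" }
      },
      {
        "rank": "1",
        "children": { "core": "0-1" }
      }
    ],
    "nodelist": [ "node[0-1]" ]
  }
}
```

A 1-4 or 8-core run, by contrast, uses identical core idsets on every assigned rank, which collapse into a single R_lite entry, so the bug is not triggered.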
FWIW:
The resources are actually freed though, and I can reallocate them.