flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

remove: Final .free RPC failed to remove all resources for jobid 1395310723072: Success #1301

Closed: garlick closed this issue 1 month ago

garlick commented 2 months ago

Problem: spurious error?

I'm getting this error on flux-sched-0.38.0-7-g2bd4253d whenever I run something that uses cores on more than one node, but not whole nodes:

Sep 26 11:56:14.362759 PDT sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 3453455695872: Success
Sep 26 11:56:14.363202 PDT sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=3453455695872): Protocol error

For example, on a 2-node allocation with 4 cores per node, I can run on 1, 2, 3, 4, or 8 cores with no error, but 5, 6, or 7 cores generate that error.

FWIW:

$ flux module trace sched-fluxion-qmanager
2024-09-26T12:03:57.006 sched-fluxion-qmanager rx > sched.alloc [462]
2024-09-26T12:03:57.007 sched-fluxion-qmanager tx > sched-fluxion-resource.match_multi [678]
2024-09-26T12:03:57.009 sched-fluxion-qmanager rx < sched-fluxion-resource.match_multi [370]
2024-09-26T12:03:57.009 sched-fluxion-qmanager rx < sched-fluxion-resource.match_multi [0]
2024-09-26T12:03:57.009 sched-fluxion-qmanager tx > kvs.commit [450]
2024-09-26T12:03:57.012 sched-fluxion-qmanager rx < kvs.commit [72]
2024-09-26T12:03:57.012 sched-fluxion-qmanager tx < sched.alloc [294]
2024-09-26T12:03:57.173 sched-fluxion-qmanager rx > sched.free [254]
2024-09-26T12:03:57.173 sched-fluxion-qmanager tx > sched-fluxion-resource.partial-cancel [286]
2024-09-26T12:03:57.173 sched-fluxion-qmanager rx < sched-fluxion-resource.partial-cancel [19]
2024-09-26T12:03:57.173 sched-fluxion-qmanager tx > log.append [155]
2024-09-26T12:03:57.174 sched-fluxion-qmanager tx > sched-fluxion-resource.cancel [24]
2024-09-26T12:03:57.174 sched-fluxion-qmanager rx < sched-fluxion-resource.cancel [3]
2024-09-26T12:03:57.174 sched-fluxion-qmanager tx > log.append [143]

The resources are actually freed though, and I can reallocate them.

garlick commented 1 month ago

easy reproducer:

$ flux start -s2 sh -c 'flux run -n $(($(flux resource list -no {ncores})-1)) true'
Oct 03 14:40:01.444840 PDT sched-fluxion-qmanager.err[0]: remove: Final .free RPC failed to remove all resources for jobid 12733906944: Success
Oct 03 14:40:01.444937 PDT sched-fluxion-qmanager.err[0]: jobmanager_free_cb: remove (queue=default id=12733906944): Protocol error
grondo commented 1 month ago

I'm seeing this on my cluster as well. In my case I'm allocating full nodes, but my nodes have different numbers of cores. The trigger for this bug seems to be multiple entries in the R_lite array, i.e. at least two ranks with different core idsets assigned. For example, this R

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      },
      {
        "rank": "4",
        "children": {
          "core": "0-7"
        }
      }
    ],
    "nodelist": [
      "pi[1-2,4]"
    ],
    "properties": {
      "cm4": "2-3",
      "rk1": "4"
    },
    "starttime": 1727999001,
    "expiration": 1728002601
  }
}

showed the error, while this R

{
  "version": 1,
  "execution": {
    "R_lite": [
      {
        "rank": "2-3",
        "children": {
          "core": "0-3"
        }
      }
    ],
    "nodelist": [
      "pi[1-2]"
    ],
    "properties": {
      "cm4": "2-3"
    },
    "starttime": 1727998990,
    "expiration": 1728002590
  }
}

did not.
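To make the trigger concrete: each R_lite element names its own set of execution ranks, so a partial cancel has to decode the "rank" idset of every element. A minimal sketch of that per-element walk, using jansson and flux-core's libidset (the helper name collect_ranks is illustrative, not the actual Fluxion code):

#include <set>
#include <jansson.h>
#include <flux/idset.h>

// Walk every R_lite element and record each execution rank it names.
// Each element has the form {"rank": "<idset>", "children": {...}}, so
// the rank string must be decoded per element, not once after the loop.
static int collect_ranks (json_t *r_lite, std::set<unsigned int> &ranks_out)
{
    size_t index;
    json_t *entry;

    json_array_foreach (r_lite, index, entry) {
        const char *ranks = NULL;
        struct idset *ids = NULL;

        if (json_unpack (entry, "{s:s}", "rank", &ranks) < 0)
            return -1;
        if (!(ids = idset_decode (ranks)))
            return -1;
        for (unsigned int rank = idset_first (ids);
             rank != IDSET_INVALID_ID;
             rank = idset_next (ids, rank))
            ranks_out.insert (rank);
        idset_destroy (ids);
    }
    return 0;
}

If the decode instead happens only once, after the loop, only the last element's ranks are seen. With a single-element R_lite that still gives the right answer, which would explain why the second R above does not reproduce the error.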

grondo commented 1 month ago

This seems to fix the error for me:

diff --git a/resource/readers/resource_reader_rv1exec.cpp b/resource/readers/resource_reader_rv1exec.cpp
index d630d239..9ed64626 100644
--- a/resource/readers/resource_reader_rv1exec.cpp
+++ b/resource/readers/resource_reader_rv1exec.cpp
@@ -961,15 +961,15 @@ int resource_reader_rv1exec_t::partial_cancel_internal (resource_graph_t &g,
             errno = EINVAL;
             goto error;
         }
+        if (!(r_ids = idset_decode (ranks)))
+            goto error;
+        rank = idset_first (r_ids);
+        while (rank != IDSET_INVALID_ID) {
+            mod_data.ranks_removed.insert (rank);
+            rank = idset_next (r_ids, rank);
+        }
+        idset_destroy (r_ids);
     }
-    if (!(r_ids = idset_decode (ranks)))
-        goto error;
-    rank = idset_first (r_ids);
-    while (rank != IDSET_INVALID_ID) {
-        mod_data.ranks_removed.insert (rank);
-        rank = idset_next (r_ids, rank);
-    }
-    idset_destroy (r_ids);

Though there's probably a much more efficient way to do this (e.g., build the complete idset first, then insert into ranks_removed).
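A rework along those lines might look roughly like the sketch below: accumulate the ranks from every R_lite entry into one idset inside the loop, then fill mod_data.ranks_removed in a single pass afterwards. This is a sketch against flux-core's libidset, not a tested patch; build_ranks_removed and rank_strings are illustrative stand-ins for the surrounding Fluxion code.

#include <set>
#include <vector>
#include <flux/idset.h>

// Build one idset covering the ranks of every R_lite entry, then insert
// its members into the removal set in a single pass.
static int build_ranks_removed (const std::vector<const char *> &rank_strings,
                                std::set<unsigned int> &ranks_removed)
{
    struct idset *all = idset_create (0, IDSET_FLAG_AUTOGROW);
    if (!all)
        return -1;
    for (const char *s : rank_strings) {
        struct idset *ids = idset_decode (s);
        if (!ids) {
            idset_destroy (all);
            return -1;
        }
        // Fold this entry's ranks into the accumulator.
        for (unsigned int id = idset_first (ids);
             id != IDSET_INVALID_ID;
             id = idset_next (ids, id)) {
            if (idset_set (all, id) < 0) {
                idset_destroy (ids);
                idset_destroy (all);
                return -1;
            }
        }
        idset_destroy (ids);
    }
    for (unsigned int id = idset_first (all);
         id != IDSET_INVALID_ID;
         id = idset_next (all, id))
        ranks_removed.insert (id);
    idset_destroy (all);
    return 0;
}

Either way, the important part is that every R_lite entry's ranks end up in ranks_removed, not just the last entry's.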