Closed by grondo 4 weeks ago
This issue persists across a fluxion module reload. From the logs:

```
[ +9.399184] sched-fluxion-resource[0]: resource status changed (rankset=[all] status=DOWN)
[ +9.399232] sched-fluxion-resource[0]: resource status changed (rankset=[3-13,16-20,22-39,41-57,59-61,63-72,75,77-88,90-93,96-99] status=UP)
```
rank 100 is never marked up, but Fluxion still thinks it is up. Ideas?
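To see exactly which ranks never transitioned up, the UP rankset from the log can be expanded and compared against the full 0-100 range. This is a quick illustrative sketch with a hand-rolled parser, not the flux bindings' own idset API:

```python
def expand_idset(s):
    # Expand an RFC 22-style idset string like "[3-13,16-20]" into a set of ints.
    ids = set()
    for part in s.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids

# UP rankset as reported in the log above
up = expand_idset("[3-13,16-20,22-39,41-57,59-61,63-72,75,77-88,90-93,96-99]")

# Ranks that never went UP; rank 100 appears here alongside the excluded/down ranks
never_up = sorted(set(range(101)) - up)
print(never_up)
```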
OK, after adding some debugging, Fluxion doesn't seem to include rank 100 in its "all" set of ranks:

```
[ +0.025965] sched-fluxion-resource[0]: decoding rankset all = [0, 99]
```
Fluxion uses `by_rank.size()` to determine the rankset for `all`. However, the scheduler is not presented with excluded ranks, so this is invalid. For example, on fluke:
"by_rank": {
"[3-100]": 5
},
`decode_all()` needs to be updated to read the actual ranks from the graph rather than assuming the ranks are 0 to (size - 1).
This change seems to resolve the issue:

```diff
diff --git a/resource/modules/resource_match.cpp b/resource/modules/resource_match.cpp
index 0d6d1c42..d7b1dd05 100644
--- a/resource/modules/resource_match.cpp
+++ b/resource/modules/resource_match.cpp
@@ -1144,13 +1144,13 @@ done:
 static int decode_all (std::shared_ptr<resource_ctx_t> &ctx,
                       std::set<int64_t> &ranks)
 {
-    int64_t size = ctx->db->metadata.by_rank.size();
-
-    for (int64_t rank = 0; rank < size; ++rank) {
-        auto ret = ranks.insert (rank);
-        if (!ret.second) {
-            errno = EEXIST;
-            return -1;
+    for (auto const& kv: ctx->db->metadata.by_rank) {
+        if (kv.first >= 0 && kv.first < IDSET_INVALID_ID) {
+            auto ret = ranks.insert (kv.first);
+            if (!ret.second) {
+                errno = EEXIST;
+                return -1;
+            }
         }
     }
     return 0;
```
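The effect of the fix can be sketched in a few lines of Python, using a toy stand-in for `ctx->db->metadata.by_rank` that mirrors the fluke example above (ranks 3-100 present in the graph, ranks 0-2 excluded):

```python
# Toy stand-in for ctx->db->metadata.by_rank on fluke:
# the graph contains ranks 3-100 (98 entries); ranks 0-2 are excluded.
by_rank = {rank: 5 for rank in range(3, 101)}

# Old decode_all(): assumes the ranks are 0..size-1.
old_ranks = set(range(len(by_rank)))  # yields 0..97

# Fixed decode_all(): reads the actual ranks from the graph.
new_ranks = set(by_rank)              # yields 3..100

print(sorted(old_ranks - new_ranks))  # excluded ranks wrongly included: [0, 1, 2]
print(sorted(new_ranks - old_ranks))  # real ranks missing from "all": [98, 99, 100]
```

Since rank 100 is missing from the "all" set, it is never marked DOWN at startup, which matches the observed behavior.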
For future reference, Fluxion vertices can be manually marked as down by sending an RPC like so (Python):

```python
import flux

payload = {"resource_path": "/path/to/vertex", "status": "down"}
flux.Flux().rpc("sched-fluxion-resource.set_status", payload)
```
Finding the path to any vertex can be tricky; using the `resource-query` utility is the best way that I know of.
We've seen a couple of instances now where Fluxion loses track of one or more down nodes and allocates a down node to a job, which promptly fails, then allocates the same down node to the next job, which also fails, and so on.
This mismatch can be seen currently on the fluke cluster with the following script:
The host `fluke103` has been 'offline' since Nov 8 2023 (actually I need to double-check that exact date, but the node has definitely not been online since the last Flux restart). Since all resources should be marked down by Fluxion until the resource.acquire protocol tells them they are up, this seems to be due to Fluxion marking resources up that are not included in the resource.acquire protocol. Of course, there is no way to prove that the core resource module did not send this rank in an up idset in a response, so perhaps more investigation is needed.
I'm also unsure why the other down nodes did not have the same result.
We could perhaps do some debugging by reloading Fluxion to see if the problem corrects itself on fluke. If not, then we can enable more debugging, etc.