Closed by grondo 4 weeks ago
This issue persists across a fluxion module reload. From the logs:

```
[ +9.399184] sched-fluxion-resource[0]: resource status changed (rankset=[all] status=DOWN)
[ +9.399232] sched-fluxion-resource[0]: resource status changed (rankset=[3-13,16-20,22-39,41-57,59-61,63-72,75,77-88,90-93,96-99] status=UP)
```
rank 100 is never marked up, but Fluxion still thinks it is up. Ideas?
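To see exactly which ranks never transitioned up, the UP rankset from the log can be expanded and compared against the full 0-100 range. This is a quick illustrative sketch with a hand-rolled parser, not the flux bindings' own idset API:

```python
def expand_idset(s):
    # Expand an RFC 22-style idset string like "[3-13,16-20]" into a set of ints.
    ids = set()
    for part in s.strip("[]").split(","):
        lo, _, hi = part.partition("-")
        ids.update(range(int(lo), int(hi or lo) + 1))
    return ids

# UP rankset as reported in the log above
up = expand_idset("[3-13,16-20,22-39,41-57,59-61,63-72,75,77-88,90-93,96-99]")

# Ranks that never went UP; rank 100 appears here alongside the excluded/down ranks
never_up = sorted(set(range(101)) - up)
print(never_up)
```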
OK, after adding some debugging, Fluxion doesn't seem to include rank 100 in its "all" set of ranks:

```
[ +0.025965] sched-fluxion-resource[0]: decoding rankset all = [0, 99]
```
Fluxion uses `by_rank.size()` to determine the rankset for `all`. However, the scheduler is not presented with excluded ranks, so this is invalid. For example, on fluke:
"by_rank": {
"[3-100]": 5
},
`decode_all()` needs to be updated to read the actual ranks from the graph rather than assuming the ranks are 0 to (size - 1).
This change seems to resolve the issue:

```diff
diff --git a/resource/modules/resource_match.cpp b/resource/modules/resource_match.cpp
index 0d6d1c42..d7b1dd05 100644
--- a/resource/modules/resource_match.cpp
+++ b/resource/modules/resource_match.cpp
@@ -1144,13 +1144,13 @@ done:
 static int decode_all (std::shared_ptr<resource_ctx_t> &ctx,
                       std::set<int64_t> &ranks)
 {
-    int64_t size = ctx->db->metadata.by_rank.size();
-
-    for (int64_t rank = 0; rank < size; ++rank) {
-        auto ret = ranks.insert (rank);
-        if (!ret.second) {
-            errno = EEXIST;
-            return -1;
+    for (auto const& kv: ctx->db->metadata.by_rank) {
+        if (kv.first >= 0 && kv.first < IDSET_INVALID_ID) {
+            auto ret = ranks.insert (kv.first);
+            if (!ret.second) {
+                errno = EEXIST;
+                return -1;
+            }
         }
     }
     return 0;
```
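The effect of the fix can be sketched in a few lines of Python, using a toy stand-in for `ctx->db->metadata.by_rank` that mirrors the fluke example above (ranks 3-100 present in the graph, ranks 0-2 excluded):

```python
# Toy stand-in for ctx->db->metadata.by_rank on fluke:
# the graph contains ranks 3-100 (98 entries); ranks 0-2 are excluded.
by_rank = {rank: 5 for rank in range(3, 101)}

# Old decode_all(): assumes the ranks are 0..size-1.
old_ranks = set(range(len(by_rank)))  # yields 0..97

# Fixed decode_all(): reads the actual ranks from the graph.
new_ranks = set(by_rank)              # yields 3..100

print(sorted(old_ranks - new_ranks))  # excluded ranks wrongly included: [0, 1, 2]
print(sorted(new_ranks - old_ranks))  # real ranks missing from "all": [98, 99, 100]
```

Since rank 100 is missing from the "all" set, it is never marked DOWN at startup, which matches the observed behavior.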
For future reference, Fluxion vertices can be manually marked as down by sending an RPC like so (Python):

```python
import flux

payload = {"resource_path": "/path/to/vertex", "status": "down"}
flux.Flux().rpc("sched-fluxion-resource.set_status", payload)
```
Finding the path to any vertex can be tricky; using the `resource-query` utility is the best way that I know of.
We've seen a couple of instances now where Fluxion loses track of one or more down nodes and allocates a down node to a job, which promptly fails, then allocates the same down node to the next job, which also fails, and so on.
This mismatch can be seen currently on the fluke cluster with the following script:
The host `fluke103` has been 'offline' since Nov 8 2023 (actually I need to double-check that exact date, but the node has definitely not been online since the last Flux restart). Since all resources should be marked down by Fluxion until the resource.acquire protocol tells them they are up, this seems to be due to Fluxion marking resources up that are not included in the resource.acquire protocol. Of course, there is no way to prove that the core resource module did not send this rank in an up idset in a response, so perhaps more investigation is needed.
I'm also unsure why the other down nodes did not have the same result.
We could perhaps do some debugging by reloading Fluxion to see if the problem corrects itself on fluke. If not, then we can enable more debugging, etc.