Closed milroy closed 4 months ago
Yes, we should add a check to the testsuite. I'll figure out if I can make an easy change to existing checks to test for this issue and if not I'll adapt your reproducer.
@grondo I think your reproducer is the best way to test for this problem in the testsuite. While integrating it into the testsuite, I noticed that t/issues/t1035-fluxion-reload.sh
isn't being run becuse it got left out of CMakeLists
: https://github.com/flux-framework/flux-sched/blob/master/t/CMakeLists.txt
Should I re-enable that test in this PR?
The tests under t/issues
should be run by the issues test driver t5100-issues-test-driver.t
. Is that the test that is actually missing?
Edit: but yeah, seems like that should get fixed and seems fine to do that in this PR.
The tests under t/issues should be run by the issues test driver t5100-issues-test-driver.t. Is that the test that is actually missing?
Yes, t5100-issues-test-driver.t
is missing (edit: but I discovered that by thinking I needed to enable issues/t1035-fluxion-reload.sh
). I'll add a commit to this PR to re-enable that check.
I added t1129-match-overflow
based on your reproducer, @grondo. It works as expected in my development environment, but fails the GitHub checks. The output looks identical between local execution and on GitHub. Do you know what might be causing the failures?
Yes, this was my fault, sorry. As you noticed, the CI environment has sched-simple loaded, so at the end of the test we have to unload Fluxion modules and reload sched-simple to avoid an error.
I also noticed that my test invokes an unnecessary flux start
, so try the following patch and see if this works:
diff --git a/t/issues/t1129-match-overflow.sh b/t/issues/t1129-match-overflow.sh
index bc40a960..11314035 100755
--- a/t/issues/t1129-match-overflow.sh
+++ b/t/issues/t1129-match-overflow.sh
@@ -5,11 +5,6 @@
log() { printf "issue#1129: $@\n" >&2; }
-if test -z "$T1129_RELAUNCHED"; then
- export T1129_RELAUNCHED=t
- exec flux start $0
-fi
-
TEST_SIZE=${TEST_SIZE:-900}
log "Unloading modules..."
@@ -41,3 +36,8 @@ log "Running test job."
flux run -vvv -N${TEST_SIZE} -n${TEST_SIZE} \
--setattr=exec.test.run_duration=1ms \
true
+
+log "reloading sched-simple..."
+flux module remove sched-fluxion-qmanager
+flux module remove sched-fluxion-resource
+flux module load sched-simple
Thanks for the fix! I wonder why the previous version worked on my local bookworm
container.
Anyway, if this looks good to you now I'll set MWP.
I wonder why the previous version worked on my local bookworm container.
If you were using a flux-sched container perhaps the fluxion rc3 script is present and removes the Fluxion modules for us. Perhaps the reload of sched-simple isn't actually required. (I'm not sure)
Yes, looks great, thanks! Feel free to set MWP.
Thanks! Setting MWP.
BTW, the test driver script setup is slick and easy to extend.
Merging #1132 (6cff96e) into master (f5f99c5) will increase coverage by
0.0%
. The diff coverage is100.0%
.
Maybe MWP won't merge unless there's an explicit reviewer approval. @grondo can you submit an approval for the PR?
Apologies @milroy, I had meant to approve!
In issue #1129 @grondo reported
EOVERFLOW
errors for match requests for thousands of CPUs and GPUs.Currently,
calc_factor
computes the tie breaking factor with modular arithmetic. The computation returns -1 whenbreak_tie
is divisible bym_multiply_by
(or in other circumstances in modular arithmetic).The negative value of
tie
causesstd::numeric_limits<int64_t>::max () - tie
to overflow in the following integer check, generating a spuriousEOVERFLOW
errno: https://github.com/flux-framework/flux-sched/blob/3d953e34c8da698276c1ad8e17c14173c36580c5/resource/policies/dfu_match_multilevel_id_impl.hpp#L60This PR fixes the computation to ensure
tie
is strictly positive by usingabs()
.