flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0
84 stars 39 forks source link

Fix overflow in `calc_factor` #1132

Closed milroy closed 4 months ago

milroy commented 4 months ago

In issue #1129 @grondo reported EOVERFLOW errors for match requests for thousands of CPUs and GPUs.

Currently, calc_factor computes the tie breaking factor with modular arithmetic. The computation returns -1 when break_tie is divisible by m_multiply_by (or in other circumstances in modular arithmetic).

The negative value of tie causes std::numeric_limits<int64_t>::max () - tie to overflow in the following integer check, generating a spurious EOVERFLOW errno: https://github.com/flux-framework/flux-sched/blob/3d953e34c8da698276c1ad8e17c14173c36580c5/resource/policies/dfu_match_multilevel_id_impl.hpp#L60

This PR fixes the computation to ensure tie is strictly positive by using abs().

milroy commented 4 months ago

Yes, we should add a check to the testsuite. I'll figure out if I can make an easy change to existing checks to test for this issue and if not I'll adapt your reproducer.

milroy commented 4 months ago

@grondo I think your reproducer is the best way to test for this problem in the testsuite. While integrating it into the testsuite, I noticed that t/issues/t1035-fluxion-reload.sh isn't being run becuse it got left out of CMakeLists: https://github.com/flux-framework/flux-sched/blob/master/t/CMakeLists.txt

Should I re-enable that test in this PR?

grondo commented 4 months ago

The tests under t/issues should be run by the issues test driver t5100-issues-test-driver.t. Is that the test that is actually missing?

Edit: but yeah, seems like that should get fixed and seems fine to do that in this PR.

milroy commented 4 months ago

The tests under t/issues should be run by the issues test driver t5100-issues-test-driver.t. Is that the test that is actually missing?

Yes, t5100-issues-test-driver.t is missing (edit: but I discovered that by thinking I needed to enable issues/t1035-fluxion-reload.sh). I'll add a commit to this PR to re-enable that check.

milroy commented 4 months ago

I added t1129-match-overflow based on your reproducer, @grondo. It works as expected in my development environment, but fails the GitHub checks. The output looks identical between local execution and on GitHub. Do you know what might be causing the failures?

grondo commented 4 months ago

Yes, this was my fault, sorry. As you noticed, the CI environment has sched-simple loaded, so at the end of the test we have to unload Fluxion modules and reload sched-simple to avoid an error.

I also noticed that my test invokes an unnecessary flux start, so try the following patch and see if this works:

diff --git a/t/issues/t1129-match-overflow.sh b/t/issues/t1129-match-overflow.sh
index bc40a960..11314035 100755
--- a/t/issues/t1129-match-overflow.sh
+++ b/t/issues/t1129-match-overflow.sh
@@ -5,11 +5,6 @@

 log() { printf "issue#1129: $@\n" >&2; }

-if test -z "$T1129_RELAUNCHED"; then
-       export T1129_RELAUNCHED=t
-       exec flux start $0
-fi
-
 TEST_SIZE=${TEST_SIZE:-900}

 log "Unloading modules..."
@@ -41,3 +36,8 @@ log "Running test job."
 flux run -vvv -N${TEST_SIZE} -n${TEST_SIZE} \
        --setattr=exec.test.run_duration=1ms \
        true
+
+log "reloading sched-simple..."
+flux module remove sched-fluxion-qmanager
+flux module remove sched-fluxion-resource
+flux module load sched-simple
milroy commented 4 months ago

Thanks for the fix! I wonder why the previous version worked on my local bookworm container.

Anyway, if this looks good to you now I'll set MWP.

grondo commented 4 months ago

I wonder why the previous version worked on my local bookworm container.

If you were using a flux-sched container perhaps the fluxion rc3 script is present and removes the Fluxion modules for us. Perhaps the reload of sched-simple isn't actually required. (I'm not sure)

grondo commented 4 months ago

Yes, looks great, thanks! Feel free to set MWP.

milroy commented 4 months ago

Thanks! Setting MWP.

BTW, the test driver script setup is slick and easy to extend.

codecov[bot] commented 4 months ago

Codecov Report

Merging #1132 (6cff96e) into master (f5f99c5) will increase coverage by 0.0%. The diff coverage is 100.0%.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #1132 +/- ## ====================================== Coverage 70.5% 70.6% ====================================== Files 94 94 Lines 12723 12723 ====================================== + Hits 8982 8984 +2 + Misses 3741 3739 -2 ``` | [Files](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1132?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework) | Coverage Δ | | |---|---|---| | [resource/policies/dfu\_match\_multilevel\_id\_impl.hpp](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1132?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework#diff-cmVzb3VyY2UvcG9saWNpZXMvZGZ1X21hdGNoX211bHRpbGV2ZWxfaWRfaW1wbC5ocHA=) | `82.3% <100.0%> (ø)` | | ... and [1 file with indirect coverage changes](https://app.codecov.io/gh/flux-framework/flux-sched/pull/1132/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=flux-framework)
milroy commented 4 months ago

Maybe MWP won't merge unless there's an explicit reviewer approval. @grondo can you submit an approval for the PR?

grondo commented 4 months ago

Apologies @milroy, I had meant to approve!