flux-framework / flux-sched

Fluxion Graph-based Scheduler
GNU Lesser General Public License v3.0

`t8001` fails in CI, bracket mismatch? #1165

Open wihobbs opened 1 month ago

wihobbs commented 1 month ago

It looks like there's a mismatch in t8001-util-ion-R.t:

not ok 6 - fluxion-R: encoding properties on heterogeneity works
#
#       cat <<-EOF >expected6 &&
#       /cluster0 -1 {}
#       /cluster0/foo2 0 {"arm-v9@core":""}
#       /cluster0/foo2/core0 0 {}
#       /cluster0/foo2/core1 0 {}
#       /cluster0/foo2/gpu0 0 {}
#       /cluster0/foo2/gpu1 0 {}
#       /cluster0/foo3 2 {"arm-v9@core":"","amd-mi60@gpu":""}
#       /cluster0/foo3/core0 2 {}
#       /cluster0/foo3/core1 2 {}
#       /cluster0/foo3/gpu0 2 {}
#       /cluster0/foo3/gpu1 2 {}
#       /cluster0/foo1 3 {"arm-v9@core":"","amd-mi60@gpu":""}
#       /cluster0/foo1/core0 3 {}
#       /cluster0/foo1/core1 3 {}
#       /cluster0/foo1/gpu0 3 {}
#       /cluster0/foo1/gpu1 3 {}
#       /cluster0/foo4 1 {"arm-v8@core":""}
#       /cluster0/foo4/core0 1 {}
#       EOF
#       flux R encode -r 0 -c 0-1 -g 0-1 -p "arm-v9@core:0" -H foo2 > out6 &&
#       flux R encode -r 1 -c 0 -H foo3 -p "arm-v8@core:1" >> out6 &&
#       flux R encode -r 2-3 -c 0-1 -g 0-1 -p "arm-v9@core:2-3" \
#       -p "amd-mi60@gpu:2-3" -H foo[1,4] >> out6 &&
#       cat out6 | flux R append > combined6.json &&
#       cat combined6.json | flux ion-R encode > augmented6.json &&
#       jq .scheduling augmented6.json > jgf6.json &&
#       print_schema2 jgf6.json paths6 &&
#       test_cmp expected6 paths6
#

# failed 1 among 6 test(s)

The diff shows that the actual output has JSON arrays where the test expects JSON objects, which is odd:

(s=33,d=0) fluxci@tioga10 /usr/WS1/fluxci/cibuilds/399712_tioga/flux-sched (master)$ diff trash-directory.t8001-util-ion-R/paths6 trash-directory.t8001-util-ion-R/expected6
1,18c1,18
< /cluster0 -1 []
< /cluster0/foo2 0 ["arm-v9@core"]
< /cluster0/foo2/core0 0 []
< /cluster0/foo2/core1 0 []
< /cluster0/foo2/gpu0 0 []
< /cluster0/foo2/gpu1 0 []
< /cluster0/foo3 2 ["arm-v9@core","amd-mi60@gpu"]
< /cluster0/foo3/core0 2 []
< /cluster0/foo3/core1 2 []
< /cluster0/foo3/gpu0 2 []
< /cluster0/foo3/gpu1 2 []
< /cluster0/foo1 3 ["arm-v9@core","amd-mi60@gpu"]
< /cluster0/foo1/core0 3 []
< /cluster0/foo1/core1 3 []
< /cluster0/foo1/gpu0 3 []
< /cluster0/foo1/gpu1 3 []
< /cluster0/foo4 1 ["arm-v8@core"]
< /cluster0/foo4/core0 1 []
---
> /cluster0 -1 {}
> /cluster0/foo2 0 {"arm-v9@core":""}
> /cluster0/foo2/core0 0 {}
> /cluster0/foo2/core1 0 {}
> /cluster0/foo2/gpu0 0 {}
> /cluster0/foo2/gpu1 0 {}
> /cluster0/foo3 2 {"arm-v9@core":"","amd-mi60@gpu":""}
> /cluster0/foo3/core0 2 {}
> /cluster0/foo3/core1 2 {}
> /cluster0/foo3/gpu0 2 {}
> /cluster0/foo3/gpu1 2 {}
> /cluster0/foo1 3 {"arm-v9@core":"","amd-mi60@gpu":""}
> /cluster0/foo1/core0 3 {}
> /cluster0/foo1/core1 3 {}
> /cluster0/foo1/gpu0 3 {}
> /cluster0/foo1/gpu1 3 {}
> /cluster0/foo4 1 {"arm-v8@core":""}
> /cluster0/foo4/core0 1 {}
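
Comparing the two sides, the actual output lists property names in an array, while the expected output maps each name to an empty string in an object. The names themselves match. A minimal sketch of that relationship follows; the helper `properties_as_object` is hypothetical and is not code from flux-sched.

```python
import json


def properties_as_object(props):
    """Convert the array form ["a", "b"] to the object form {"a": "", "b": ""}.

    Hypothetical helper for illustration only. A dict is returned as a copy.
    """
    if isinstance(props, dict):
        return dict(props)
    return {name: "" for name in props}


# One line from each side of the diff (the properties of /cluster0/foo3).
actual = json.loads('["arm-v9@core","amd-mi60@gpu"]')
expected = json.loads('{"arm-v9@core":"","amd-mi60@gpu":""}')

# The containers differ, so a textual comparison such as test_cmp fails...
print(actual == expected)
# ...but the set of property names is the same on both sides.
print(properties_as_object(actual) == expected)
```

This suggests the test and the code producing the output disagree about the property format, rather than about which properties each node has.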

jameshcorbett commented 1 month ago

I'm not sure what's going on here but the relevant code was changed in https://github.com/flux-framework/flux-sched/pull/1149

jameshcorbett commented 1 month ago

I wonder if somehow an older version of the Python FluxionResourceGraphV1 class is being picked up? Like maybe there's another version of its module in sys.path for some reason?
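
One way to check this guess is to ask Python which file it would load a module from, and to print `sys.path` in search order. This is a minimal sketch: the module name `fluxion.resourcegraph.V1` is an assumption about where `FluxionResourceGraphV1` lives and should be replaced if it is defined elsewhere.

```python
import importlib.util
import sys


def module_origin(name):
    """Return the file a module would be imported from, or None if not found."""
    try:
        spec = importlib.util.find_spec(name)
    except (ImportError, ValueError):
        # A missing parent package raises ModuleNotFoundError (an ImportError).
        return None
    return None if spec is None else spec.origin


# Assumed module name; adjust to the module that defines the class.
name = "fluxion.resourcegraph.V1"
print(name, "->", module_origin(name))

# Earlier entries win, so the position of the builddir matters.
for index, entry in enumerate(sys.path):
    print(index, entry)
```

Running this with the same environment the failing test uses should show whether the builddir copy or an installed copy is found first.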

wihobbs commented 1 month ago

These are the others that fail FYI:

26:t1018-rv1-bootstrap2.t
61:t3027-resource-RV.t
71:t3301-system-latestart.t
89:t8001-util-ion-R.t

It's definitely that change that made it break on LC, but I wonder why. I'll check the sys.path, but could it have to do with our version of jq (jq-1.6)? Although I'd expect more failures if that were the case...

wihobbs commented 1 month ago

If you want to poke at these failing tests, you can xsu fluxci and see the logs, along with the binary they're running under:

cd /usr/WS1/fluxci/cibuilds/399712_tioga/flux-sched/
ctest -j16 --rerun-failed --output-on-failure

grondo commented 1 month ago

> I wonder if somehow an older version of the Python FluxionResourceGraphV1 class is being picked up? Like maybe there's another version of its module in sys.path for some reason?

That's a good guess, since the same tests do not fail in GitHub CI. I wonder if the tests are appending instead of prepending the path to the builddir Fluxion Python modules. The CI @wihobbs is talking about here is the GitLab CI, which runs on a system with flux-sched RPMs installed.
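
The difference between appending and prepending can be reproduced in isolation. The sketch below is not the flux-sched test setup: it creates two temporary directories that stand in for an installed copy and a builddir copy of the same module (the module name `demo_graph` is made up), then shows which copy Python imports in each case.

```python
import importlib
import os
import sys
import tempfile


def write_module(directory, version):
    """Write a tiny demo_graph.py that records which copy it is."""
    with open(os.path.join(directory, "demo_graph.py"), "w") as handle:
        handle.write(f"VERSION = {version!r}\n")


def version_seen(installed, builddir, prepend):
    """Import demo_graph with builddir prepended or appended to sys.path."""
    saved = list(sys.path)
    sys.modules.pop("demo_graph", None)
    try:
        # The installed copy is already on the path, as on a system with RPMs.
        sys.path.insert(0, installed)
        if prepend:
            sys.path.insert(0, builddir)
        else:
            sys.path.append(builddir)
        importlib.invalidate_caches()
        return importlib.import_module("demo_graph").VERSION
    finally:
        # Restore global state so the demo has no lasting side effects.
        sys.path[:] = saved
        sys.modules.pop("demo_graph", None)


def demo():
    with tempfile.TemporaryDirectory() as installed, \
            tempfile.TemporaryDirectory() as builddir:
        write_module(installed, "installed")
        write_module(builddir, "builddir")
        appended = version_seen(installed, builddir, prepend=False)
        prepended = version_seen(installed, builddir, prepend=True)
        return appended, prepended


print(demo())
```

When the builddir is appended, the installed copy is found first; when it is prepended, the builddir copy wins. That would match a failure that appears only on a system with installed packages.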