chaos / powerman

cluster power control
GNU General Public License v2.0

powerman 2.4.4 post-build test failure in Debian risc-v building #206

Closed hosiet closed 1 month ago

hosiet commented 1 month ago

Looking at the following pages:

I am not sure why this test fails consistently on riscv. Related logs:

ERROR: t0039-llnl-el-capitan-cluster
====================================

ok 1 - create powerman.conf for El Cap
PASS: t0039-llnl-el-capitan-cluster.t 1 - create powerman.conf for El Cap
not ok 2 - start powerman daemon and wait for it to start
FAIL: t0039-llnl-el-capitan-cluster.t 2 - start powerman daemon and wait for it to start
#   
#       $powermand -Y -c powerman.conf &
#       echo $! >powermand.pid &&
#       $powerman --retry-connect=100 --server-host=$testaddr -d >device.out
#   
not ok 3 - powerman -q shows all off
FAIL: t0039-llnl-el-capitan-cluster.t 3 - powerman -q shows all off
#   
#       $powerman -h $testaddr -q >query.out &&
#       makeoutput "" "$ALLSTR" "" >query.exp &&
#       test_cmp query.exp query.out
#   
not ok 4 - powerman can turn on Enlcosures
FAIL: t0039-llnl-el-capitan-cluster.t 4 - powerman can turn on Enlcosures
#   
#       $powerman -h $testaddr -1 $CMMSTR >on.out &&
#       is_successful on.out
#   
not ok 5 - powerman -q shows Enclosures on
FAIL: t0039-llnl-el-capitan-cluster.t 5 - powerman -q shows Enclosures on
#   
#       $powerman -h $testaddr -q >query2.out &&
#       makeoutput "$CMMSTR" "$NODESTR,$BLADESTR,$PERIFSTR" "" >query2.exp &&
#       test_cmp query2.exp query2.out
#   
not ok 6 - powerman can turn on Blades + Perifs
FAIL: t0039-llnl-el-capitan-cluster.t 6 - powerman can turn on Blades + Perifs
#   
#       $powerman -h $testaddr -1 $BLADESTR,$PERIFSTR >on2.out &&
#       is_successful on2.out
#   
not ok 7 - powerman -q shows Enclosures + Blades + Perifs on
FAIL: t0039-llnl-el-capitan-cluster.t 7 - powerman -q shows Enclosures + Blades + Perifs on
#   
#       $powerman -h $testaddr -q >query3.out &&
#       makeoutput "$BLADESTR,$CMMSTR,$PERIFSTR" "$NODESTR" >query3.exp &&
#       test_cmp query3.exp query3.out
#   
not ok 8 - powerman can turn on Nodes
FAIL: t0039-llnl-el-capitan-cluster.t 8 - powerman can turn on Nodes
#   
#       $powerman -h $testaddr -1 $NODESTR >on3.out &&
#       is_successful on3.out
#   
not ok 9 - powerman -q shows all on
FAIL: t0039-llnl-el-capitan-cluster.t 9 - powerman -q shows all on
#   
#       $powerman -h $testaddr -q >query4.out &&
#       makeoutput "$ALLSTR" "" "" >query4.exp &&
#       test_cmp query4.exp query4.out
#   
not ok 10 - powerman -0 all works
FAIL: t0039-llnl-el-capitan-cluster.t 10 - powerman -0 all works
#   
#       $powerman -h $testaddr -0 $ALLSTR >off.out &&
#       is_successful off.out
#   
not ok 11 - powerman -q shows all off
FAIL: t0039-llnl-el-capitan-cluster.t 11 - powerman -q shows all off
#   
#       $powerman -h $testaddr -q >query5.out &&
#       makeoutput "" "$ALLSTR" "" >query5.exp &&
#       test_cmp query5.exp query5.out
#   
not ok 12 - powerman -q works with giant input
FAIL: t0039-llnl-el-capitan-cluster.t 12 - powerman -q works with giant input
#   
#       nodes=$(echo elcap\[$(seq -s, 0 2 16382)\]) &&
#       $powerman -h $testaddr -q $nodes >query6.out &&
#       makeoutput "" "$nodes" "" >query6.exp &&
#       test_cmp query6.exp query6.out
#   
ok 13 - stop powerman daemon
PASS: t0039-llnl-el-capitan-cluster.t 13 - stop powerman daemon
# failed 11 among 13 test(s)
1..13
ERROR: t0039-llnl-el-capitan-cluster.t - exited with status 1

============================================================================
Testsuite summary for powerman 2.4.4
============================================================================
# TOTAL: 776
# PASS:  755
# SKIP:  8
# XFAIL: 1
# FAIL:  11
# XPASS: 0
# ERROR: 1

Personally, I am not an expert in riscv or powerman, so any suggestions or hints are appreciated.

garlick commented 1 month ago

Hi @hosiet - that test is a fairly large-scale one. I wonder if it is running the builder out of memory, or just running slowly and exceeding some timeouts. It also tries to raise the soft limit on open files to 2048, which could fail depending on the hard limit in the test environment.
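To rule the file-descriptor limit in or out, the soft and hard limits can be inspected in the build environment before running the test. The following is a minimal sketch, not part of the powerman test suite; the helper name `can_raise` and the variable `want` are made up for illustration, and 2048 is the value the test passes to `ulimit -n`.

```shell
# Value the test requests via `ulimit -n 2048`.
want=2048

# Current soft and hard open-file limits for this shell.
soft=$(ulimit -S -n)
hard=$(ulimit -H -n)
echo "soft=$soft hard=$hard"

# can_raise WANT HARD: succeeds if a soft limit of WANT fits under HARD.
# (Hypothetical helper, named here only for this sketch.)
can_raise() {
    [ "$2" = "unlimited" ] && return 0
    [ "$1" -le "$2" ]
}

if can_raise "$want" "$hard"; then
    echo "ulimit -n $want should succeed"
else
    echo "ulimit -n $want would fail: hard limit is $hard"
fi
```

If the second message is printed on the builder, the `ulimit -n 2048` line in the test is a likely contributor to the failure.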

Our test suite probably needs some work to capture more detail on failure. What I would do manually is run the test directly with a verbose option, e.g.

$ cd t
$ ./t0039-llnl-el-capitan-cluster.t -v

But anyway, we could disable that test by default and selectively enable it in CI. Could you try with this patch?

diff --git a/t/t0039-llnl-el-capitan-cluster.t b/t/t0039-llnl-el-capitan-cluster.t
index cf0d780..c572547 100755
--- a/t/t0039-llnl-el-capitan-cluster.t
+++ b/t/t0039-llnl-el-capitan-cluster.t
@@ -4,6 +4,12 @@ test_description='Check LLNL El Capitan config'

 . `dirname $0`/sharness.sh

+test -n "$TEST_LONG" && test_set_prereq LONGTEST
+if ! test_have_prereq LONGTEST; then
+        skip_all='skipping large scale El Capitan test'
+        test_done
+fi
+
 ulimit -n 2048

 powermand=$SHARNESS_BUILD_DIRECTORY/src/powerman/powermand

garlick commented 1 month ago

@hosiet - please let us know if the just-merged fix doesn't resolve this.
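Assuming the merged fix matches the patch proposed above, the test is skipped unless `TEST_LONG` is set to a non-empty value. The sketch below reproduces only the `test -n "$TEST_LONG"` check in plain shell, without sharness; the function name `long_test_enabled` is hypothetical.

```shell
# Stand-in for the gate in the patch: `test -n "$TEST_LONG"`.
# Any non-empty value enables the test, including "0".
long_test_enabled() {
    test -n "$1"
}

for value in "" "t" "0"; do
    if long_test_enabled "$value"; then
        echo "TEST_LONG='$value': test runs"
    else
        echo "TEST_LONG='$value': test skipped"
    fi
done
```

Under the same assumption, running `TEST_LONG=t ./t0039-llnl-el-capitan-cluster.t -v` from the `t` directory would enable the test with verbose output. Note that the check is for a non-empty string, so `TEST_LONG=0` also enables it.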

hosiet commented 1 month ago

Thanks for the patch that disables this particular test in the post-build tests. The build is now OK; see https://buildd.debian.org/status/package.php?p=powerman

We could probably run that problematic test in CI rather than in the post-build tests. Debian has CI infrastructure for this, and I can check whether running the test there with the verbose option enabled yields more useful debugging information.

garlick commented 1 month ago

I just tried that test on a Raspberry Pi 4 with 2GB RAM running Raspbian 12 and got an OOM kill:

[35622.739829] Out of memory: Killed process 10905 (powermand) total-vm:557444kB, anon-rss:497408kB, file-rss:1792kB, shmem-rss:0kB, UID:5588 pgtables:1116kB oom_score_adj:0

It worked (slowly) when I re-ran it on a Pi 4 with 4GB RAM.

My guess is that running out of memory is the problem, and you probably don't need to take it further. The test doesn't cover unique functionality; it is just a scaling test.
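For anyone comparing against their own builder, the memory figures in the OOM log line quoted above can be pulled out and converted to MiB. This is a rough sketch that parses that one line; the helper name `kb_field` is made up for illustration, and the format of kernel OOM messages may differ between kernel versions.

```shell
# The OOM kill line quoted in the comment above, verbatim.
line='[35622.739829] Out of memory: Killed process 10905 (powermand) total-vm:557444kB, anon-rss:497408kB, file-rss:1792kB, shmem-rss:0kB, UID:5588 pgtables:1116kB oom_score_adj:0'

# kb_field NAME LINE: print the number from a "NAME:<n>kB" field in LINE.
# (Hypothetical helper, named here only for this sketch.)
kb_field() {
    printf '%s\n' "$2" | sed -n "s/.*$1:\([0-9][0-9]*\)kB.*/\1/p"
}

anon_kb=$(kb_field anon-rss "$line")
vm_kb=$(kb_field total-vm "$line")

# Integer division, so the MiB values are rounded down.
echo "anon-rss: $((anon_kb / 1024)) MiB, total-vm: $((vm_kb / 1024)) MiB"
```

For this line, powermand held roughly 485 MiB of anonymous memory when it was killed.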