Algebraic-Programming / LPF

A minimal communication layer for the implementation of immortal algorithms and for facilitating their broad use.
Apache License 2.0

On our Slurm nodes, the hybrid engine resets to 1 MPI process and 1 Pthread, and tests that expect > 1 process fail #33

Open KADichev opened 1 hour ago

KADichev commented 1 hour ago

Suppose we book a node, e.g. via

srun -p Cascade --ntasks 1 --cpus-per-task 32  -t 08:00:00 --pty /bin/bash

Note that this is NOT the typical way to request MPI resources, but I prefer it because it gives us many cores, which makes compilation fast.
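
To make visible what such an allocation exposes to a launcher, here is a minimal sketch that reads the standard Slurm environment variables `SLURM_NTASKS`, `SLURM_CPUS_PER_TASK` and `SLURM_JOB_NUM_NODES`. The helper name `slurm_allocation` is hypothetical and not part of LPF; whether lpfrun consults exactly these variables is an assumption on my side.

```python
import os


def slurm_allocation(env=None):
    """Return the task, CPU and node counts Slurm exports for this allocation.

    Hypothetical helper for illustration only; it is not part of LPF.
    """
    env = os.environ if env is None else env

    def as_int(name):
        value = env.get(name)
        # Slurm may not export a variable; report None instead of guessing.
        return int(value) if value is not None and value.isdigit() else None

    return {
        "ntasks": as_int("SLURM_NTASKS"),
        "cpus_per_task": as_int("SLURM_CPUS_PER_TASK"),
        "nodes": as_int("SLURM_JOB_NUM_NODES"),
    }


if __name__ == "__main__":
    # Run inside the srun shell to see what the allocation looks like.
    print(slurm_allocation())
```

Inside the shell booked with the srun command above, I would expect this to report 1 task and 32 CPUs per task, i.e. plenty of cores but only a single Slurm task.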

E.g. on the branch https://github.com/Algebraic-Programming/LPF/tree/functional_tests_use_gtest

most hybrid engine tests that check that we run with > 1 process will then fail. For example:

ctest -R hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc --verbose
UpdateCTestConfiguration  from :/home/kdichev/LPF/build-x86/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/kdichev/LPF/build-x86/DartConfiguration.tcl
Test project /home/kdichev/LPF/build-x86
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 583
    Start 583: hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc

583: Test command: /home/kdichev/LPF/build-x86/test_launcher.py "-e" "hybrid" "-L" "/home/kdichev/LPF/build-x86/lpfrun_build" "-p" "2" "-P" "5" "-t" "0.0" "-R" "0" "/home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug" "--gtest_filter=API.func_lpf_exec_multiple_call_single_arg_dual_proc" "--gtest_also_run_disabled_tests" "--gtest_output=xml:/home/kdichev/LPF/build-x86/junit/hybrid_func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug"
583: Working Directory: /home/kdichev/LPF/build-x86/tests/functional
583: Test timeout computed to be: 10000000
583: Running main() from /scratch/kdichev/.spack/stage/spack-stage-googletest-1.14.0-afvplm5m2qrmzvpapg7hx7dbfqff332z/spack-src/googletest/src/gtest_main.cc
583: Note: Google Test filter = API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: [==========] Running 1 test from 1 test suite.
583: [----------] Global test environment set-up.
583: [----------] 1 test from API
583: [ RUN      ] API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: /home/kdichev/LPF/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc.cpp:31: Failure
583: Expected equality of these values:
583:   nprocs
583:     Which is: 1
583:   2
583: 
583: /home/kdichev/LPF/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc.cpp:56: Failure
583: Expected equality of these values:
583:   nprocs
583:     Which is: 1
583:   2
583: 
583: [  FAILED  ] API.func_lpf_exec_multiple_call_single_arg_dual_proc (138 ms)
583: [----------] 1 test from API (138 ms total)
583: 
583: [----------] Global test environment tear-down
583: [==========] 1 test from 1 test suite ran. (138 ms total)
583: [  PASSED  ] 0 tests.
583: [  FAILED  ] 1 test, listed below:
583: [  FAILED  ] API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: 
583:  1 FAILED TEST
583: --------------------------------------------------------------------------
583: Primary job  terminated normally, but 1 process returned
583: a non-zero exit code. Per user-direction, the job has been aborted.
583: --------------------------------------------------------------------------
583: --------------------------------------------------------------------------
583: mpirun detected that one or more processes exited with non-zero status, thus causing
583: the job to be terminated. The first process to do so was:
583: 
583:   Process name: [[15695,1],0]
583:   Exit code:    1
583: --------------------------------------------------------------------------
583: Run command: 
583: ['/home/kdichev/LPF/build-x86/lpfrun_build', '-engine', 'hybrid', '-n', '2', '/home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug', '--gtest_filter=API.func_lpf_exec_multiple_call_single_arg_dual_proc', '--gtest_also_run_disabled_tests', '--gtest_output=xml:/home/kdichev/LPF/build-x86/junit/hybrid_func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug']
583: Test returned code = 1
583: Test /home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug--gtest_filter=API.func_lpf_exec_multiple_call_single_arg_dual_proc
583: returned   1
583: expected return code was: 0
1/1 Test #583: hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc ...***Failed    3.04 sec

0% tests passed, 1 tests failed out of 1

Total Test time (real) =   3.89 sec

The following tests FAILED:
    583 - hybrid_API.func_lpf_exec_multiple_call_single_arg_dual_proc (Failed)
Errors while running CTest
Output from these tests are in: /home/kdichev/LPF/build-x86/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.

KADichev commented 1 hour ago

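It seems to me the lpfrun launcher is broken. Although the test is launched with -n 2, it reports only 1 task and starts 1 process with 1 thread: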

HYBRID SLURM:   true
HYBRID TASKS:   1
HYBRID NODES:   1
HYBRID DEFAULT PROCESSES PER NODE: node
HYBRID PROCESS MAPPING: One process per compute node
HYBRID PINNING: exact pinning enabled
ws01 process 1 of 1: THREADS 1; PIN STRATEGY none; SPINLOCK FAST
ws01 process 1 of 1: CPUMASK  
ws01 process 1 of 1: EXECUTES /home/kdichev/LPF/build-x86/tests/functional/func_lpf_exec_multiple_call_single_arg_dual_proc_hybrid_Release_debug
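
The mismatch can be checked mechanically. The sketch below sums the `THREADS` values of the `process i of n` lines in the launcher output and compares the total with the requested process count. It assumes that the number of LPF processes in the hybrid engine equals the sum of threads over all MPI processes; the function names and the regular expression are mine and not part of LPF.

```python
import re

# Matches lines such as "ws01 process 1 of 1: THREADS 1; PIN STRATEGY none; ..."
PROCESS_LINE = re.compile(r"process (\d+) of (\d+): THREADS (\d+)")


def total_workers(log_text):
    """Sum the thread counts reported by all MPI processes in the log."""
    total = 0
    for line in log_text.splitlines():
        match = PROCESS_LINE.search(line)
        if match:
            total += int(match.group(3))
    return total


def launcher_honours_request(log_text, requested):
    """Return True if the launcher started at least `requested` workers."""
    return total_workers(log_text) >= requested


LOG = """\
HYBRID SLURM:   true
HYBRID TASKS:   1
HYBRID NODES:   1
ws01 process 1 of 1: THREADS 1; PIN STRATEGY none; SPINLOCK FAST
ws01 process 1 of 1: CPUMASK
"""

if __name__ == "__main__":
    print(total_workers(LOG), launcher_honours_request(LOG, requested=2))
```

For the output above this yields 1 worker against 2 requested, which matches the `nprocs` value of 1 seen in the failing assertions.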