ICLDisco / dplasma

DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.

potrf_dtd_mpi and gemm_dtd_mpi produce wrong results with 1 GPU #113

Open abouteiller opened 9 months ago

abouteiller commented 9 months ago

Describe the bug

The MPI POTRF DTD test produces wrong results when run with 1 GPU. The 1-node variant is correct.

It is not clear at the moment whether the issue is in the DTD testers for GEMM and POTRF or in PaRSEC.

Important note

After #114 this error will not manifest in normal ctest/CI runs (because the test is forced to run on CPU only), but it can still be reproduced by hand. The fix PR should add a specific DTD+GPU test that explicitly covers this case.

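Buggy output

The first run below disables CUDA (PARSEC_MCA_device_cuda_enabled=0) and passes; the second enables it (PARSEC_MCA_device_cuda_enabled=1) and fails the factorization check.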


❯ SLURM_TIMELIMIT=1 PARSEC_MCA_device_cuda_enabled=0 PARSEC_MCA_device_cuda_memory_use=10 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wleconte  ctest  -R spotrf_dtd_mpi --verbose
salloc: Granted job allocation 5295
UpdateCTestConfiguration  from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Test project /home/bouteill/parsec/dplasma/build.cuda
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 287
    Start 287: dplasma_spotrf_dtd_mpi

287: Test command: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec "-n" "4" "./testing_spotrf_dtd" "-N" "378" "-t" "19" "-x" "-v=5"
287: Working Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
287: Test timeout computed to be: 1500
287: i@00003 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
287: i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: i@00002 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: [   3] TIME(s)      0.04076 : PaRSEC initialized
287: [   1] TIME(s)      0.04855 : PaRSEC initialized
287: [   2] TIME(s)      0.04899 : PaRSEC initialized
287: #+++++ cores detected       : 40
287: #+++++ nodes x cores + gpu  : 4 x 40 + 0 (160+0)
287: #+++++ thread mode          : THREAD_SERIALIZED
287: #+++++ P x Q                : 4 x 1 (4/4)
287: #+++++ M x N x K|NRHS       : 378 x 378 x 1
287: #+++++ LDA                  : 378
287: #+++++ MB x NB              : 19 x 19
287: [   0] TIME(s)      0.06141 : PaRSEC initialized
287: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
287:    This is often unintentional, and will perform poorly.
287:    Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
287:    and hide the real binding from PaRSEC; if you verified that the binding is correct,
287:    this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: [****] TIME(s)      0.13299 :      PxQ=   4 1   NB=   19 N=     378 :       0.135908 gflops
287: ============
287: Checking the Cholesky factorization
287: -- ||A||_oo = 4.826267e+02, ||L'L-A||_oo = 1.256157e-04
287: -- ||L'L-A||_oo/(||A||_oo.N.eps) = 1.155209e-02
287: -- Factorization is CORRECT !
287: ============
287: Checking the Residual of the solution
287: -- ||A||_oo = 4.826267e+02, ||X||_oo = 1.350270e-03, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 4.668881e-07
287: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 1.799328e-02
287: -- Solution is CORRECT !
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   0 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2938 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2938 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0
287:    0        -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   2 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2978 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2978 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0
287:    0        -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   1 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2963 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2963 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0
287:    0        -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   3 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2963 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2963 | 100.00 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0
287:    0        -
1/1 Test #287: dplasma_spotrf_dtd_mpi ...........   Passed    1.79 sec

The following tests passed:
        dplasma_spotrf_dtd_mpi

100% tests passed, 0 tests failed out of 1

Label Time Summary:
dplasma    =   1.79 sec*proc (1 test)
mpi        =   1.79 sec*proc (1 test)

Total Test time (real) =   1.81 sec
salloc: Relinquishing job allocation 5295

~/parsec/dplasma/build.cuda  cleanup/ngpus-match-g *1 !2 ?13  bouteill@methane  16:32:00
❯ SLURM_TIMELIMIT=1 PARSEC_MCA_device_cuda_enabled=1 PARSEC_MCA_device_cuda_memory_use=10 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wleconte  ctest  -R spotrf_dtd_mpi --verbose
salloc: Granted job allocation 5296
UpdateCTestConfiguration  from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Test project /home/bouteill/parsec/dplasma/build.cuda
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 287
    Start 287: dplasma_spotrf_dtd_mpi

287: Test command: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec "-n" "4" "./testing_spotrf_dtd" "-N" "378" "-t" "19" "-x" "-v=5"
287: Working Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
287: Test timeout computed to be: 1500
287: W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
287: i@00002 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287:    Location (PCI Bus/Device/Domain): 6:0.0
287:    SM                 : 80
287:    Frequency (GHz)    : 1.530000
287:    peak Tflop/s       : 7.83 fp64, 15.67 fp32,     125.34 tf32,    31.33 fp16
287:    Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287:    concurrency        : yes
287:    computeMode        : 0
287: i@00002 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: i@00001 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287:    Location (PCI Bus/Device/Domain): 6:0.0
287:    SM                 : 80
287:    Frequency (GHz)    : 1.530000
287:    peak Tflop/s       : 7.83 fp64, 15.67 fp32,     125.34 tf32,    31.33 fp16
287:    Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287:    concurrency        : yes
287:    computeMode        : 0
287: i@00000 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287:    Location (PCI Bus/Device/Domain): 6:0.0
287:    SM                 : 80
287:    Frequency (GHz)    : 1.530000
287:    peak Tflop/s       : 7.83 fp64, 15.67 fp32,     125.34 tf32,    31.33 fp16
287:    Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287:    concurrency        : yes
287:    computeMode        : 0
287: i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: i@00003 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287:    Location (PCI Bus/Device/Domain): 6:0.0
287:    SM                 : 80
287:    Frequency (GHz)    : 1.530000
287:    peak Tflop/s       : 7.83 fp64, 15.67 fp32,     125.34 tf32,    31.33 fp16
287:    Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287:    concurrency        : yes
287:    computeMode        : 0
287: i@00003 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287:    Parsec Streams     : 40
287:    Frequency (GHz)    : 2.20
287:    Peak Tflop/s       : 0.3520 fp64,       0.7040 fp32
287: [   2] TIME(s)      0.80787 : PaRSEC initialized
287: [   1] TIME(s)      0.81451 : PaRSEC initialized
287: #+++++ cores detected       : 40
287: #+++++ nodes x cores + gpu  : 4 x 40 + 1 (160+4)
287: #+++++ thread mode          : THREAD_SERIALIZED
287: #+++++ P x Q                : 4 x 1 (4/4)
287: #+++++ M x N x K|NRHS       : 378 x 378 x 1
287: #+++++ LDA                  : 378
287: #+++++ MB x NB              : 19 x 19
287: [   0] TIME(s)      0.81472 : PaRSEC initialized
287: [   3] TIME(s)      0.82741 : PaRSEC initialized
287: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
287:    This is often unintentional, and will perform poorly.
287:    Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
287:    and hide the real binding from PaRSEC; if you verified that the binding is correct,
287:    this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: [****] TIME(s)      0.30603 :      PxQ=   4 1   NB=   19 N=     378 :       0.059063 gflops
287: ============
287: Checking the Cholesky factorization
287: -- ||A||_oo = 4.826267e+02, ||L'L-A||_oo = 8.671714e-01
287: -- ||L'L-A||_oo/(||A||_oo.N.eps) = 7.974834e+01
287: -- Factorization is suspicious !
287: ============
287: Checking the Residual of the solution
287: -- ||A||_oo = 4.826267e+02, ||X||_oo = 1.349938e-03, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 1.503229e-04
287: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 5.794063e+00
287: -- Solution is CORRECT !
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   0 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2463 |  83.83 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |  Dev  1 |       475 |  16.17 |     1.82MB |     274.98KB(14.77)   |       0.00 B( 0.00)   |   669.82KB |    97.30KB(14.53) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2938 | 100.00 |     1.82MB |     274.98KB(14.77)   |       0.00 B( 0.00)   |   669.82KB |    97.30KB(14.53) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0          1
287:    0        -         97.30KB
287:    1      274.98KB     -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   3 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2488 |  83.97 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |  Dev  1 |       475 |  16.03 |     1.82MB |     253.83KB(13.64)   |       0.00 B( 0.00)   |   669.82KB |    76.15KB(11.37) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2963 | 100.00 |     1.82MB |     253.83KB(13.64)   |       0.00 B( 0.00)   |   669.82KB |    76.15KB(11.37) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0          1
287:    0        -         76.15KB
287:    1      253.83KB     -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   1 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2478 |  83.63 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |  Dev  1 |       485 |  16.37 |     1.86MB |     269.34KB(14.15)   |       0.00 B( 0.00)   |   683.93KB |    91.66KB(13.40) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2963 | 100.00 |     1.86MB |     269.34KB(14.15)   |       0.00 B( 0.00)   |   683.93KB |    91.66KB(13.40) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0          1
287:    0        -         91.66KB
287:    1      269.34KB     -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: |         |                    |                       Data In                              |         Data Out               |
287: |Rank   2 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |  Dev  0 |      2493 |  83.71 |     0.00 B |       0.00 B(  nan)   |       0.00 B(  nan)   |     0.00 B |     0.00 B(  nan) | cpu-cores
287: |  Dev  1 |       485 |  16.29 |     1.86MB |     262.29KB(13.78)   |       0.00 B( 0.00)   |   683.93KB |    84.61KB(12.37) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs |      2978 | 100.00 |     1.86MB |     262.29KB(13.78)   |       0.00 B( 0.00)   |   683.93KB |    84.61KB(12.37) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src          0          1
287:    0        -         84.61KB
287:    1      262.29KB     -
287: --------------------------------------------------------------------------
287: Primary job  terminated normally, but 1 process returned
287: a non-zero exit code. Per user-direction, the job has been aborted.
287: --------------------------------------------------------------------------
287: --------------------------------------------------------------------------
287: mpiexec detected that one or more processes exited with non-zero status, thus causing
287: the job to be terminated. The first process to do so was:
287:
287:   Process name: [[42679,1],1]
287:   Exit code:    1
287: --------------------------------------------------------------------------
1/1 Test #287: dplasma_spotrf_dtd_mpi ...........***Failed    3.08 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
dplasma    =   3.08 sec*proc (1 test)
mpi        =   3.08 sec*proc (1 test)

Total Test time (real) =   3.09 sec

The following tests FAILED:
        287 - dplasma_spotrf_dtd_mpi (Failed)
Errors while running CTest
Output from these tests are in: /home/bouteill/parsec/dplasma/build.cuda/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
salloc: Relinquishing job allocation 5296
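
For reference, the scaled residuals printed by the two runs above can be recomputed from the logged norms. The following is a minimal sketch, assuming eps is the single-precision unit roundoff 2^-24 (this value reproduces the logged ratios) and assuming an acceptance threshold of 60, which is not printed anywhere in the log. The names factorization_ratio, solve_ratio, EPS_FP32 and THRESHOLD are hypothetical and are not taken from the tester.

```python
# Single-precision unit roundoff; assumed to match the eps used by the tester.
EPS_FP32 = 2.0 ** -24
# Acceptance threshold: an assumption for illustration, not taken from the log.
THRESHOLD = 60.0
N = 378
A_NORM = 4.826267e+02   # ||A||_oo, identical in both runs
B_NORM = 5.000000e-01   # ||B||_oo, identical in both runs


def factorization_ratio(r_norm):
    """||L'L-A||_oo / (||A||_oo * N * eps), using the logged norms."""
    return r_norm / (A_NORM * N * EPS_FP32)


def solve_ratio(res_norm, x_norm):
    """||Ax-B||_oo / ((||A||_oo * ||x||_oo + ||B||_oo) * N * eps)."""
    return res_norm / ((A_NORM * x_norm + B_NORM) * N * EPS_FP32)


# (label, ||L'L-A||_oo, ||A X - B||_oo, ||X||_oo) copied from the two runs.
runs = [
    ("cuda_enabled=0", 1.256157e-04, 4.668881e-07, 1.350270e-03),
    ("cuda_enabled=1", 8.671714e-01, 1.503229e-04, 1.349938e-03),
]

results = {}
for label, r_norm, res_norm, x_norm in runs:
    fact = factorization_ratio(r_norm)
    solve = solve_ratio(res_norm, x_norm)
    results[label] = (fact, solve)
    print(f"{label}: factorization {fact:.6e} "
          f"({'ok' if fact <= THRESHOLD else 'suspicious'}), "
          f"solve {solve:.6e} ({'ok' if solve <= THRESHOLD else 'suspicious'})")
```

With 1 GPU the factorization ratio grows from about 0.01 to about 80, while the solve ratio grows from about 0.018 to about 5.8. Under the assumed threshold, that would explain why the solution check still reports CORRECT even though the factorization is flagged as suspicious.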

Setup

❯ module list
Currently Loaded Modulefiles:
 1) ncurses/6.4/gcc-11.3.1-6rvznd           25) berkeley-db/18.1.40/gcc-11.3.1-yl6wjj                49) libvterm/0.3.1/gcc-11.3.1-we43r4
 2) htop/3.2.2/gcc-11.3.1-xm6i3t            26) readline/8.2/gcc-11.3.1-b26lae                       50) lua-lpeg/1.0.2-1/gcc-11.3.1-6e6xv6
 3) nghttp2/1.52.0/gcc-11.3.1-yzhzx5        27) gdbm/1.23/gcc-11.3.1-6u5vme                          51) msgpack-c/3.1.1/gcc-11.3.1-pzscaq
 4) zlib/1.2.13/gcc-11.3.1-uhneca           28) perl/5.38.0/gcc-11.3.1-r63sx3                        52) lua-mpack/1.0.9/gcc-11.3.1-z26msa
 5) openssl/3.1.2/gcc-11.3.1-w3u2b2         29) git/2.41.0/gcc-11.3.1-tx4xbg                         53) tree-sitter/0.20.8/gcc-11.3.1-pgy6wn
 6) curl/8.1.2/gcc-11.3.1-dhcq4d            30) cuda/11.8.0/gcc-11.3.1-vltbfy                        54) neovim/0.9.1/gcc-11.3.1-aro6rp
 7) libmd/1.0.4/gcc-11.3.1-yl2qth           31) libpciaccess/0.17/gcc-11.3.1-qp6jxc                  55) cmake/3.26.3/gcc-11.3.1-6bgawm
 8) libbsd/0.11.7/gcc-11.3.1-rxtb5h         32) hwloc/2.9.1/gcc-11.3.1-hvnu6p                        56) ninja/1.11.1/gcc-11.3.1-qf72ao
 9) expat/2.5.0/gcc-11.3.1-z3mywy           33) numactl/2.0.14/gcc-11.3.1-x35xlq                     57) gmp/6.2.1/gcc-11.3.1-c5vz5h
10) bzip2/1.0.8/gcc-11.3.1-g7buii           34) pmix/3.2.3/gcc-11.3.1-b6ek7p                         58) libffi/3.4.4/gcc-11.3.1-suq3vd
11) libiconv/1.17/gcc-11.3.1-h5tewp         35) slurm/22.05.9/gcc-11.3.1-yqiafz                      59) sqlite/3.42.0/gcc-11.3.1-trzf26
12) xz/5.4.1/gcc-11.3.1-ybherp              36) gdrcopy/2.3/gcc-11.3.1-zm6nhb                        60) util-linux-uuid/2.38.1/gcc-11.3.1-h4vnny
13) libxml2/2.10.3/gcc-11.3.1-jijod2        37) libnl/3.3.0/gcc-11.3.1-s2rfpt                        61) python/3.10.12/gcc-11.3.1-msankb
14) pigz/2.7/gcc-11.3.1-2ysjo2              38) rdma-core/41.0/gcc-11.3.1-zlh7l5                     62) gdb/13.1/gcc-11.3.1-awps3c
15) zstd/1.5.5/gcc-11.3.1-maqtnh            39) ucx/1.14.0/gcc-11.3.1-6ffd5t                         63) libevent/2.1.12/gcc-11.3.1-iqf4hw
16) tar/1.34/gcc-11.3.1-jl543d              40) openmpi/4.1.5/gcc-11.3.1-2rgaqk                      64) tmux/3.3a/gcc-11.3.1-nt2vwg
17) gettext/0.21.1/gcc-11.3.1-sgm6rr        41) gperf/3.1/gcc-11.3.1-lq7yw2                          65) cscope/15.9/gcc-11.3.1-4duk6k
18) libunistring/1.1/gcc-11.3.1-mswbrm      42) jemalloc/5.3.0/gcc-11.3.1-gnjgyl                     66) exuberant-ctags/5.8/gcc-11.3.1-f56ide
19) libidn2/2.3.4/gcc-11.3.1-kp77oe         43) libuv/1.44.1/gcc-11.3.1-ikknoi                       67) intel-oneapi-tbb/2021.10.0/gcc-11.3.1-ptv4p2
20) krb5/1.20.1/gcc-11.3.1-hb7cxy           44) unzip/6.0/gcc-11.3.1-xm5nhk                          68) intel-oneapi-mkl/2023.2.0/gcc-11.3.1-d5uffv
21) libedit/3.1-20210216/gcc-11.3.1-b2res4  45) lua-luajit-openresty/2.1-20230410/gcc-11.3.1-lgkuf6  69) mpfr/4.2.0/gcc-11.3.1-n3mu53
22) libxcrypt/4.4.35/gcc-11.3.1-v7ot4t      46) libluv/1.44.2-1/gcc-11.3.1-pyqvat                    70) mpc/1.3.1/gcc-11.3.1-2x6jci
23) openssh/9.3p1/gcc-11.3.1-jo2led         47) unibilium/2.0.0/gcc-11.3.1-az5pko                    71) gcc/13.2.0/gcc-11.3.1-ir6jns
24) pcre2/10.42/gcc-11.3.1-bk6jhf           48) libtermkey/0.22/gcc-11.3.1-gwvd67

abouteiller commented 4 days ago

Results still incorrect after #133 😠

PMIX_MCA_psec='' SLURM_TIMELIMIT=1 PARSEC_MCA_device_cuda_enabled=1 PARSEC_MCA_device_cuda_memory_use=10 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wleconte -n 8 --gpus-per-task=1 /usr/bin/srun "-n" "4" "tests/testing_spotrf_dtd" "-N" "378" "-t" "19" "-x" "-v=5"