ICLDisco / dplasma

DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.

getrf_1d result suspicious on Guyot #115

Open abouteiller opened 5 months ago

abouteiller commented 5 months ago

Describe the bug

The residual check reports "Solution is suspicious" when running the ctest dplasma_dgetrf_1d_mpi. The failure is deterministic, but it happens only on the Guyot system (w/o GPU). The same setup never fails on Leconte. Switching between gcc 11, 12, and 13, or between OpenBLAS and MKL, produces the same errors in the same cases.

To Reproduce

DPLASMA: 416aec96 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #109 from abouteiller/bugfix/dtd_gpu (Aurelien Bouteiller, 2 weeks ago)
PaRSEC: icldisco/parsec#adabbd4d1fb580358a32d489df19fa9c05a316e1 parsec (v1.1.0-4718-gadabbd4d)

SLURM_TIMELIMIT=1 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wguyot ctest -R dplasma_dgetrf_1d_mpi --repeat until-fail:1 --verbose
salloc: Granted job allocation 5500
UpdateCTestConfiguration  from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
UpdateCTestConfiguration  from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Test project /home/bouteill/parsec/dplasma/build.cuda
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 340
    Start 340: dplasma_dgetrf_1d_mpi

340: Test command: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec "-n" "4" "./testing_dgetrf_1d" "-N" "378" "-t" "19" "-P" "1" "-x" "-v=5"
340: Working Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
340: Environment variables:
340:  PARSEC_MCA_device_cuda_enabled=0
340:  PARSEC_MCA_device_hip_enabled=0
340:  PARSEC_MCA_device_level_zero_enabled=0
340:  PARSEC_MCA_device_cuda_memory_use=10
340:  PARSEC_MCA_device_hip_memory_use=10
340:  PARSEC_MCA_device_level_zero_memory_use=10
340: Test timeout computed to be: 1500
340: W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
340: [   2] TIME(s)      0.11725 : PaRSEC initialized
340: #+++++ cores detected       : 128
340: #+++++ nodes x cores + gpu  : 4 x 128 + 0 (512+0)
340: #+++++ thread mode          : THREAD_SERIALIZED
340: #+++++ P x Q                : 1 x 4 (4/4)
340: #+++++ M x N x K|NRHS       : 378 x 378 x 1
340: #+++++ LDA , LDB            : 378 , 378
340: #+++++ MB x NB , IB         : 19 x 19 , 40
340: [   0] TIME(s)      0.11894 : PaRSEC initialized
340: [   3] TIME(s)      0.11955 : PaRSEC initialized
340: [   1] TIME(s)      0.12168 : PaRSEC initialized
340: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
340:    This is often unintentional, and will perform poorly.
340:    Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
340:    and hide the real binding from PaRSEC; if you verified that the binding is correct,
340:    this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Computing getrf ... [****] TIME(s)      9.45201 : dgetrf_1d    PxQxg=   1 4   0 NB=   19 N=     378 :       0.003802 gflops - ENQ&PROG&DEST      9.52389 :       0.003773 gflops - ENQ      0.04388 - DEST      0.02800
340: +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   0 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       756 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       756 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: <DartMeasurement name="performance" type="numeric/double"
340:                  encoding="none" compression="none">
340: 0.0038019
340: </DartMeasurement>
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   1 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       811 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       811 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   3 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       906 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       906 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   2 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       861 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       861 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: ============
340: Checking the Residual of the solution
340: -- ||A||_oo = 1.025373e+02, ||X||_oo = 1.202008e+01, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 3.394100e+01
340: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 6.559297e+11
340: -- Solution is suspicious !
340: --------------------------------------------------------------------------
340: Primary job  terminated normally, but 1 process returned
340: a non-zero exit code. Per user-direction, the job has been aborted.
340: --------------------------------------------------------------------------
340: --------------------------------------------------------------------------
340: mpiexec detected that one or more processes exited with non-zero status, thus causing
340: the job to be terminated. The first process to do so was:
340:
340:   Process name: [[26343,1],3]
340:   Exit code:    1
340: --------------------------------------------------------------------------
1/1 Test #340: dplasma_dgetrf_1d_mpi ............***Failed   18.75 sec

0% tests passed, 1 tests failed out of 1

Label Time Summary:
dplasma    =  18.75 sec*proc (1 test)
mpi        =  18.75 sec*proc (1 test)

Total Test time (real) =  18.77 sec

The following tests FAILED:
        340 - dplasma_dgetrf_1d_mpi (Failed)
Errors while running CTest
Output from these tests are in: /home/bouteill/parsec/dplasma/build.cuda/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
salloc: Relinquishing job allocation 5500
module list
Currently Loaded Modulefiles:
 1) ncurses/6.4/gcc-11.3.1-6rvznd           34) pmix/3.2.3/gcc-11.3.1-b6ek7p                         67) mpfr/4.2.0/gcc-11.3.1-n3mu53
 2) htop/3.2.2/gcc-11.3.1-xm6i3t            35) slurm/22.05.9/gcc-11.3.1-yqiafz                      68) mpc/1.3.1/gcc-11.3.1-2x6jci
 3) nghttp2/1.52.0/gcc-11.3.1-yzhzx5        36) gdrcopy/2.3/gcc-11.3.1-zm6nhb                        69) gcc/13.2.0/gcc-11.3.1-ir6jns
 4) zlib/1.2.13/gcc-11.3.1-uhneca           37) libnl/3.3.0/gcc-11.3.1-s2rfpt                        70) openblas/0.3.23/gcc-11.3.1-zo7k5r
 5) openssl/3.1.2/gcc-11.3.1-w3u2b2         38) rdma-core/41.0/gcc-11.3.1-zlh7l5
 6) curl/8.1.2/gcc-11.3.1-dhcq4d            39) ucx/1.14.0/gcc-11.3.1-6ffd5t
 7) libmd/1.0.4/gcc-11.3.1-yl2qth           40) openmpi/4.1.5/gcc-11.3.1-2rgaqk
 8) libbsd/0.11.7/gcc-11.3.1-rxtb5h         41) gperf/3.1/gcc-11.3.1-lq7yw2
 9) expat/2.5.0/gcc-11.3.1-z3mywy           42) jemalloc/5.3.0/gcc-11.3.1-gnjgyl
10) bzip2/1.0.8/gcc-11.3.1-g7buii           43) libuv/1.44.1/gcc-11.3.1-ikknoi
11) libiconv/1.17/gcc-11.3.1-h5tewp         44) unzip/6.0/gcc-11.3.1-xm5nhk
12) xz/5.4.1/gcc-11.3.1-ybherp              45) lua-luajit-openresty/2.1-20230410/gcc-11.3.1-lgkuf6
13) libxml2/2.10.3/gcc-11.3.1-jijod2        46) libluv/1.44.2-1/gcc-11.3.1-pyqvat
14) pigz/2.7/gcc-11.3.1-2ysjo2              47) unibilium/2.0.0/gcc-11.3.1-az5pko
15) zstd/1.5.5/gcc-11.3.1-maqtnh            48) libtermkey/0.22/gcc-11.3.1-gwvd67
16) tar/1.34/gcc-11.3.1-jl543d              49) libvterm/0.3.1/gcc-11.3.1-we43r4
17) gettext/0.21.1/gcc-11.3.1-sgm6rr        50) lua-lpeg/1.0.2-1/gcc-11.3.1-6e6xv6
18) libunistring/1.1/gcc-11.3.1-mswbrm      51) msgpack-c/3.1.1/gcc-11.3.1-pzscaq
19) libidn2/2.3.4/gcc-11.3.1-kp77oe         52) lua-mpack/1.0.9/gcc-11.3.1-z26msa
20) krb5/1.20.1/gcc-11.3.1-hb7cxy           53) tree-sitter/0.20.8/gcc-11.3.1-pgy6wn
21) libedit/3.1-20210216/gcc-11.3.1-b2res4  54) neovim/0.9.1/gcc-11.3.1-aro6rp
22) libxcrypt/4.4.35/gcc-11.3.1-v7ot4t      55) cmake/3.26.3/gcc-11.3.1-6bgawm
23) openssh/9.3p1/gcc-11.3.1-jo2led         56) ninja/1.11.1/gcc-11.3.1-qf72ao
24) pcre2/10.42/gcc-11.3.1-bk6jhf           57) gmp/6.2.1/gcc-11.3.1-c5vz5h
25) berkeley-db/18.1.40/gcc-11.3.1-yl6wjj   58) libffi/3.4.4/gcc-11.3.1-suq3vd
26) readline/8.2/gcc-11.3.1-b26lae          59) sqlite/3.42.0/gcc-11.3.1-trzf26
27) gdbm/1.23/gcc-11.3.1-6u5vme             60) util-linux-uuid/2.38.1/gcc-11.3.1-h4vnny
28) perl/5.38.0/gcc-11.3.1-r63sx3           61) python/3.10.12/gcc-11.3.1-msankb
29) git/2.41.0/gcc-11.3.1-tx4xbg            62) gdb/13.1/gcc-11.3.1-awps3c
30) cuda/11.8.0/gcc-11.3.1-vltbfy           63) libevent/2.1.12/gcc-11.3.1-iqf4hw
31) libpciaccess/0.17/gcc-11.3.1-qp6jxc     64) tmux/3.3a/gcc-11.3.1-nt2vwg
32) hwloc/2.9.1/gcc-11.3.1-hvnu6p           65) cscope/15.9/gcc-11.3.1-4duk6k
33) numactl/2.0.14/gcc-11.3.1-x35xlq        66) exuberant-ctags/5.8/gcc-11.3.1-f56ide

abouteiller commented 1 month ago

Error also seen on Apple M1 Max

bosilca commented 1 month ago

Works just fine for me on M1 and M3 Pro with Sonoma 14.4.1.

============
Checking the Residual of the solution
-- ||A||_oo = 1.025373e+02, ||X||_oo = 1.662771e+00, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 3.635980e-14
-- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 5.066797e-03
-- Solution is CORRECT !
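
For anyone comparing the two residual reports in this thread, here is a minimal Python sketch that recomputes the normalized residual from the printed norms, using the formula shown in the log lines. It is not taken from the DPLASMA test code. Assumptions: eps is taken as 2**-53 (about 1.11e-16), chosen because it reproduces both logged figures; N = 378 is assumed for the passing run, since that excerpt does not print it; and THRESHOLD is a hypothetical cutoff used only to label the two results.

```python
# Recompute the normalized residual reported by the check from the logged norms:
#   ||Ax-B||_oo / ((||A||_oo * ||x||_oo + ||B||_oo) * N * eps)

# Assumption: eps is the unit roundoff 2**-53 (~1.11e-16). This value matches
# the figures printed in the thread; it is not read from the test source.
EPS = 2.0 ** -53

# Assumption: hypothetical acceptance cutoff, for illustration only.
THRESHOLD = 60.0


def normalized_residual(a_norm, x_norm, b_norm, r_norm, n, eps=EPS):
    """Scale the residual norm by the problem size and the norms of A, x, B."""
    return r_norm / ((a_norm * x_norm + b_norm) * n * eps)


# Norms copied from the failing run on Guyot (N = 378 from the log header).
guyot = normalized_residual(1.025373e+02, 1.202008e+01, 5.0e-01, 3.394100e+01, 378)

# Norms copied from the passing run; N = 378 is assumed here.
passing = normalized_residual(1.025373e+02, 1.662771e+00, 5.0e-01, 3.635980e-14, 378)

for label, value in (("failing run", guyot), ("passing run", passing)):
    # "not value <= THRESHOLD" also flags NaN as suspicious.
    verdict = "suspicious" if not value <= THRESHOLD else "correct"
    print(f"{label}: {value:.6e} -> {verdict}")
```

Both values agree with the logged ones to within the rounding of the printed norms. The failing run is off by roughly fourteen orders of magnitude, with ||A X - B||_oo of order ||A||_oo itself, so this looks like a wrong factorization or solve rather than accumulated rounding error.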