DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.
getrf_1d result suspicious on Guyot #115

Open abouteiller opened 9 months ago

abouteiller commented 9 months ago

Describe the bug

The result is reported as suspicious when running ctest dgetrf_1d_mpi. The failure is deterministic but occurs only on the Guyot system (without GPU); the same setup never fails on Leconte. Switching between gcc 11, 12, and 13, or between OpenBLAS and MKL, produces the same errors in the same cases.

To Reproduce

416aec96 (HEAD -> master, origin/master, origin/HEAD) Merge pull request #109 from abouteiller/bugfix/dtd_gpu Aurelien Bouteiller 2 weeks ago icldisco/parsec#adabbd4d1fb580358a32d489df19fa9c05a316e1 parsec (v1.1.0-4718-gadabbd4d)

SLURM_TIMELIMIT=1 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wguyot ctest -R dplasma_dgetrf_1d_mpi --repeat until-fail:1 --verbose
salloc: Granted job allocation 5500
test 340
    Start 340: dplasma_dgetrf_1d_mpi

340: Test command: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec "-n" "4" "./testing_dgetrf_1d" "-N" "378" "-t" "19" "-P" "1" "-x" "-v=5"
340: Working Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
340: Environment variables:
340:  PARSEC_MCA_device_cuda_enabled=0
340:  PARSEC_MCA_device_hip_enabled=0
340:  PARSEC_MCA_device_level_zero_enabled=0
340:  PARSEC_MCA_device_cuda_memory_use=10
340:  PARSEC_MCA_device_hip_memory_use=10
340:  PARSEC_MCA_device_level_zero_memory_use=10
340: [   2] TIME(s)      0.11725 : PaRSEC initialized
340: #+++++ cores detected       : 128
340: #+++++ nodes x cores + gpu  : 4 x 128 + 0 (512+0)
340: #+++++ thread mode          : THREAD_SERIALIZED
340: #+++++ P x Q                : 1 x 4 (4/4)
340: #+++++ M x N x K|NRHS       : 378 x 378 x 1
340: #+++++ LDA , LDB            : 378 , 378
340: #+++++ MB x NB , IB         : 19 x 19 , 40
340: [   0] TIME(s)      0.11894 : PaRSEC initialized
340: [   3] TIME(s)      0.11955 : PaRSEC initialized
340: [   1] TIME(s)      0.12168 : PaRSEC initialized
340: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
340:    This is often unintentional, and will perform poorly.
340:    Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
340:    and hide the real binding from PaRSEC; if you verified that the binding is correct,
340:    this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Generate matrices ... Done
340: +++ Computing getrf ... [****] TIME(s)      9.45201 : dgetrf_1d    PxQxg=   1 4   0 NB=   19 N=     378 :       0.003802 gflops - ENQ&PROG&DEST      9.52389 :       0.003773 gflops - ENQ      0.04388 - DEST      0.02800
340: +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   0 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       756 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       756 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   1 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       811 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       811 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   3 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       906 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       906 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: +++ Computing getrf ... +----------------------------------------------------------------------------------------------------------------------------+
340: |         |                    |                       Data In                              |         Data Out               |
340: |Rank   2 |  # KERNEL |    %   |  Required  |   Transfered H2D(%)   |   Transfered D2D(%)   |  Required  |   Transfered(%)   |
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |  Dev  0 |       861 | 100.00 |     0.00 B |       0.00 B( -nan)   |       0.00 B( -nan)   |     0.00 B |     0.00 B( -nan) | cpu-cores
340: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
340: |All Devs |       861 | 100.00 |     0.00 B |       1.00 B(nan)   |       0.00 B(nan)   |     0.00 B |     1.00 B(nan) |
340: +----------------------------------------------------------------------------------------------------------------------------+
340: Done.
340: Checking the Residual of the solution
340: -- ||A||_oo = 1.025373e+02, ||X||_oo = 1.202008e+01, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 3.394100e+01
340: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 6.559297e+11
340: -- Solution is suspicious !
1/1 Test #340: dplasma_dgetrf_1d_mpi ............***Failed   18.75 sec

abouteiller commented 6 months ago

Error also seen on Apple M1 Max

bosilca commented 6 months ago

Works just fine for me on M1 and M3 Pro with Sonoma 14.4.1.

Checking the Residual of the solution
-- ||A||_oo = 1.025373e+02, ||X||_oo = 1.662771e+00, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 3.635980e-14
-- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 5.066797e-03
-- Solution is CORRECT !
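
For readers comparing the two runs, the scaled residual printed by the checker can be recomputed from the norms in the logs. The following is a minimal sketch, not the DPLASMA test code: the function name `scaled_residual` is hypothetical, and the value of `eps` (the double-precision unit roundoff, 2^-53) is an assumption chosen because it appears consistent with the logged ratios.

```python
import sys

# Assumption: eps is the double-precision unit roundoff 2**-53, i.e. half of
# sys.float_info.epsilon. This is not taken from the DPLASMA sources.
EPS = sys.float_info.epsilon / 2


def scaled_residual(norm_a, norm_x, norm_b, norm_r, n, eps=EPS):
    """Evaluate ||Ax-B||_oo / ((||A||_oo ||x||_oo + ||B||_oo) * N * eps)."""
    return norm_r / ((norm_a * norm_x + norm_b) * n * eps)


# Norms copied from the failing run on Guyot (N = 378).
guyot = scaled_residual(1.025373e+02, 1.202008e+01, 5.0e-01, 3.394100e+01, 378)

# Norms copied from the passing run on M1/M3 Pro (N = 378).
m1 = scaled_residual(1.025373e+02, 1.662771e+00, 5.0e-01, 3.635980e-14, 378)

print(f"Guyot scaled residual: {guyot:.6e}")
print(f"M1/M3 scaled residual: {m1:.6e}")
```

Under that assumption, both values should come out close to the ratios in the logs, up to rounding of the printed norms. Note that ||A||_oo is identical in the two runs while ||X||_oo and ||A X - B||_oo differ, so the input matrix appears to be the same and the discrepancy arises in the computed solution.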