DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.
potrf_dtd_mpi and gemm_dtd_mpi produce wrong results with 1 GPU #113
MPI POTRF DTD with 1 GPU produces wrong results. The 1-node variant is correct.
It is not yet clear whether the issue lies in the DTD testers for GEMM and POTRF or in PaRSEC.
Important note
After #114, this error will not manifest in normal ctest/CI runs (because the test is forced to run on CPU only), but it can still be reproduced by hand. The fix PR should add a dedicated DTD+GPU test that explicitly covers this case.
Buggy output
❯ SLURM_TIMELIMIT=1 PARSEC_MCA_device_cuda_enabled=0 PARSEC_MCA_device_cuda_memory_use=10 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wleconte ctest -R spotrf_dtd_mpi --verbose
salloc: Granted job allocation 5295
UpdateCTestConfiguration from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Test project /home/bouteill/parsec/dplasma/build.cuda
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 287
Start 287: dplasma_spotrf_dtd_mpi
287: Test command: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec "-n" "4" "./testing_spotrf_dtd" "-N" "378" "-t" "19" "-x" "-v=5"
287: Working Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
287: Test timeout computed to be: 1500
287: i@00003 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
287: i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: i@00002 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: [ 3] TIME(s) 0.04076 : PaRSEC initialized
287: [ 1] TIME(s) 0.04855 : PaRSEC initialized
287: [ 2] TIME(s) 0.04899 : PaRSEC initialized
287: #+++++ cores detected : 40
287: #+++++ nodes x cores + gpu : 4 x 40 + 0 (160+0)
287: #+++++ thread mode : THREAD_SERIALIZED
287: #+++++ P x Q : 4 x 1 (4/4)
287: #+++++ M x N x K|NRHS : 378 x 378 x 1
287: #+++++ LDA : 378
287: #+++++ MB x NB : 19 x 19
287: [ 0] TIME(s) 0.06141 : PaRSEC initialized
287: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
287: This is often unintentional, and will perform poorly.
287: Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
287: and hide the real binding from PaRSEC; if you verified that the binding is correct,
287: this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: [****] TIME(s) 0.13299 : PxQ= 4 1 NB= 19 N= 378 : 0.135908 gflops
287: ============
287: Checking the Cholesky factorization
287: -- ||A||_oo = 4.826267e+02, ||L'L-A||_oo = 1.256157e-04
287: -- ||L'L-A||_oo/(||A||_oo.N.eps) = 1.155209e-02
287: -- Factorization is CORRECT !
287: ============
287: Checking the Residual of the solution
287: -- ||A||_oo = 4.826267e+02, ||X||_oo = 1.350270e-03, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 4.668881e-07
287: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 1.799328e-02
287: -- Solution is CORRECT !
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 0 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2938 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2938 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0
287: 0 -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 2 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2978 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2978 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0
287: 0 -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 1 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2963 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2963 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0
287: 0 -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 3 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2963 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2963 | 100.00 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0
287: 0 -
1/1 Test #287: dplasma_spotrf_dtd_mpi ........... Passed 1.79 sec
The following tests passed:
dplasma_spotrf_dtd_mpi
100% tests passed, 0 tests failed out of 1
Label Time Summary:
dplasma = 1.79 sec*proc (1 test)
mpi = 1.79 sec*proc (1 test)
Total Test time (real) = 1.81 sec
salloc: Relinquishing job allocation 5295
❯ SLURM_TIMELIMIT=1 PARSEC_MCA_device_cuda_enabled=1 PARSEC_MCA_device_cuda_memory_use=10 OMPI_MCA_rmaps_base_oversubscribe=true salloc -wleconte ctest -R spotrf_dtd_mpi --verbose
salloc: Granted job allocation 5296
UpdateCTestConfiguration from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
UpdateCTestConfiguration from :/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Parse Config file:/home/bouteill/parsec/dplasma/build.cuda/DartConfiguration.tcl
Test project /home/bouteill/parsec/dplasma/build.cuda
Constructing a list of tests
Done constructing a list of tests
Updating test list for fixtures
Added 0 tests to meet fixture requirements
Checking test dependency graph...
Checking test dependency graph end
test 287
Start 287: dplasma_spotrf_dtd_mpi
287: Test command: /apps/spacks/2023-08-14/opt/spack/linux-rocky9-x86_64/gcc-11.3.1/openmpi-4.1.5-2rgaqk2wseegpmbdbbygvwrljccjaqsk/bin/mpiexec "-n" "4" "./testing_spotrf_dtd" "-N" "378" "-t" "19" "-x" "-v=5"
287: Working Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
287: Test timeout computed to be: 1500
287: W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
287: i@00002 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287: Location (PCI Bus/Device/Domain): 6:0.0
287: SM : 80
287: Frequency (GHz) : 1.530000
287: peak Tflop/s : 7.83 fp64, 15.67 fp32, 125.34 tf32, 31.33 fp16
287: Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287: concurrency : yes
287: computeMode : 0
287: i@00002 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: i@00001 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287: Location (PCI Bus/Device/Domain): 6:0.0
287: SM : 80
287: Frequency (GHz) : 1.530000
287: peak Tflop/s : 7.83 fp64, 15.67 fp32, 125.34 tf32, 31.33 fp16
287: Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287: concurrency : yes
287: computeMode : 0
287: i@00000 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287: Location (PCI Bus/Device/Domain): 6:0.0
287: SM : 80
287: Frequency (GHz) : 1.530000
287: peak Tflop/s : 7.83 fp64, 15.67 fp32, 125.34 tf32, 31.33 fp16
287: Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287: concurrency : yes
287: computeMode : 0
287: i@00000 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: i@00001 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: i@00003 GPU Device cuda(0) : Tesla V100-SXM2-32GB [capability 7.0]
287: Location (PCI Bus/Device/Domain): 6:0.0
287: SM : 80
287: Frequency (GHz) : 1.530000
287: peak Tflop/s : 7.83 fp64, 15.67 fp32, 125.34 tf32, 31.33 fp16
287: Peak Mem Bw (GB/s) : 898.05 [Clock Rate (Ghz) 0.88 | Bus Width (bits) 4096]
287: concurrency : yes
287: computeMode : 0
287: i@00003 CPU Device: Intel(R) Xeon(R) CPU E5-2698 v4 @ 2.20GHz
287: Parsec Streams : 40
287: Frequency (GHz) : 2.20
287: Peak Tflop/s : 0.3520 fp64, 0.7040 fp32
287: [ 2] TIME(s) 0.80787 : PaRSEC initialized
287: [ 1] TIME(s) 0.81451 : PaRSEC initialized
287: #+++++ cores detected : 40
287: #+++++ nodes x cores + gpu : 4 x 40 + 1 (160+4)
287: #+++++ thread mode : THREAD_SERIALIZED
287: #+++++ P x Q : 4 x 1 (4/4)
287: #+++++ M x N x K|NRHS : 378 x 378 x 1
287: #+++++ LDA : 378
287: #+++++ MB x NB : 19 x 19
287: [ 0] TIME(s) 0.81472 : PaRSEC initialized
287: [ 3] TIME(s) 0.82741 : PaRSEC initialized
287: W@00000 /!\ PERFORMANCE MIGHT BE REDUCED /!\: Multiple PaRSEC processes on the same node may share the same physical core(s);
287: This is often unintentional, and will perform poorly.
287: Note that in managed environments (e.g., ALPS, jsrun), the launcher may set `cgroups`
287: and hide the real binding from PaRSEC; if you verified that the binding is correct,
287: this message can be silenced using the MCA argument `runtime_warn_slow_binding`.
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ warm up ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: +++ Generate matrices ... Done
287: [****] TIME(s) 0.30603 : PxQ= 4 1 NB= 19 N= 378 : 0.059063 gflops
287: ============
287: Checking the Cholesky factorization
287: -- ||A||_oo = 4.826267e+02, ||L'L-A||_oo = 8.671714e-01
287: -- ||L'L-A||_oo/(||A||_oo.N.eps) = 7.974834e+01
287: -- Factorization is suspicious !
287: ============
287: Checking the Residual of the solution
287: -- ||A||_oo = 4.826267e+02, ||X||_oo = 1.349938e-03, ||B||_oo= 5.000000e-01, ||A X - B||_oo = 1.503229e-04
287: -- ||Ax-B||_oo/((||A||_oo||x||_oo+||B||_oo).N.eps) = 5.794063e+00
287: -- Solution is CORRECT !
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 0 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2463 | 83.83 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: | Dev 1 | 475 | 16.17 | 1.82MB | 274.98KB(14.77) | 0.00 B( 0.00) | 669.82KB | 97.30KB(14.53) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2938 | 100.00 | 1.82MB | 274.98KB(14.77) | 0.00 B( 0.00) | 669.82KB | 97.30KB(14.53) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0 1
287: 0 - 97.30KB
287: 1 274.98KB -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 3 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2488 | 83.97 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: | Dev 1 | 475 | 16.03 | 1.82MB | 253.83KB(13.64) | 0.00 B( 0.00) | 669.82KB | 76.15KB(11.37) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2963 | 100.00 | 1.82MB | 253.83KB(13.64) | 0.00 B( 0.00) | 669.82KB | 76.15KB(11.37) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0 1
287: 0 - 76.15KB
287: 1 253.83KB -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 1 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2478 | 83.63 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: | Dev 1 | 485 | 16.37 | 1.86MB | 269.34KB(14.15) | 0.00 B( 0.00) | 683.93KB | 91.66KB(13.40) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2963 | 100.00 | 1.86MB | 269.34KB(14.15) | 0.00 B( 0.00) | 683.93KB | 91.66KB(13.40) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0 1
287: 0 - 91.66KB
287: 1 269.34KB -
287: +----------------------------------------------------------------------------------------------------------------------------+
287: | | | Data In | Data Out |
287: |Rank 2 | # KERNEL | % | Required | Transfered H2D(%) | Transfered D2D(%) | Required | Transfered(%) |
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: | Dev 0 | 2493 | 83.71 | 0.00 B | 0.00 B( nan) | 0.00 B( nan) | 0.00 B | 0.00 B( nan) | cpu-cores
287: | Dev 1 | 485 | 16.29 | 1.86MB | 262.29KB(13.78) | 0.00 B( 0.00) | 683.93KB | 84.61KB(12.37) | cuda(0)
287: |---------|-----------|--------|------------|-----------------------|-----------------------|------------|-------------------|
287: |All Devs | 2978 | 100.00 | 1.86MB | 262.29KB(13.78) | 0.00 B( 0.00) | 683.93KB | 84.61KB(12.37) |
287: +----------------------------------------------------------------------------------------------------------------------------+
287:
287: Full transfer matrix:
287: dst\src 0 1
287: 0 - 84.61KB
287: 1 262.29KB -
287: --------------------------------------------------------------------------
287: Primary job terminated normally, but 1 process returned
287: a non-zero exit code. Per user-direction, the job has been aborted.
287: --------------------------------------------------------------------------
287: --------------------------------------------------------------------------
287: mpiexec detected that one or more processes exited with non-zero status, thus causing
287: the job to be terminated. The first process to do so was:
287:
287: Process name: [[42679,1],1]
287: Exit code: 1
287: --------------------------------------------------------------------------
1/1 Test #287: dplasma_spotrf_dtd_mpi ...........***Failed 3.08 sec
0% tests passed, 1 tests failed out of 1
Label Time Summary:
dplasma = 3.08 sec*proc (1 test)
mpi = 3.08 sec*proc (1 test)
Total Test time (real) = 3.09 sec
The following tests FAILED:
287 - dplasma_spotrf_dtd_mpi (Failed)
Errors while running CTest
Output from these tests are in: /home/bouteill/parsec/dplasma/build.cuda/Testing/Temporary/LastTest.log
Use "--rerun-failed --output-on-failure" to re-run the failed cases verbosely.
salloc: Relinquishing job allocation 5296
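To make the two verdicts in the logs above easier to compare, here is a minimal Python sketch that recomputes the normalized ratios from the printed norms. It assumes `eps` is the single-precision machine epsilon 2^-24 (the value LAPACK's `slamch('e')` returns), which reproduces the logged ratios. The `THRESHOLD` of 60 is an assumption: it is consistent with which checks pass and fail here, but it is not taken from the log. The helper names are hypothetical and are not part of DPLASMA.

```python
import math

# Assumption: single-precision epsilon as reported by LAPACK slamch('e').
EPS_FP32 = 2.0 ** -24

# Assumption: acceptance threshold; consistent with the verdicts in the log,
# but not stated anywhere in the output itself.
THRESHOLD = 60.0


def factorization_ratio(norm_a, norm_resid, n, eps=EPS_FP32):
    """Return ||L'L-A||_oo / (||A||_oo * N * eps)."""
    return norm_resid / (norm_a * n * eps)


def solution_ratio(norm_a, norm_x, norm_b, norm_resid, n, eps=EPS_FP32):
    """Return ||Ax-B||_oo / ((||A||_oo * ||x||_oo + ||B||_oo) * N * eps)."""
    return norm_resid / ((norm_a * norm_x + norm_b) * n * eps)


N = 378  # matrix order used by the test (-N 378)

# Norms copied from the CPU-only run (device_cuda_enabled=0).
cpu_fact = factorization_ratio(4.826267e+02, 1.256157e-04, N)
cpu_sol = solution_ratio(4.826267e+02, 1.350270e-03, 5.0e-01, 4.668881e-07, N)

# Norms copied from the GPU run (device_cuda_enabled=1).
gpu_fact = factorization_ratio(4.826267e+02, 8.671714e-01, N)
gpu_sol = solution_ratio(4.826267e+02, 1.349938e-03, 5.0e-01, 1.503229e-04, N)

for label, value in [("CPU factorization", cpu_fact), ("CPU solution", cpu_sol),
                     ("GPU factorization", gpu_fact), ("GPU solution", gpu_sol)]:
    verdict = "ok" if math.isfinite(value) and value <= THRESHOLD else "suspicious"
    print(f"{label:18s} ratio = {value:.6e} -> {verdict}")
```

With these inputs, only the GPU factorization ratio exceeds the assumed threshold. The GPU solution ratio stays below it, but it is still roughly 300 times larger than in the CPU-only run, so the "Solution is CORRECT" line in the GPU log should not be read as evidence that the run is healthy.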
Setup
416aec96 (origin/master, origin/HEAD, master) Merge pull request #109 from abouteiller/bugfix/dtd_gpu
parsec/4df0d0cb (origin/master, origin/HEAD, master) Merge pull request #631 from abouteiller/cleanup/cosmetics