DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.
GEQRF (and derivatives) use too many workspaces on GPU #110
Describe the bug
GEQRF (and derivatives, like LQ, SORMQR, etc.) use more than the hardcoded 2 GPU workspaces.
Important note
After #114 this error will not manifest in normal ctest/CI (because the test is forced to run on CPU only), but it can still be reproduced by hand. The fix PR should add a specific QR+GPU test to cover this case explicitly.
To Reproduce
Ctest on Leconte:
SLURM_TIMELIMIT=2 PARSEC_MCA_device_cuda_memory_use=20 OMPI_MCA_rmaps_base_oversubscribe=true salloc -N1 -wleconte ctest --rerun-failed

125/437 Test: dplasma_sgeqrf_shm
Command: "/usr/bin/srun" "./testing_sgeqrf" "-M" "487" "-N" "283" "-K" "97" "-t" "56" "-x" "-v=5"
Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
"dplasma_sgeqrf_shm" start time: Jan 31 19:38 EST
Output:
----------------------------------------------------------
srun: Job 4994 step creation temporarily disabled, retrying (Requested nodes are busy)
srun: Step created for job 4994
[1706747884.458034] [leconte:2566339:0] ucp_context.c:1081 UCX WARN network device 'mlx5_0:1' is not available, please use one or more of: 'docker0'(tcp), 'enp1s0f0'(tcp), 'enp1s0f1'(tcp), 'lo'(tcp)
W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
#+++++ cores detected : 40
#+++++ nodes x cores + gpu : 1 x 40 + 0 (40+0)
#+++++ thread mode : THREAD_SERIALIZED
#+++++ P x Q : 1 x 1 (1/1)
#+++++ M x N x K|NRHS : 487 x 283 x 97
#+++++ LDA , LDB : 487 , 487
#+++++ MB x NB , IB : 56 x 56 , 32
#+++++ KP x KQ : 4 x 1
x@00000 parsec_device_pop_workspace: user requested more than 2 GPU workspaces which is the current hard-coded limit per GPU stream
@parsec_device_pop_workspace:206 (leconte:2566339)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -6.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
slurmstepd: error: *** STEP 4994.4 ON leconte CANCELLED AT 2024-02-01T00:38:06 ***
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
srun: error: leconte: task 0: Exited with exit code 250
<end of output>
Test time = 3.17 sec
----------------------------------------------------------
Test Failed.
Proposed fix
Deprecate workspaces in parsec
Use the gpu info handles to provide more than 2 workspaces per stream
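The second item above is a direction, not a finished interface. As a rough illustration only, the sketch below shows what a growable per-stream workspace pool could look like; every name in it (stream_ws_pool_t, ws_pool_pop, ws_pool_push, dev_alloc) is invented for this example and is not an existing PaRSEC or DPLASMA symbol. The idea is simply to replace the fixed two-slot array behind parsec_device_pop_workspace with a pool that an info handle attached to each GPU stream could own and grow on demand.

#include <stdlib.h>

/* Hypothetical per-stream workspace pool (illustrative sketch, not PaRSEC code). */
typedef struct stream_ws_pool_s {
    void **ws;    /* device buffers handed out by pop */
    int   *busy;  /* 1 while a task holds the corresponding buffer */
    int    count; /* number of buffers currently allocated */
} stream_ws_pool_t;

/* Return a free workspace, growing the pool instead of aborting when more
 * than 2 are requested. Error handling and per-size bookkeeping are omitted
 * for brevity; dev_alloc stands in for the device allocator. */
static void *ws_pool_pop(stream_ws_pool_t *p, size_t size,
                         void *(*dev_alloc)(size_t))
{
    for (int i = 0; i < p->count; i++) {
        if (!p->busy[i]) { p->busy[i] = 1; return p->ws[i]; }
    }
    p->ws   = realloc(p->ws,   (p->count + 1) * sizeof(void *));
    p->busy = realloc(p->busy, (p->count + 1) * sizeof(int));
    p->ws[p->count]   = dev_alloc(size);
    p->busy[p->count] = 1;
    return p->ws[p->count++];
}

/* Mark a workspace as available again once the task that popped it completes. */
static void ws_pool_push(stream_ws_pool_t *p, void *w)
{
    for (int i = 0; i < p->count; i++) {
        if (p->ws[i] == w) { p->busy[i] = 0; return; }
    }
}

Whether such a pool would sit behind the existing parsec_device_pop_workspace entry point or behind a new info-based interface is a design decision for the fix PR.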
Environment (please complete the following information):
Dplasma: 416aec96 (origin/master, origin/HEAD, master) Merge pull request #109 from abouteiller/bugfix/dtd_gpu Aurelien Bouteiller 22 hours ago
Parsec: adabbd4d (origin/master, origin/HEAD, master) Merge pull request #620 from bosilca/fix/osx_warning Thomas Herault 7 days ago
Configure: ../configure --prefix=/home/bouteill/parsec/dplasma/build.cuda --with-cuda --without-hip --enable-debug=noisier\,paranoid