ICLDisco / dplasma

DPLASMA is a highly optimized, accelerator-aware implementation of a dense linear algebra package for distributed heterogeneous systems. It is designed to deliver sustained performance on distributed systems where each node features multiple sockets of multicore processors and, if available, accelerators, using the PaRSEC runtime as a backend.

GEQRF (and derivatives) use too many workspaces on GPU #110

Open abouteiller opened 8 months ago

abouteiller commented 8 months ago

Describe the bug

GEQRF and its derivatives (LQ, SORMQR, etc.) request more than the 2 GPU workspaces that are hard-coded as the limit per GPU stream, so the run aborts in parsec_device_pop_workspace.
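
To illustrate the failure mode, here is a minimal sketch of a per-stream workspace pool with a fixed capacity. All names (`stream_workspace_pool_t`, `pool_pop_workspace`, `pool_push_workspace`, `MAX_WORKSPACES_PER_STREAM`) are hypothetical and chosen for illustration; this is not PaRSEC's actual data structure or API. The only detail taken from this report is the limit of 2 workspaces per GPU stream.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical names throughout; this does not reproduce PaRSEC's code. */
#define MAX_WORKSPACES_PER_STREAM 2 /* mirrors the limit quoted in the error message */

typedef struct {
    void *slots[MAX_WORKSPACES_PER_STREAM]; /* pre-allocated scratch buffers */
    int   in_use;                           /* workspaces currently handed out */
} stream_workspace_pool_t;

/* Hand out the next free workspace, or NULL once the fixed limit is reached. */
void *pool_pop_workspace(stream_workspace_pool_t *pool)
{
    if (pool->in_use >= MAX_WORKSPACES_PER_STREAM) {
        fprintf(stderr, "requested more than %d workspaces on this stream\n",
                MAX_WORKSPACES_PER_STREAM);
        return NULL;
    }
    return pool->slots[pool->in_use++];
}

/* Give back the most recently popped workspace. */
void pool_push_workspace(stream_workspace_pool_t *pool)
{
    assert(pool->in_use > 0);
    pool->in_use--;
}
```

In this sketch, a task body that pops a third workspace before pushing any back hits the limit. A fix could raise the limit, make it configurable, or reduce the number of workspaces the kernel holds at once; which of these suits DPLASMA is not settled in this report.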

Important note

After #114, this error will not manifest in normal ctest/CI runs (because the test is forced to run on CPU only), but it can still be reproduced by hand. The fix PR should add a dedicated QR+GPU test that explicitly covers this case.

To Reproduce

Run ctest on Leconte:

    SLURM_TIMELIMIT=2 PARSEC_MCA_device_cuda_memory_use=20 OMPI_MCA_rmaps_base_oversubscribe=true salloc -N1 -wleconte ctest --rerun-failed

125/437 Test: dplasma_sgeqrf_shm
    Command: "/usr/bin/srun" "./testing_sgeqrf" "-M" "487" "-N" "283" "-K" "97" "-t" "56" "-x" "-v=5"
    Directory: /home/bouteill/parsec/dplasma/build.cuda/tests
    "dplasma_sgeqrf_shm" start time: Jan 31 19:38 EST
    Output:
    ----------------------------------------------------------
    srun: Job 4994 step creation temporarily disabled, retrying (Requested nodes are busy)
    srun: Step created for job 4994
    [1706747884.458034] [leconte:2566339:0]     ucp_context.c:1081 UCX  WARN  network device 'mlx5_0:1' is not available, please use one or more of: 'docker0'(tcp), 'enp1s0f0'(tcp), 'enp1s0f1'(tcp), 'lo'(tcp)
    W@00000 /!\ DEBUG LEVEL WILL PROBABLY REDUCE THE PERFORMANCE OF THIS RUN /!\.
    #+++++ cores detected       : 40
    #+++++ nodes x cores + gpu  : 1 x 40 + 0 (40+0)
    #+++++ thread mode          : THREAD_SERIALIZED
    #+++++ P x Q                : 1 x 1 (1/1)
    #+++++ M x N x K|NRHS       : 487 x 283 x 97
    #+++++ LDA , LDB            : 487 , 487
    #+++++ MB x NB , IB         : 56 x 56 , 32
    #+++++ KP x KQ              : 4 x 1
    x@00000 parsec_device_pop_workspace: user requested more than 2 GPU workspaces which is the current hard-coded limit per GPU stream
     @parsec_device_pop_workspace:206   (leconte:2566339)
    --------------------------------------------------------------------------
    MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
    with errorcode -6.

    NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
    You may or may not see output from other processes, depending on
    exactly when Open MPI kills them.
    --------------------------------------------------------------------------
    slurmstepd: error: *** STEP 4994.4 ON leconte CANCELLED AT 2024-02-01T00:38:06 ***
    srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
    srun: error: leconte: task 0: Exited with exit code 250
    <end of output>
    Test time =   3.17 sec
    ----------------------------------------------------------
    Test Failed.

Proposed fix

Environment (please complete the following information):

Currently Loaded Modulefiles:
 1) ncurses/6.4/gcc-11.3.1-6rvznd           25) berkeley-db/18.1.40/gcc-11.3.1-yl6wjj                49) libvterm/0.3.1/gcc-11.3.1-we43r4
 2) htop/3.2.2/gcc-11.3.1-xm6i3t            26) readline/8.2/gcc-11.3.1-b26lae                       50) lua-lpeg/1.0.2-1/gcc-11.3.1-6e6xv6
 3) nghttp2/1.52.0/gcc-11.3.1-yzhzx5        27) gdbm/1.23/gcc-11.3.1-6u5vme                          51) msgpack-c/3.1.1/gcc-11.3.1-pzscaq
 4) zlib/1.2.13/gcc-11.3.1-uhneca           28) perl/5.38.0/gcc-11.3.1-r63sx3                        52) lua-mpack/1.0.9/gcc-11.3.1-z26msa
 5) openssl/3.1.2/gcc-11.3.1-w3u2b2         29) git/2.41.0/gcc-11.3.1-tx4xbg                         53) tree-sitter/0.20.8/gcc-11.3.1-pgy6wn
 6) curl/8.1.2/gcc-11.3.1-dhcq4d            30) cuda/11.8.0/gcc-11.3.1-vltbfy                        54) neovim/0.9.1/gcc-11.3.1-aro6rp
 7) libmd/1.0.4/gcc-11.3.1-yl2qth           31) libpciaccess/0.17/gcc-11.3.1-qp6jxc                  55) cmake/3.26.3/gcc-11.3.1-6bgawm
 8) libbsd/0.11.7/gcc-11.3.1-rxtb5h         32) hwloc/2.9.1/gcc-11.3.1-hvnu6p                        56) ninja/1.11.1/gcc-11.3.1-qf72ao
 9) expat/2.5.0/gcc-11.3.1-z3mywy           33) numactl/2.0.14/gcc-11.3.1-x35xlq                     57) gmp/6.2.1/gcc-11.3.1-c5vz5h
10) bzip2/1.0.8/gcc-11.3.1-g7buii           34) pmix/3.2.3/gcc-11.3.1-b6ek7p                         58) libffi/3.4.4/gcc-11.3.1-suq3vd
11) libiconv/1.17/gcc-11.3.1-h5tewp         35) slurm/22.05.9/gcc-11.3.1-yqiafz                      59) sqlite/3.42.0/gcc-11.3.1-trzf26
12) xz/5.4.1/gcc-11.3.1-ybherp              36) gdrcopy/2.3/gcc-11.3.1-zm6nhb                        60) util-linux-uuid/2.38.1/gcc-11.3.1-h4vnny
13) libxml2/2.10.3/gcc-11.3.1-jijod2        37) libnl/3.3.0/gcc-11.3.1-s2rfpt                        61) python/3.10.12/gcc-11.3.1-msankb
14) pigz/2.7/gcc-11.3.1-2ysjo2              38) rdma-core/41.0/gcc-11.3.1-zlh7l5                     62) gdb/13.1/gcc-11.3.1-awps3c
15) zstd/1.5.5/gcc-11.3.1-maqtnh            39) ucx/1.14.0/gcc-11.3.1-6ffd5t                         63) libevent/2.1.12/gcc-11.3.1-iqf4hw
16) tar/1.34/gcc-11.3.1-jl543d              40) openmpi/4.1.5/gcc-11.3.1-2rgaqk                      64) tmux/3.3a/gcc-11.3.1-nt2vwg
17) gettext/0.21.1/gcc-11.3.1-sgm6rr        41) gperf/3.1/gcc-11.3.1-lq7yw2                          65) cscope/15.9/gcc-11.3.1-4duk6k
18) libunistring/1.1/gcc-11.3.1-mswbrm      42) jemalloc/5.3.0/gcc-11.3.1-gnjgyl                     66) exuberant-ctags/5.8/gcc-11.3.1-f56ide
19) libidn2/2.3.4/gcc-11.3.1-kp77oe         43) libuv/1.44.1/gcc-11.3.1-ikknoi                       67) intel-oneapi-tbb/2021.10.0/gcc-11.3.1-ptv4p2
20) krb5/1.20.1/gcc-11.3.1-hb7cxy           44) unzip/6.0/gcc-11.3.1-xm5nhk                          68) intel-oneapi-mkl/2023.2.0/gcc-11.3.1-d5uffv
21) libedit/3.1-20210216/gcc-11.3.1-b2res4  45) lua-luajit-openresty/2.1-20230410/gcc-11.3.1-lgkuf6  69) mpfr/4.2.0/gcc-11.3.1-n3mu53
22) libxcrypt/4.4.35/gcc-11.3.1-v7ot4t      46) libluv/1.44.2-1/gcc-11.3.1-pyqvat                    70) mpc/1.3.1/gcc-11.3.1-2x6jci
23) openssh/9.3p1/gcc-11.3.1-jo2led         47) unibilium/2.0.0/gcc-11.3.1-az5pko                    71) gcc/13.2.0/gcc-11.3.1-ir6jns
24) pcre2/10.42/gcc-11.3.1-bk6jhf           48) libtermkey/0.22/gcc-11.3.1-gwvd67