Hi Frederick,
Unfortunately, it seems I can't access the cscs infrastructure anymore.
Since this is not using NCCL or gpu-aware MPI, this part should not have changed since the last working version, so I am really puzzled by this.
Maybe @teonnik or @simonpintarelli could have a look?
As @simonpintarelli also suggested, let's make sure it doesn't run out of GPU memory by setting:
export COSMA_GPU_MAX_TILE_M=2000
export COSMA_GPU_MAX_TILE_N=2000
export COSMA_GPU_MAX_TILE_K=2000
By default these values are 5k, so you can try reducing them.
However, the GPU memory footprint has not changed since the last version, so this should not be a problem.
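For orientation, here is a rough estimate of what those tile sizes translate to in device memory. This is only a back-of-the-envelope sketch assuming one double-precision buffer per operand tile (M×K, K×N and M×N); the actual allocation scheme in COSMA/Tiled-MM is more involved, so treat it as an order-of-magnitude guide only:

```cpp
#include <cstdio>

int main() {
    // Hypothetical estimate: one tile buffer per operand
    // (A: M x K, B: K x N, C: M x N), 8 bytes per double.
    auto tile_mb = [](long m, long n, long k) {
        return 8.0 * (m * k + k * n + m * n) / 1e6;
    };
    std::printf("default tiles (5000^3): ~%.0f MB\n", tile_mb(5000, 5000, 5000)); // ~600 MB
    std::printf("reduced tiles (2000^3): ~%.0f MB\n", tile_mb(2000, 2000, 2000)); // ~96 MB
    return 0;
}
```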
Well, it also fails the regtests, for which the matrix dimensions should be much smaller than 2000. For a few tests, k=0, or a process might not have any local data, depending on the distribution. Can that cause these issues on GPU only?
I can't reproduce the bug using the miniapps (test.pdgemm, test.multiply).
@fstein93 Do you know what the matrix sizes in the cp2k regtest are?
I am not familiar with all of them, but I can provide more details on the following cases:
In general, only the GPU versions are affected, not the CPU version. The failing tests are mostly the same, but not all of them fail everywhere; for instance, QS/regtest-ri-rpa/RI_RPA_CH3.inp fails on Daint but not on CUDA Pascal.
I hope that already provides a few hints.
Meanwhile, there are some more results for larger benchmarks on Daint on GPU (see here). The RPA benchmark is a larger version of the QS/regtest-ri-rpa test set with n=m=4352 and k=196,608. Similar matrix-matrix multiplies occur within the MP2 code where the respective regtests run smoothly (without lr).
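For scale, the global operands of that RPA multiply are fairly large in double precision. A quick size estimate only; each rank of course holds just its block-cyclic slice, and COSMA moves tiles rather than whole matrices:

```cpp
#include <cstdio>

int main() {
    const long long m = 4352, n = 4352, k = 196608;
    const double GiB = 1024.0 * 1024.0 * 1024.0;
    // Global matrix sizes at 8 bytes per element (double precision).
    std::printf("A (m x k): %.1f GiB\n", 8.0 * m * k / GiB); // ~6.4 GiB
    std::printf("B (k x n): %.1f GiB\n", 8.0 * k * n / GiB); // ~6.4 GiB
    std::printf("C (m x n): %.1f GiB\n", 8.0 * m * n / GiB); // ~0.1 GiB
    return 0;
}
```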
Thank you Frederick for more details and thanks Simon for chiming in!
@fstein93 regarding your questions above:
Simon has just tried the test cases you mentioned on Piz Daint P100 and couldn't reproduce the error. To make sure that we have the same arguments, it would be really helpful if you could:
- uncomment these lines:
- rerun one of the problematic RPA tests. When those lines are uncommented, all the parameters of each pdgemm call will be written to the output.
- send us the output file
Then Simon could rerun it using the miniapp on Daint. Would that be possible?
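Just to illustrate the kind of information this produces (a hypothetical sketch of such a parameter dump, not the actual commented-out lines in COSMA), something along these lines is what is needed to replay a single call in isolation:

```cpp
#include <cstdio>

// Hypothetical helper: dump the arguments of one pdgemm call so the same case
// can be rerun in isolation.  The descriptor layout follows ScaLAPACK:
// desc[2]=M, desc[3]=N, desc[4]=MB, desc[5]=NB, desc[6]=RSRC, desc[7]=CSRC, desc[8]=LLD.
void dump_pdgemm_args(char transa, char transb, int m, int n, int k,
                      double alpha, double beta,
                      int ia, int ja, const int* desca) {
    std::printf("pdgemm: trans=%c%c m=%d n=%d k=%d alpha=%g beta=%g\n",
                transa, transb, m, n, k, alpha, beta);
    std::printf("  A: ia=%d ja=%d M=%d N=%d MB=%d NB=%d RSRC=%d CSRC=%d LLD=%d\n",
                ia, ja, desca[2], desca[3], desca[4], desca[5],
                desca[6], desca[7], desca[8]);
    // Analogous arguments and prints for B (ib, jb, descb) and C (ic, jc, descc)
    // are omitted for brevity.
}

int main() {
    // Made-up example values, purely for illustration.
    const int desca[9] = {1, 0, 1000, 1000, 128, 128, 0, 0, 500};
    dump_pdgemm_args('N', 'N', 1000, 1000, 1000, 1.0, 0.0, 1, 1, desca);
    return 0;
}
```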
It seems the crashes happen because cudaMemcpy2DAsync is called with invalid arguments.
I added a print statement at tiled_mm.cpp:96 and then ran QS/regtest-sos-mp2-lr/H2O-sos-mp2-lr.inp:
dpitch: 184 spitch: 184 width: 184 height: 23   (repeated 26 times)
dpitch: 664 spitch: 664 width: 664 height: 83   (repeated 2 times)
dpitch: 664 spitch: 664 width: 664 height: 77   (repeated 2 times)
(each line appears twice because the test ran with two MPI ranks)
Looking at the docs, it seems there exist multiple ways to upset cudaMemcpy2DAsync.
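For reference, the docs list cudaErrorInvalidValue, cudaErrorInvalidPitchValue and cudaErrorInvalidMemcpyDirection as possible failures; in particular, the width (in bytes) must not exceed either dpitch or spitch, and the pointers must be valid for the chosen cudaMemcpyKind. A minimal checked wrapper (just a sketch, not Tiled-MM's actual code) that makes those preconditions explicit:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Sketch of a checked 2D async copy -- not Tiled-MM's actual code.  The checks
// mirror the documented requirements of cudaMemcpy2DAsync.
cudaError_t checked_memcpy2d_async(void* dst, size_t dpitch,
                                   const void* src, size_t spitch,
                                   size_t width, size_t height,
                                   cudaMemcpyKind kind, cudaStream_t stream) {
    // width is given in bytes and must not exceed either pitch.
    if (width > dpitch || width > spitch)
        std::fprintf(stderr, "bad pitch: width=%zu dpitch=%zu spitch=%zu\n",
                     width, dpitch, spitch);
    // Null pointers are only acceptable for an empty copy.
    if ((dst == nullptr || src == nullptr) && width != 0 && height != 0)
        std::fprintf(stderr, "null pointer for a non-empty copy\n");

    cudaError_t err = cudaMemcpy2DAsync(dst, dpitch, src, spitch,
                                        width, height, kind, stream);
    if (err != cudaSuccess)
        std::fprintf(stderr, "cudaMemcpy2DAsync failed: %s\n",
                     cudaGetErrorString(err));
    return err;
}
```

Note that in the dump above width equals both pitches, which is legal on its own, so the offending argument is presumably one of the values that is not printed (the pointers, the kind, or the stream).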
Thanks @oschuett for debugging it!
Would it be possible to uncomment those 4 lines from this comment and rerun it? Then we would have all the pdgemm parameters and could run this in isolation.
Voilà: H2O-sos-mp2-lr.txt
@oschuett Thanks Ole for the output! In the latest commit I have now added the test cases from your output with exactly the same parameters, so that Simon can run them in isolation.
However, a few things from your file caught my attention:
Did you link cp2k to the cosma_prefixed_pxgemm library:
https://github.com/eth-cscs/COSMA/blob/b5ba79e86c0c2530eafde963a1a77b6f797f27c8/src/cosma/CMakeLists.txt#L102
or to the cosma_pxgemm library:
https://github.com/eth-cscs/COSMA/blob/b5ba79e86c0c2530eafde963a1a77b6f797f27c8/src/cosma/CMakeLists.txt#L79
The difference is that cosma_prefixed_pxgemm only implements the ScaLAPACK routines with the "cosma_" prefix, i.e. cosma_pdgemm, cosma_psgemm and the complex versions. On the other hand, cosma_pxgemm implements both the prefixed versions and additionally overrides the default ScaLAPACK routines.
Since cp2k calls the cosma_pdgemm and cosma_psgemm routines anyway, I think you should link to cosma_prefixed_pxgemm instead of cosma_pxgemm. This way, COSMA will not be used in the Cholesky decomposition.
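To make the difference concrete with a toy example (hypothetical symbol names, not COSMA's actual code): the prefixed library only adds new entry points that callers opt into explicitly, while the non-prefixed one also re-defines the standard ScaLAPACK symbol, so every existing call site gets redirected at link time.

```cpp
#include <cstdio>

// Analogue of cosma_prefixed_pxgemm: only a prefixed entry point is provided,
// so callers (like cp2k with cosma_pdgemm/cosma_psgemm) must opt in explicitly.
extern "C" void cosma_pdgemm_demo() { std::puts("COSMA-backed multiply"); }

// Analogue of what cosma_pxgemm does in addition: it also defines the standard
// ScaLAPACK symbol itself, so every pre-existing ScaLAPACK call in the
// application is redirected to COSMA at link time -- including calls reached
// from code paths such as the Cholesky decomposition.
extern "C" void pdgemm_demo() { cosma_pdgemm_demo(); }

int main() {
    cosma_pdgemm_demo();  // explicit opt-in via the prefixed name
    pdgemm_demo();        // implicit redirection via the overriding symbol
    return 0;
}
```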
All errors occur outside of Cholesky decompositions. In some cases (like lr), a Cholesky decomposition was carried out in advance, whereas in other cases (like RPA), it follows a Cholesky decomposition. The library test does not perform any kind of Cholesky decomposition. Interestingly, the other library tests for PDGEMM do not fail (see here).
Thanks @fstein93 for clarifications! It seems I misunderstood the output then.
Hope Simon will be able to reproduce it by running the newly added tests.
Btw, do we know if export CUDA_LAUNCH_BLOCKING=1 resolves the issue?
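For context (a general CUDA debugging note, not anything COSMA-specific): CUDA_LAUNCH_BLOCKING=1 makes kernel launches synchronous, so an error is reported at the call that actually caused it rather than at some later API call. The same localization can be done by hand, roughly like this:

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Generic debugging helper (illustrative, not COSMA code): synchronize and
// query the error state right after a suspicious asynchronous call, so the
// failure is attributed to the right place instead of a later API call.
static void check_here(const char* where) {
    cudaError_t err = cudaDeviceSynchronize();
    if (err == cudaSuccess) err = cudaGetLastError();
    if (err != cudaSuccess)
        std::fprintf(stderr, "%s: %s\n", where, cudaGetErrorString(err));
}

int main() {
    // e.g. right after a cudaMemcpy2DAsync or a kernel launch:
    check_here("after async copy");
    return 0;
}
```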
Did you link cp2k to the cosma_prefixed_pxgemm library?
You can get the linker line from the regtest report:
LIBS = -lsirius -lcusolver -lspla -lspfft -lsymspg -lhdf5 -lhdf5_hl -lz -lgsl -lelpa_openmp -lcosma_prefixed_pxgemm -lcosma -lcosta -lTiled-MM -lscalapack -lxsmmf -lxsmm -ldl -lpthread -lxcf03 -lxc -lint2 -lfftw3_mpi -lfftw3 -lfftw3_omp -lmpifort -lmpicxx -lmpi -lopenblas -lvori -lstdc++ -lstdc++ -lcudart -lnvrtc -lcuda -lcufft -lcublas -lrt
Simon managed to reproduce this error within COSMA, we are working on it!
@oschuett just a quick question: after you added those print statements, what is in your line:
at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:475
I want to see whether the error occurred within round_robin or within round_robin_without_copy_c.
@oschuett @fstein93 are we sure the same tests were passing with the previous COSMA version, or are these tests new?
@kabicm the tests passed with the previous version. There is only one which I added recently.
@oschuett @fstein93 the latest master now passes the failing tests from cp2k. Can you try the latest master, or do I have to make a new release so that you can test it?
In general, we use only official releases of all libraries to ensure that users get properly working libraries. That is also how we proceed with DBCSR. Anyway, the fix is probably also relevant for your user base.
You can open a draft pull request in which install_cosma.sh uses your master branch. Then we can trigger the CI tests.
We would surely make a new release once we are sure this fixes the failing tests.
It seems the tests are now passing, at least on Pascal. So, I guess we can make a new release now. I will just make a few smaller CMake modifications and then release.
The new version COSMA-v2.6.1 is now released. Let us know if there are any issues!
I will close this issue now. Feel free to reopen it if there are any problems with the new version COSMA-v2.6.1.
Dear COSMA developers,
I am one of the CP2K developers and have recently upgraded our scripts to use COSMA 2.6.0 (see discussion cp2k/cp2k#2198). After the upgrade, all of our GPU regtests fail (see https://dashboard.cp2k.org/, testers CRAY-XC50-gnu, Performance CUDA Volta, CUDA Pascal). Our HIP tester does not make use of COSMA's GPU backend yet.
The typical backtrace looks as follows:
error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what():  GPU ERROR
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
0 0x7f5d6f019d21 in ???
1 0x7f5d6f018ef5 in ???
2 0x7f5d6ec7208f in ???
3 0x7f5d6ec7200b in ???
4 0x7f5d6ec51858 in ???
5 0x7f5d8688b910 in ???
6 0x7f5d8689738b in ???
7 0x7f5d868973f6 in ???
8 0x7f5d868976a8 in ???
9 0x55652e0befd9 in check_runtime_status
10 0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EEPS2_NS_10tile_coordERNS_13device_streamE
11 0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EERNS_13device_bufferIS2_EENS_10tile_coordERNS_11gpu_contextEi
12 0x556531739d92 in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
13 0x55653173ac52 in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_bb
14 0x556531702744 in _ZN5cosma14local_multiplyIdEEvPNS_13cosma_contextIT_EEPS2_S5_S5_iiiS2_S2_b
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/local_multiply.cpp:168
15 0x5565316e8612 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2S2
16 0x5565316e801c in _ZN5cosma8parallelIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2S2
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:867
17 0x5565316e87e0 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2S2
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:408
18 0x5565316e8a7a in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEiS2S2
19 0x5565316c48a3 in _ZN5cosma6pxgemmIdEEvcciiiT_PKS1_iiPKiS3_iiS5_S1_PS1iiS5
at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/cosma_pxgemm.cpp:350
Do you have an idea what causes this error? I am happy to share further information if required.