eth-cscs / COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm
BSD 3-Clause "New" or "Revised" License

Crashes with the latest COSMA release #115

Closed fstein93 closed 2 years ago

fstein93 commented 2 years ago

Dear COSMA developers,

I am one of the CP2K developers and have recently upgraded our scripts to use COSMA 2.6.0 (see discussion cp2k/cp2k#2198). After the upgrade, all of our GPU regtests fail (see https://dashboard.cp2k.org/, testers CRAY-XC50-gnu, Performance CUDA Volta, CUDA Pascal). Our HIP tester does not make use of COSMA's GPU backend yet.

The typical backtrace looks as follows:

error: GPU API call : invalid argument
terminate called after throwing an instance of 'std::runtime_error'
  what(): GPU ERROR

Program received signal SIGABRT: Process abort signal.

Backtrace for this error:

#0  0x7f5d6f019d21 in ???
#1  0x7f5d6f018ef5 in ???
#2  0x7f5d6ec7208f in ???
#3  0x7f5d6ec7200b in ???
#4  0x7f5d6ec51858 in ???
#5  0x7f5d8688b910 in ???
#6  0x7f5d8689738b in ???
#7  0x7f5d868973f6 in ???
#8  0x7f5d868976a8 in ???
#9  0x55652e0befd9 in check_runtime_status
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/util.hpp:17
#10  0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EEPS2_NS_10tile_coordERNS_13device_streamE
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:46
#11  0x5565317398d1 in _ZN3gpu25copy_tile_to_device_asyncIdEEvRNS_12tiled_matrixIT_EERNS_13device_bufferIS2_EENS_10tile_coordERNS_11gpu_contextEi
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:52
#12  0x556531739d92 in _ZN3gpu11round_robinIdEEvRNS_12tiled_matrixIT_EES4_S4_RNS_13device_bufferIS2_EES7_S7_iiiS2_S2_RNS_9mm_handleIS2_EE
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:248
#13  0x55653173ac52 in _ZN3gpu4gemmIdEEvRNS_9mm_handleIT_EEPS2_S5_S5_iiiS2_S2_bb
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:468
#14  0x556531702744 in _ZN5cosma14local_multiplyIdEEvPNS_13cosma_contextIT_EEPS2_S5_S5_iiiS2_S2_b
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/local_multiply.cpp:168
#15  0x5565316e8612 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2S2
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:381
#16  0x5565316e801c in _ZN5cosma8parallelIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2S2
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:867
#17  0x5565316e87e0 in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RNS_8IntervalES9_S9_S9_mRKNS_8StrategyEPNS_12communicatorES2S2
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:408
#18  0x5565316e8a7a in _ZN5cosma8multiplyIdEEvPNS_13cosma_contextIT_EERNS_11CosmaMatrixIS2_EES7_S7_RKNS_8StrategyEiS2S2
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/multiply.cpp:283
#19  0x5565316c48a3 in _ZN5cosma6pxgemmIdEEvcciiiT_PKS1_iiPKiS3_iiS5_S1_PS1iiS5
    at /opt/cp2k-toolchain/build/COSMA-v2.6.0/src/cosma/cosma_pxgemm.cpp:350

Do you have an idea what causes this error? I am happy to share further information if required.

kabicm commented 2 years ago

Hi Frederick,

Unfortunately, it seems I can't access the cscs infrastructure anymore.

Since this is not using NCCL or GPU-aware MPI, this part should not have changed since the last working version, so I am really puzzled by this.

Maybe @teonnik or @simonpintarelli could have a look?

kabicm commented 2 years ago

As @simonpintarelli also suggested, let's make sure it doesn't run out of GPU memory by setting:

export COSMA_GPU_MAX_TILE_M=2000
export COSMA_GPU_MAX_TILE_N=2000
export COSMA_GPU_MAX_TILE_K=2000

By default, these values are 5000, so you can try reducing them.

However, the GPU memory footprint has not changed since the last version, so this should not be a problem.
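
As a rough, assumption-laden sketch of what those tile sizes mean in device memory (the stream count and buffering factors below are illustrative guesses, not values taken from Tiled-MM):

    #include <cstddef>
    #include <cstdio>

    int main() {
        // Defaults mentioned above; overridable via COSMA_GPU_MAX_TILE_{M,N,K}.
        const std::size_t tile_m = 5000, tile_n = 5000, tile_k = 5000;
        const std::size_t n_streams = 2;   // assumption
        const std::size_t buffering = 2;   // assumption: double-buffered A and B tiles
        const std::size_t bytes = sizeof(double) * n_streams *
            (buffering * (tile_m * tile_k + tile_k * tile_n) + tile_m * tile_n);
        std::printf("approx. device buffer footprint: %.2f GiB\n",
                    bytes / (1024.0 * 1024.0 * 1024.0));
        return 0;
    }

With those guessed factors this lands in the low single-digit GiB range, which is why reducing the tile sizes is the first thing to try when an out-of-memory situation is suspected.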

fstein93 commented 2 years ago

Well, the regtests for which the matrix dimensions should be much smaller than 2000 also fail. For a few tests, k=0, or a process might not have any local data, depending on the distribution. Can that cause issues on the GPU only?

simonpintarelli commented 2 years ago

I can't reproduce the bug using the miniapps (test.pdgemm, test.multiply). @fstein93 Do you know what the matrix sizes in the cp2k regtests are?

fstein93 commented 2 years ago

I am not familiar with all of them. I can provide more details in the following cases:

  1. For QS/regtest-ri-rpa, it is n=m=83, k=76 (H2O); n=m=14, k=0 (!) or k=22 (H); and n=m=97, k=78 or k=104 (CH3).
  2. For the lr tests, I will do some checks tomorrow, because the sizes n=m depend on the numerics there.
  3. For QS/regtest-gw/G0W0_H2O_PBE_periodic.inp, it is probably n=m=83, k=148.
  4. For LIBTEST/test_cp_fm_gemm_01.inp, check the input file and the source code.

In general, only the GPU versions are affected, not the CPU version. The failing tests are mostly the same, but not all of them fail everywhere; for instance, QS/regtest-ri-rpa/RI_RPA_CH3.inp fails on Daint but not on CUDA Pascal.

I hope that already provides a few hints.

fstein93 commented 2 years ago

Meanwhile, there are some more results for larger benchmarks on Daint on GPU (see here). The RPA benchmark is a larger version of the QS/regtest-ri-rpa test set with n=m=4352 and k=196,608. Similar matrix-matrix multiplications occur within the MP2 code, where the respective regtests run smoothly (without lr).

kabicm commented 2 years ago

Thank you Frederick for more details and thanks Simon for chiming in!

@fstein93 regarding your questions above:

Simon has just tried the test cases you mentioned on Piz Daint P100 and couldn't reproduce the error. To make sure that we have the same arguments, it would be really helpful if you could:

Then Simon could rerun it using the miniapp on Daint. Would that be possible?

oschuett commented 2 years ago

It seems the crashes happen because cudaMemcpy2DAsync is called with invalid arguments.

I added a print statement at tiled_mm.cpp:96 and then ran QS/regtest-sos-mp2-lr/H2O-sos-mp2-lr.inp:

dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 184 spitch: 184 width: 184 height: 23
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 83
dpitch: 664 spitch: 664 width: 664 height: 77  <-- each line appears twice because test ran with two mpi ranks
dpitch: 664 spitch: 664 width: 664 height: 77

Looking at the docs, it seems there exist multiple ways to upset cudaMemcpy2DAsync.
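
As a reference point, here is a minimal sketch of such a guarded copy with the same print (the names are illustrative and not the actual Tiled-MM code; the explicit check is the documented requirement that width must not exceed dpitch or spitch):

    #include <cuda_runtime_api.h>
    #include <cstddef>
    #include <cstdio>
    #include <stdexcept>

    // Hypothetical wrapper around the 2D copy: dpitch, spitch, and width are in
    // bytes, height is in rows, matching the debug output above.
    void checked_copy_2d_async(void* dst, std::size_t dpitch,
                               const void* src, std::size_t spitch,
                               std::size_t width, std::size_t height,
                               cudaStream_t stream) {
        std::printf("dpitch: %zu spitch: %zu width: %zu height: %zu\n",
                    dpitch, spitch, width, height);
        // Documented invalid-argument case: width larger than either pitch.
        if (width > dpitch || width > spitch)
            throw std::runtime_error("2D copy: width exceeds a pitch");
        cudaError_t status = cudaMemcpy2DAsync(dst, dpitch, src, spitch,
                                               width, height,
                                               cudaMemcpyHostToDevice, stream);
        if (status != cudaSuccess)
            throw std::runtime_error(cudaGetErrorString(status));
    }

In the output above, width equals both pitches, so if that check passes, the remaining candidates from the documentation are the pointer arguments, the copy direction, or sizes that overrun the underlying allocations.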

kabicm commented 2 years ago

Thanks @oschuett for debugging it!

Would it be possible to uncomment those 4 lines from this comment and rerun it? Then we would have all the pdgemm parameters and could run this in isolation.

oschuett commented 2 years ago

Voilà: H2O-sos-mp2-lr.txt

kabicm commented 2 years ago

@oschuett Thanks Ole for the output! In the latest commit, I added the test cases from your output with exactly the same parameters, so that Simon can now run them in isolation.

However, a few things from your file caught my attention:

  1. It seems the error happens within the Cholesky decomposition?
  2. Did you link cp2k to the cosma_prefixed_pxgemm library (https://github.com/eth-cscs/COSMA/blob/b5ba79e86c0c2530eafde963a1a77b6f797f27c8/src/cosma/CMakeLists.txt#L102) or to the cosma_pxgemm library (https://github.com/eth-cscs/COSMA/blob/b5ba79e86c0c2530eafde963a1a77b6f797f27c8/src/cosma/CMakeLists.txt#L79)?

The difference is that cosma_prefixed_pxgemm only implements the ScaLAPACK routines with the "cosma_" prefix, i.e. cosma_pdgemm, cosma_psgemm, and the complex versions. cosma_pxgemm, on the other hand, implements the prefixed versions and additionally overrides the default ScaLAPACK routines.

Since cp2k calls the cosma_pdgemm and cosma_psgemm routines anyway, I think you should link to cosma_prefixed_pxgemm instead of cosma_pxgemm. This way, COSMA will not be used in the Cholesky decomposition.
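
To make the symbol-level difference concrete, here is a small sketch (the pdgemm_ declaration follows the standard netlib ScaLAPACK convention; the COSMA targets are only referred to in the comments, nothing COSMA-specific is declared here):

    // Any ScaLAPACK-style GEMM call in the application resolves against this symbol.
    extern "C" void pdgemm_(const char* trans_a, const char* trans_b,
                            const int* m, const int* n, const int* k,
                            const double* alpha,
                            const double* a, const int* ia, const int* ja,
                            const int* desc_a,
                            const double* b, const int* ib, const int* jb,
                            const int* desc_b,
                            const double* beta,
                            double* c, const int* ic, const int* jc,
                            const int* desc_c);

    // Link with -lcosma_prefixed_pxgemm: pdgemm_ still comes from ScaLAPACK, and
    // only explicit calls to the cosma_-prefixed routines go through COSMA.
    //
    // Link with -lcosma_pxgemm: the library additionally provides pdgemm_ itself,
    // so every pdgemm_ call made anywhere in the binary is intercepted by COSMA.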

fstein93 commented 2 years ago

All errors occur outside of Cholesky decompositions. In some cases (like lr), a Cholesky decomposition was carried out in advance, whereas in other cases (like RPA), it follows a Cholesky decomposition. The library test does not perform any kind of Cholesky decomposition. Interestingly, the other library tests for PDGEMM do not fail (see here).

kabicm commented 2 years ago

Thanks @fstein93 for clarifications! It seems I misunderstood the output then.

Hope Simon will be able to reproduce it by running the newly added tests.

Btw, do we know if export CUDA_LAUNCH_BLOCKING=1 resolves the issue?

oschuett commented 2 years ago

Did you link cp2k to cosma_prefixed_pxgemm library:

You can get the linker line from the regtest report:

LIBS        = -lsirius -lcusolver -lspla -lspfft -lsymspg -lhdf5 -lhdf5_hl -lz -lgsl -lelpa_openmp -lcosma_prefixed_pxgemm -lcosma -lcosta -lTiled-MM -lscalapack -lxsmmf -lxsmm -ldl -lpthread -lxcf03 -lxc -lint2 -lfftw3_mpi -lfftw3 -lfftw3_omp  -lmpifort -lmpicxx -lmpi  -lopenblas -lvori -lstdc++ -lstdc++ -lcudart -lnvrtc -lcuda -lcufft -lcublas -lrt

kabicm commented 2 years ago

Simon managed to reproduce this error within COSMA; we are working on it!

kabicm commented 2 years ago

@oschuett just a quick question: after you added those print statements, what is on your line at /home/ole/git/cp2k/tools/toolchain/build/COSMA-v2.6.0/libs/Tiled-MM/src/Tiled-MM/tiled_mm.cpp:475?

I want to see whether the error occurred within round_robin or within round_robin_without_copy_c.

kabicm commented 2 years ago

@oschuett @fstein93 are we sure the same tests were passing with the previous COSMA version, or are these tests new?

fstein93 commented 2 years ago

@kabicm the tests passed with the previous version. There is only one which I added recently.

kabicm commented 2 years ago

@oschuett @fstein93 the latest master now passes the failing tests from cp2k. Can you try the latest master, or do I have to make a new release so that you can test it?

fstein93 commented 2 years ago

In general, we use only official releases of all libraries, to ensure properly working libraries for the users. That is also how we proceed with DBCSR. Anyway, the fix is probably also relevant for your user base.

oschuett commented 2 years ago

You can open a draft pull request in which you have install_cosma.sh use your master branch. Then we can trigger the CI tests.

kabicm commented 2 years ago

We would surely make a new release once we are sure this fixes the failing tests.

kabicm commented 2 years ago

It seems the tests are now passing, at least on Pascal, so I guess we can make a new release now. I will just make a few smaller CMake modifications and then release.

kabicm commented 2 years ago

The new version COSMA-v2.6.1 is now released. Let us know if there are any issues!

kabicm commented 2 years ago

I will close this issue now. Feel free to reopen it if there are any problems with the new version COSMA-v2.6.1.