eth-cscs / COSMA

Distributed Communication-Optimal Matrix-Matrix Multiplication Algorithm

(Still) Excessive memory usage #118

Open fstein93 opened 2 years ago

fstein93 commented 2 years ago

Dear authors,

I am one of the CP2K developers, and I am working on our quartically scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only calculations with RPA to benchmark COSMA (test system: 128 water molecules). I am currently implementing gradients for these methods, and I know that my gradient implementation (available in the CP2K master trunk) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU partition of Daint. The code runs well with ScaLAPACK (libsci_acc). I can run my code with COSMA on a smaller system (up to 64 water molecules) and see a decent acceleration of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (like 128 water molecules) even on 1000 nodes.

A gradient calculation consists of a set of two PDGEMM calls with the following global sizes in the case of 128 H2O molecules (a rough footprint estimate is sketched after the list):

  1. n=m=17,408 and k=3,473,408 (also required for energy-only calculations)
  2. n=3,473,408 and m=k=17,408 (not required for energy-only calculations)
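For reference, here is a back-of-the-envelope estimate of the global matrix footprints for these two calls (a minimal sketch, assuming double-precision elements and ignoring all ScaLAPACK/COSMA work buffers):

```cpp
// Rough footprint of the global A, B and C matrices for the two PDGEMM calls
// (double precision assumed; library-internal buffers are not included).
#include <cstdio>

int main() {
    const long long small = 17408;     // m = n in call 1, m = k in call 2
    const long long large = 3473408;   // k in call 1,     n in call 2
    const double gib = 1024.0 * 1024.0 * 1024.0;

    const double rect = 8.0 * small * large / gib;  // 17408 x 3473408 doubles
    const double sq   = 8.0 * small * small / gib;  // 17408 x 17408   doubles

    // Call 1: A (m x k) and B (k x n) are rectangular, C (m x n) is square.
    std::printf("call 1: A ~ %.0f GiB, B ~ %.0f GiB, C ~ %.1f GiB\n", rect, rect, sq);
    // Call 2: A (m x k) is square, B (k x n) and C (m x n) are rectangular.
    std::printf("call 2: A ~ %.1f GiB, B ~ %.0f GiB, C ~ %.0f GiB\n", sq, rect, rect);
    return 0;
}
```

Each call therefore touches roughly 0.9 TiB of global matrix data for A, B and C alone, before any communication or work buffers.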

Depending on the setup, I observe out-of-memory events both on the GPU and on the CPU when COSMA is called.

My questions are:

  1. What are COSMA's memory requirements, or at least what scaling behavior should I expect?
  2. Could you add a hint that reports the actual amount of missing memory whenever COSMA is able to catch the OOM event?
  3. Could you provide a function that asks COSMA to release its buffers, so that the memory held idle by COSMA can be used for other operations?

EDIT: I can run energy-only calculations with 128 water molecules (just PDGEMM call 1) on 64 nodes, and I can run the full calculations on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to have to search for a suitable number of nodes for a given calculation.

EDIT2: The calculation with COSMA on 2048 nodes requires three times the resources of the same calculation with ScaLAPACK on 128 nodes.

airmler commented 2 years ago

I am not a COSMA developer, but I can offer some advice: simply export COSMA_CPU_MAX_MEMORY=XXX with a value around 2-3 times what you need to store the matrices. This should be enough to find a reasonable setting for COSMA, and you should outperform ScaLAPACK (at least for the large-k case). ScaLAPACK should need roughly twice the memory of the matrices, as it uses the SUMMA algorithm.
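A minimal sketch of how one might turn this rule of thumb into a concrete number (my assumption, to be checked against the COSMA README, is that COSMA_CPU_MAX_MEMORY is interpreted per MPI rank and in megabytes):

```cpp
// Sketch: derive a COSMA_CPU_MAX_MEMORY value from "2-3x the matrix footprint".
// Assumptions: double precision, and the limit is per MPI rank in megabytes
// (please verify the unit and semantics against the COSMA README you are using).
#include <cstdio>

int main() {
    const long long m = 17408, n = 17408, k = 3473408;  // PDGEMM call 1 above
    const int nranks = 64 * 12;    // e.g. 64 Daint GPU nodes x 12 ranks/node (adjust)
    const double factor = 2.5;     // the suggested 2-3x safety factor

    const double bytes = 8.0 * (static_cast<double>(m) * k +   // A
                                static_cast<double>(k) * n +   // B
                                static_cast<double>(m) * n);   // C
    const double mb_per_rank = bytes / nranks / (1024.0 * 1024.0);

    std::printf("export COSMA_CPU_MAX_MEMORY=%lld\n",
                static_cast<long long>(factor * mb_per_rank));
    return 0;
}
```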

fstein93 commented 2 years ago

That does not help with the default settings. I could get it running simply by setting COSMA_ADAPT_STRATEGY=OFF, but I wonder why the default strategy does not handle this case properly.
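In case it helps others: since the batch script is not always easy to touch, here is a minimal sketch of setting these variables from the application itself (my assumption being that COSMA reads them when it is first used, so this must happen before the first PDGEMM call):

```cpp
// Sketch: set the COSMA tuning variables programmatically instead of exporting
// them in the batch script. Assumption: COSMA evaluates these environment
// variables on first use, so they must be set before the first PDGEMM call.
#include <cstdlib>   // setenv (POSIX)

void configure_cosma_workaround() {
    // Disable the adaptive strategy that led to the OOM events in my runs.
    setenv("COSMA_ADAPT_STRATEGY", "OFF", /*overwrite=*/1);
    // Optionally cap COSMA's CPU buffers (value in MB; this number is only a
    // placeholder and has to be chosen for the machine and problem at hand).
    setenv("COSMA_CPU_MAX_MEMORY", "3000", /*overwrite=*/1);
}

int main() {
    configure_cosma_workaround();
    // ... initialize MPI/BLACS and issue the PDGEMM calls as usual ...
    return 0;
}
```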

ajaypanyala commented 1 year ago

I have the same issue with GPU runs on NERSC Perlmutter. I am running the COSMA matrix-multiply miniapp with m=n=k=25000, and it fails with OOM errors even on 100 nodes. I built COSMA with the regular CUDA options (no NCCL or GPU-aware MPI).
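For scale, a quick check of the global matrix footprint in this run (assuming double precision) shows that the matrices themselves are tiny compared to the aggregate memory of 100 nodes, so the OOM presumably comes from internal buffers rather than from A, B and C:

```cpp
// Quick check: global matrix footprint for m = n = k = 25000, double precision.
#include <cstdio>

int main() {
    const double elems = 25000.0 * 25000.0;   // elements per matrix
    const double gb = elems * 8.0 / 1e9;      // ~5 GB per matrix
    std::printf("A, B, C: %.1f GB each, %.1f GB in total\n", gb, 3.0 * gb);
    return 0;
}
```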