Open fstein93 opened 2 years ago
I am not a COSMA developer, but I can offer some advice:
Simply set `export COSMA_CPU_MAX_MEMORY=XXX` to a value around 2-3 times what you need to store the matrices. This should be enough for COSMA to find a reasonable strategy, and you should outperform ScaLAPACK (at least for the large-K case).
ScaLAPACK should need roughly twice the memory of the matrices, as it uses the SUMMA algorithm.
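If it helps, here is a minimal sketch of how I would derive such a value for a given problem size. The megabyte unit for `COSMA_CPU_MAX_MEMORY` is an assumption on my part, not something confirmed in this thread; please check the COSMA README for the exact unit before relying on this.

```shell
# Hedged sketch: estimate storage for three dense double-precision matrices
# (A: m x k, B: k x n, C: m x n) and cap COSMA at ~3x that amount.
# ASSUMPTION: COSMA_CPU_MAX_MEMORY is interpreted in megabytes.
M=25000; N=25000; K=25000
BYTES_PER_DOUBLE=8
MATRIX_BYTES=$(( (M*K + K*N + M*N) * BYTES_PER_DOUBLE ))
# allow ~3x the raw matrix storage, converted to (binary) megabytes
MAX_MB=$(( 3 * MATRIX_BYTES / 1024 / 1024 ))
export COSMA_CPU_MAX_MEMORY=$MAX_MB
echo "$COSMA_CPU_MAX_MEMORY"
```

For the m = n = k = 25000 case discussed below, this yields a cap of roughly 43 GB in total, which you would then divide across ranks as appropriate for your job layout.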
That does not help with the default settings. I could get it running simply by setting `COSMA_ADAPT_STRATEGY=OFF`, but I wonder why the default strategy does not handle this case properly.
I have the same issue with GPU runs on NERSC Perlmutter. I am running the COSMA matrix-multiply miniapp with m=n=k=25000, and it fails with OOM errors even on 100 nodes. I built COSMA with the regular CUDA options (no NCCL or GPU-aware MPI).
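For scale, a quick back-of-the-envelope estimate of the raw matrix storage for m = n = k = 25000 (my own arithmetic, not COSMA output):

```shell
# Rough estimate of raw double-precision storage for m = n = k = 25000.
M=25000
ELEMS=$(( 3 * M * M ))             # A, B and C are each 25000 x 25000
GB=$(( ELEMS * 8 / 1000000000 ))   # 8 bytes per double, decimal GB
echo "$GB"
```

The three matrices together need only about 15 GB, which makes OOM failures on 100 nodes all the more surprising and suggests the default strategy is reserving far more memory than the problem itself requires.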
Dear authors,
I am one of the CP2K developers, and I am working on our quartically-scaling SOS-MP2 and RPA implementations. Marko Kabic used energy-only RPA calculations to benchmark COSMA (test system: 128 water molecules). I am currently implementing gradients for these methods; I know that my gradient implementation (available in the CP2K master trunk) requires roughly 3-4 times the memory of an energy-only calculation. I am testing the code on the GPU partition of Daint. The code runs well with ScaLAPACK (libsci_acc). With COSMA, I can run a smaller system (up to 64 water molecules) and see a decent speedup of the PDGEMM calls compared to ScaLAPACK. Unfortunately, I cannot run larger systems (like 128 water molecules) even on 1000 nodes.
A gradient calculation consists of a set of two PDGEMM calls with the following global sizes in the case of 128 H2O molecules:
Depending on the setup, I observe out-of-memory events on both the GPU and the CPU when COSMA is called.
My questions are:
EDIT: I can run energy-only calculations with 128 water molecules (just PDGEMM step 1) on 64 nodes. I can run the full calculation on 2048 Daint nodes. Nevertheless, the memory requirements are extremely high, and it is very frustrating (and a waste of resources) to search for a suitable number of nodes for a given calculation.
EDIT2: The calculation with COSMA on 2048 nodes requires three times the resources of the ScaLAPACK run on 128 nodes.