electronic-structure / SIRIUS

Domain specific library for electronic structure calculations
BSD 3-Clause "New" or "Revised" License

[feature] Support for DLA-Future #900

Closed toxa81 closed 10 months ago

toxa81 commented 10 months ago

To me it looks good. One question: if we merge this but don't yet update the Spack sirius recipe, that won't be a problem, correct?

RMeli commented 10 months ago

Yes, I don't see a problem. I would only upstream it to Spack once there is a new release of DLA-Future (containing the generalized eigensolver C API).

Running the CPU tests locally, I had a few cpu_band_parallel tests timing out, so I'd like to run them on daint/alps before merging.

toxa81 commented 10 months ago

cscs-ci run default

toxa81 commented 10 months ago

I removed the CUDA builds from the GitHub CI/CD. I don't know why it is still here; ignore it.

RMeli commented 10 months ago

@toxa81 Not sure, maybe you did not pull #908 from develop before merging? Anyway, I just merged the latest develop and it looks alright now.

toxa81 commented 10 months ago

utils::random() has moved to sirius::random() and is now defined in ./core/math_tools.hpp.
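
For reference, a minimal sketch of a call site after the rename (illustrative only; it assumes the template signature T random() is unchanged and that only the namespace and header location change):

#include "core/math_tools.hpp"  // new home of the random() helper
#include <complex>

// Fill a buffer with random complex values (hypothetical helper for illustration).
void fill_with_noise(std::complex<double>* buf, int n)
{
    for (int i = 0; i < n; i++) {
        // previously: buf[i] = utils::random<std::complex<double>>();
        buf[i] = sirius::random<std::complex<double>>();
    }
}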

RMeli commented 10 months ago

I noticed, see https://github.com/electronic-structure/SIRIUS/pull/910

toxa81 commented 10 months ago

@RMeli To me it looks good. I propose to merge it. Then I can finish the la:: namespace relocation to ./core and start one more big refactoring for mdarray.

RMeli commented 10 months ago

@toxa81 Sorry for the delay on this PR; I have been focusing on the CP2K integration for the past two weeks. If this PR is a blocker for further changes/refactoring, then I think it can be merged, with the understanding that the integration is experimental and not fully tested: most application tests pass, but a few crash or time out and need further investigation.

I haven't compiled/tested a CUDA version yet.

If you do want to go ahead with the merge, since it's only on develop, should we add an [info] line to the output (or some sort of warning) to make this clear to potential users of the develop branch?
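
Something along these lines, purely as an illustration (the function name, guard, and output mechanism below are assumptions, not existing SIRIUS code):

#include <iostream>
#include <string>

// Illustrative sketch: print a notice when the experimental DLA-Future
// eigensolver is selected. Name and placement are hypothetical.
void warn_if_experimental_solver(std::string const& evp_solver_name)
{
    if (evp_solver_name == "dlaf") {
        std::cout << "[info] DLA-Future eigensolver support is experimental; "
                     "some band-parallel tests are known to fail or time out\n";
    }
}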


FYI, when testing locally (CPU-only) I have the following report:

82% tests passed, 5 tests failed out of 28

Label Time Summary:
integration_test cpu_band_parallel    = 1407.10 sec*proc (28 tests)

Total Test time (real) = 1407.14 sec

The following tests FAILED:
     37 - sirius.scf_test02_cpu_band_parallel (Failed)
     57 - sirius.scf_test12_cpu_band_parallel (Failed)
     65 - sirius.scf_test17_cpu_band_parallel (Timeout)
     67 - sirius.scf_test18_cpu_band_parallel (Timeout)
     69 - sirius.scf_test19_cpu_band_parallel (Timeout)
Errors while running CTest

test02 seems to hang on daint, but I haven't fully investigated it yet.

The tests that time out all print

65: vector::reserve

as their last output.


SIRIUS has been compiled with

cmake \
  -DSIRIUS_USE_CUDA=OFF \
  -DSIRIUS_USE_MEMORY_POOL=OFF \
  -DSIRIUS_USE_SCALAPACK=ON \
  -DSIRIUS_BUILD_APPS=ON \
  -DBUILD_TESTING=ON \
  -DSIRIUS_USE_DLAF=ON \
  -DSIRIUS_USE_MKL=ON \
  -DMPIRUN="mpiexec -n 4 -bind-to core:3 pika-bind --" \
  ..

and run with

export OMP_NUM_THREADS=3
ctest -R band_parallel --timeout 300

Therefore, the tests run with 4 MPI ranks, each with 3 pika threads and 3 OpenMP threads. The same issues appear when compiling with -DSIRIUS_USE_MEMORY_POOL=ON.


Changes to run ctests with DLA-Future:

diff --git a/apps/mini_app/CMakeLists.txt b/apps/mini_app/CMakeLists.txt
index 81e78500..98c64450 100644
--- a/apps/mini_app/CMakeLists.txt
+++ b/apps/mini_app/CMakeLists.txt
@@ -15,10 +15,10 @@ if(SIRIUS_USE_CUDA OR SIRIUS_USE_ROCM)
     set(SIRIUS_SCF_LABELS ${SIRIUS_SCF_LABELS} gpu_serial gpu_band_parallel gpu_k_point_parallel)
 endif()
 set(SIRIUS_SCF_FLAGS_gpu_serial           --control.processing_unit=gpu --control.std_evp_solver_name=cusolver --control.gen_evp_solver_name=cusolver)
-set(SIRIUS_SCF_FLAGS_gpu_band_parallel    --control.processing_unit=gpu --control.mpi_grid_dims=2:2 --control.std_evp_solver_name=scalapack --control.gen_evp_solver_name=scalapack)
+set(SIRIUS_SCF_FLAGS_gpu_band_parallel    --control.processing_unit=gpu --control.mpi_grid_dims=2:2 --control.std_evp_solver_name=dlaf --control.gen_evp_solver_name=dlaf)
 set(SIRIUS_SCF_FLAGS_gpu_k_point_parallel --control.processing_unit=gpu --control.std_evp_solver_name=cusolver --control.gen_evp_solver_name=cusolver)
 set(SIRIUS_SCF_FLAGS_cpu_serial           --control.processing_unit=cpu)
-set(SIRIUS_SCF_FLAGS_cpu_band_parallel    --control.processing_unit=cpu --control.mpi_grid_dims=2:2 --control.std_evp_solver_name=scalapack --control.gen_evp_solver_name=scalapack)
+set(SIRIUS_SCF_FLAGS_cpu_band_parallel    --control.processing_unit=cpu --control.mpi_grid_dims=2:2 --control.std_evp_solver_name=dlaf --control.gen_evp_solver_name=dlaf)