eth-cscs / DLA-Future

https://eth-cscs.github.io/DLA-Future/master/
BSD 3-Clause "New" or "Revised" License

NaNs observed in reduction to band #974

Closed · RMeli closed this 1 year ago

RMeli commented 1 year ago

With #953 fixed, I was able to run most of the CP2K regression tests with DLAF_NEIGVEC_MIN=2 (i.e. using the DLA-Future eigensolver for all matrices of size 2x2 or above). However, some tests still fail with NaNs.


One such batch of failing tests is QS/regtest-md-extrap/extrap-*-far (while the corresponding QS/regtest-md-extrap/extrap-* tests without the -far suffix all pass).

Focusing on QS/regtest-md-extrap/extrap-1-far.inp (a matrix of size 46), I printed out the matrices (in NumPy format) at different stages of the eigensolver: input, band matrix, tridiagonal matrix, eigenvalues, and eigenvectors. NaNs are already present in the band matrix.

After converting the input matrix to HDF5 and using #973, I was able to make some initial observations:

Details

Adding some details here so they don't get lost in my notes.

Changes to DLAF to dump matrices in NumPy format

diff --git a/include/dlaf/eigensolver/eigensolver/impl.h b/include/dlaf/eigensolver/eigensolver/impl.h
index 7601af5e..4c67a4ad 100644
--- a/include/dlaf/eigensolver/eigensolver/impl.h
+++ b/include/dlaf/eigensolver/eigensolver/impl.h
@@ -28,6 +28,8 @@
+#include <fstream>
 #include <dlaf/matrix/distribution.h>
 #include <dlaf/matrix/hdf5.h>
 #include <dlaf/matrix/matrix.h>
+#include <dlaf/matrix/print_numpy.h>
 #include <dlaf/types.h>
 #include <dlaf/util_matrix.h>

@@ -60,28 +62,23 @@ void Eigensolver<B, D, T>::call(comm::CommunicatorGrid grid, blas::Uplo uplo, Ma
   if (uplo != blas::Uplo::Lower)
     DLAF_UNIMPLEMENTED(uplo);

-  auto mat_taus = reduction_to_band<B>(grid, mat_a, band_size);
-  auto ret = band_to_tridiagonal<Backend::MC>(grid, uplo, band_size, mat_a);
+  std::ofstream fmat("mat.py");
+  dlaf::matrix::print(dlaf::format::numpy{}, "a", mat_a, fmat);

-#ifdef DLAF_WITH_HDF5
-  std::optional<matrix::internal::FileHDF5> file;
+  auto mat_taus = reduction_to_band<B>(grid, mat_a, band_size);
+  dlaf::matrix::print(dlaf::format::numpy{}, "a_band", mat_a, fmat);

-  if (getTuneParameters().debug_dump_trisolver_data) {
-    file = matrix::internal::FileHDF5(grid.fullCommunicator(), "trid-ref.h5");
-    file->write(ret.tridiagonal, "/tridiag");
-  }
-#endif
+  auto ret = band_to_tridiagonal<Backend::MC>(grid, uplo, band_size, mat_a);
+  dlaf::matrix::print(dlaf::format::numpy{}, "a_tridiag", ret.tridiagonal, fmat);

   tridiagonal_eigensolver<B>(grid, ret.tridiagonal, evals, mat_e);

-#ifdef DLAF_WITH_HDF5
-  if (getTuneParameters().debug_dump_trisolver_data) {
-    file->write(evals, "/evals");
-    file->write(mat_e, "/evecs");
-  }
-#endif
-
   bt_band_to_tridiagonal<B>(grid, band_size, mat_e, ret.hh_reflectors);
   bt_reduction_to_band<B>(grid, band_size, mat_e, mat_a, mat_taus);
+  
+  dlaf::matrix::print(dlaf::format::numpy{}, "evals", evals, fmat);
+  dlaf::matrix::print(dlaf::format::numpy{}, "evecs", mat_e, fmat);
+ 
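
With the patch in place, a quick scan of the dumped mat.py shows at which stage the NaNs first appear (a minimal sketch; a, a_band, and a_tridiag are the names written by the patch above):

import numpy as np
import mat  # the mat.py file written by the patched eigensolver

# The input should be NaN-free; the band matrix is where NaNs first appear.
for name in ("a", "a_band", "a_tridiag"):
    m = getattr(mat, name)
    idx = np.argwhere(np.isnan(m))
    print(f"{name}: {len(idx)} NaN entries")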

Conversion of NumPy matrices to HDF5

import mat  # the mat.py file written by the patched eigensolver
import numpy as np
import h5py

# Transpose and add a trailing unit axis to match the dataset
# layout expected by the HDF5 input support from #973.
h5f = h5py.File('input.h5', 'w')
h5f.create_dataset("a", data=mat.a.T[:, :, np.newaxis])
h5f.close()
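
To sanity-check the conversion before feeding it to the miniapp, the file can be read back (a quick check; the expected shape assumes the 46x46 matrix from extrap-1-far.inp):

import h5py
import numpy as np

with h5py.File('input.h5', 'r') as f:
    a = f['a'][...]

print(a.shape)            # (46, 46, 1) for this test case
print(np.isnan(a).any())  # the input itself should be NaN-free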

Run the miniapp

Requires #973.

miniapp/miniapp_eigensolver --pika:bind=none --pika:threads=12 --grid-rows=1 --grid-cols=1 --nruns 1 --nwarmups 0 --block-size 32 --type s --input-file ../../cp2k/input.h5 --output-file output.h5
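
Since the exact dataset layout of output.h5 (written via #973) isn't spelled out here, a generic scan avoids hard-coding dataset names (a sketch; it only assumes the file is readable with h5py):

import h5py
import numpy as np

# Walk every floating-point dataset in the miniapp output and flag NaNs.
def check(name, obj):
    if isinstance(obj, h5py.Dataset) and np.issubdtype(obj.dtype, np.floating):
        data = obj[...]
        print(name, data.shape, "NaNs:", np.isnan(data).any())

with h5py.File('output.h5', 'r') as f:
    f.visititems(check)
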
RMeli commented 1 year ago

With block_size=32 (CP2K's default):

[Figure_1: attached image]

RMeli commented 1 year ago

This issue seems to affect both the distributed and local implementations.