dealii / candi

candi - (Compile & Install) - Downloads, configures, builds and installs deal.II
GNU Lesser General Public License v3.0

Problems with Trilinos TPetra instantiations when using Intel MKL #390

Open gassmoeller opened 4 months ago

gassmoeller commented 4 months ago

We are having problems installing Trilinos on TACC Frontera with the latest candi version (see the bug report in the ASPECT forum). We see errors like the following:

ld.bfd: /work2/10103/hx38324/frontera/libs/trilinos-release-14-4-0/lib/libstratimikosbelos.so.14.4: undefined reference to `Tpetra::MultiVector<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::randomize()'
ld.bfd: /work2/10103/hx38324/frontera/libs/trilinos-release-14-4-0/lib/libstratimikosbelos.so.14.4: undefined reference to `Tpetra::DistObject<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::packAndPrepare(Tpetra::SrcDistObject const&, Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<float*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>&, Kokkos::DualView<unsigned long*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, unsigned long&)'
ld.bfd: /work2/10103/hx38324/frontera/libs/trilinos-release-14-4-0/lib/libstratimikosbelos.so.14.4: undefined reference to `Tpetra::MultiVector<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::subViewNonConst(Teuchos::Range1D const&)'
ld.bfd: /work2/10103/hx38324/frontera/libs/trilinos-release-14-4-0/lib/libstratimikosbelos.so.14.4: undefined reference to `Tpetra::MultiVector<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::update(float const&, Tpetra::MultiVector<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> > const&, float const&)'
ld.bfd: /work2/10103/hx38324/frontera/libs/trilinos-release-14-4-0/lib/libstratimikosbelos.so.14.4: undefined reference to `Tpetra::DistObject<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::reallocArraysForNumPacketsPerLid(unsigned long, unsigned long)'
ld.bfd: /work2/10103/hx38324/frontera/libs/trilinos-release-14-4-0/lib/libstratimikosbelos.so.14.4: undefined reference to `Tpetra::MultiVector<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::need_sync_device() const'
ld.bfd: /work2/10103/hx38324/frontera/libs/trilinos-release-14-4-0/lib/libstratimikosbelos.so.14.4: undefined reference to `Tpetra::DistObject<float, int, long long, Tpetra::KokkosCompat::KokkosDeviceWrapperNode<Kokkos::Serial, Kokkos::HostSpace> >::unpackAndCombine(Kokkos::DualView<int const*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void> const&, Kokkos::DualView<float*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, Kokkos::DualView<unsigned long*, Kokkos::Device<Kokkos::Serial, Kokkos::HostSpace>, void, void>, unsigned long, Tpetra::CombineMode, Kokkos::Serial const&)'

I played around with this a bit myself, and it looks like one of the Tpetra instantiations (for float) is missing. I found this line, which is active when using MKL (which we do), and it looks like it could be the reason for the missing instantiation. Unfortunately, I cannot simply add that instantiation, because the original bug in MKL is still there (HAVE_TEUCHOS_BLASFLOAT is false, so Trilinos thinks the Intel MKL BLAS implementation does not support float). Interestingly, the candi branch dealii-9.5 compiles without issues, so something must have changed in candi (maybe #375 or #350).
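To check that every undefined reference really involves a float instantiation of Tpetra, the linker output can be filtered. The following is only a minimal sketch: the function name `missing_float_tpetra` is made up for illustration, and it assumes the linker messages have the same form as the ones quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of candi or Trilinos): reads linker output
# on stdin and prints each distinct Tpetra class template whose float
# instantiation is reported as an undefined reference.
missing_float_tpetra() {
    grep 'undefined reference' \
        | grep -o 'Tpetra::[A-Za-z]*<float' \
        | sort -u
}
```

It could be used as `missing_float_tpetra < build.log`, where `build.log` is a placeholder for wherever the build output was saved. If it lists only classes such as MultiVector and DistObject with `<float`, that is consistent with a missing float instantiation rather than a broader linking problem.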

Any pointers for how to resolve this problem would be appreciated.

For now I am trying to work around the problem by using the cluster-provided Trilinos modules and/or the old candi version.

cgcgcg commented 1 month ago

This issue was recently reported to Trilinos: https://github.com/trilinos/Trilinos/issues/13456

The problem is that candi disables the float scalar type for Tpetra in builds with MKL (https://github.com/dealii/candi/blob/b29742545f13c5e61cd6b681932a85e07b25f2a8/deal.II-toolchain/packages/trilinos.package#L150) but enables it for the rest of Trilinos (https://github.com/dealii/candi/blob/b29742545f13c5e61cd6b681932a85e07b25f2a8/deal.II-toolchain/packages/trilinos.package#L241). This exposed an issue with Trilinos' CMake logic that was fixed here: https://github.com/trilinos/Trilinos/pull/13457

The consequence of this change is that candi's configuration for Trilinos will error out with future versions of Trilinos. The fix for candi is to set Trilinos_ENABLE_FLOAT=OFF when building with MKL.
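
As a rough sketch of what that could look like in a bash package script: the helper `append_float_opts` and the variable `USE_MKL` below are hypothetical and are not the names used in candi's trilinos.package. Only the `Trilinos_ENABLE_FLOAT=OFF` setting comes from the suggestion above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: append the float switch to a string of CMake
# options only when the build uses MKL.
#   $1: existing CMake options
#   $2: "ON" if building against MKL, anything else otherwise
append_float_opts() {
    local confopts="$1" use_mkl="$2"
    if [ "${use_mkl}" = "ON" ]; then
        # Disable float for all of Trilinos, not only for Tpetra, so the
        # scalar type settings stay consistent across packages.
        confopts="${confopts} -D Trilinos_ENABLE_FLOAT=OFF"
    fi
    printf '%s\n' "${confopts}"
}

# Illustrative use; USE_MKL stands in for however the script detects MKL.
CONFOPTS="-D CMAKE_BUILD_TYPE=RELEASE"
CONFOPTS="$(append_float_opts "${CONFOPTS}" "${USE_MKL:-OFF}")"
echo "cmake ${CONFOPTS}"
```

Keeping the switch in one place tied to the MKL check avoids the mismatch described above, where float is off for Tpetra but still on for the rest of Trilinos.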