ECP-copa / ExaMPM

Material point method proxy application based on Cabana.
BSD 3-Clause "New" or "Revised" License

Issues running with MPI #56

Open vsoch opened 11 months ago

vsoch commented 11 months ago

Hi! I'm new to using this app and was wondering if you have an example of running it with mpirun (or similar). I'm looking at the docs here: https://github.com/ECP-copa/ExaMPM/wiki/Run

Thank you! And apologies if this is an overly simple question (e.g., just put mpirun in front of that :P )

vsoch commented 11 months ago

I'm also trying to build (and running into issues):

root@526890c2f022:/opt/exaMPM/build#     cmake -D CMAKE_BUILD_TYPE="Release" \
      -D CMAKE_PREFIX_PATH=$CABANA_INSTALL_DIR \
      -D CMAKE_INSTALL_PREFIX=install .. && \
    make install
-- The CXX compiler identification is GNU 11.4.0
-- The C compiler identification is GNU 11.4.0
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Enabled Kokkos devices: OPENMP;SERIAL
-- Found MPI_CXX: /usr/lib/x86_64-linux-gnu/libmpichcxx.so (found version "4.0") 
-- Found MPI: TRUE (found version "4.0") found components: CXX 
-- Found CLANG_FORMAT: /usr/local/bin/clang-format (found suitable version "17.0.5", minimum required is "14") 
-- Configuring done
CMake Error at src/CMakeLists.txt:20 (add_library):
  Target "exampm" links to target "Cabana::Core" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?

CMake Error at src/CMakeLists.txt:20 (add_library):
  Target "exampm" links to target "Cabana::Grid" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?

CMake Error at examples/CMakeLists.txt:1 (add_executable):
  Target "DamBreak" links to target "Cabana::Core" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?

CMake Error at examples/CMakeLists.txt:1 (add_executable):
  Target "DamBreak" links to target "Cabana::Grid" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?

CMake Error at examples/CMakeLists.txt:4 (add_executable):
  Target "FreeFall" links to target "Cabana::Core" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?

CMake Error at examples/CMakeLists.txt:4 (add_executable):
  Target "FreeFall" links to target "Cabana::Grid" but the target was not
  found.  Perhaps a find_package() call is missing for an IMPORTED target, or
  an ALIAS target is missing?

-- Generating done
CMake Generate step failed.  Build files cannot be regenerated correctly.

My Cabana install directory:

# ls $CABANA_INSTALL_DIR/
include  lib
root@526890c2f022:/opt/Cabana# ls  
AUTHORS       CMakeLists.txt   LICENSE    benchmark  cajita  core    example
CHANGELOG.md  CONTRIBUTING.md  README.md  build      cmake   docker
root@526890c2f022:/opt/Cabana# ls build/
CMakeCache.txt              CabanaConfigVersion.cmake  cmake_install.cmake
CMakeDoxyfile.in            Doxyfile.doxygen           core
CMakeDoxygenDefaults.cmake  Makefile                   example
CMakeFiles                  _deps                      install
CTestTestfile.cmake         bin                        install_manifest.txt
Cabana.pc                   cajita                     lib
CabanaConfig.cmake          cmake

root@526890c2f022:/opt/Cabana# ls build/install
include  lib

root@526890c2f022:/opt/Cabana# ls build        
CMakeCache.txt              CabanaConfigVersion.cmake  cmake_install.cmake
CMakeDoxyfile.in            Doxyfile.doxygen           core
CMakeDoxygenDefaults.cmake  Makefile                   example
CMakeFiles                  _deps                      install
CTestTestfile.cmake         bin                        install_manifest.txt
Cabana.pc                   cajita                     lib
CabanaConfig.cmake          cmake

root@526890c2f022:/opt/Cabana# ls build/install
include  lib

root@526890c2f022:/opt/Cabana# echo $CABANA_INSTALL_DIR/
/opt/Cabana/build/install/

root@526890c2f022:/opt/Cabana# ls build/install/lib/
cmake  libgmock.a  libgmock_main.a  libgtest.a  libgtest_main.a  pkgconfig

root@526890c2f022:/opt/Cabana# ls build/install/include/
CabanaCore_config.hpp         Cajita_GlobalGrid.hpp
Cabana_AoSoA.hpp              Cajita_GlobalGrid_impl.hpp
Cabana_CommunicationPlan.hpp  Cajita_GlobalMesh.hpp
Cabana_Core.hpp               Cajita_Halo.hpp
Cabana_DeepCopy.hpp           Cajita_IndexConversion.hpp
Cabana_Distributor.hpp        Cajita_IndexSpace.hpp
Cabana_ExecutionPolicy.hpp    Cajita_Interpolation.hpp
Cabana_Halo.hpp               Cajita_LocalGrid.hpp
Cabana_LinkedCellList.hpp     Cajita_LocalGrid_impl.hpp
Cabana_MemberTypes.hpp        Cajita_LocalMesh.hpp
Cabana_NeighborList.hpp       Cajita_ManualPartitioner.hpp
Cabana_Parallel.hpp           Cajita_MpiTraits.hpp
Cabana_ParameterPack.hpp      Cajita_Parallel.hpp
Cabana_Slice.hpp              Cajita_ParticleGridDistributor.hpp
Cabana_SoA.hpp                Cajita_Partitioner.hpp
Cabana_Sort.hpp               Cajita_ReferenceStructuredSolver.hpp
Cabana_Tuple.hpp              Cajita_SparseDimPartitioner.hpp
Cabana_Types.hpp              Cajita_SparseIndexSpace.hpp
Cabana_VerletList.hpp         Cajita_Splines.hpp
Cabana_Version.hpp            Cajita_Types.hpp
Cajita.hpp                    Cajita_UniformDimPartitioner.hpp
Cajita_Array.hpp              gmock
Cajita_BovWriter.hpp          gtest
Cajita_Config.hpp             impl

Did I forget to build something?

streeve commented 11 months ago

For the build issues, you just need a newer version of Cabana (0.6.1 or up-to-date master). I just opened a PR to give a clear error at configuration time.
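A minimal sketch of checking the installed Cabana version against the 0.6.1 minimum mentioned above. It assumes the `CabanaConfigVersion.cmake` file seen in the build listing follows the format CMake normally generates (a `set(PACKAGE_VERSION "...")` line) and that GNU `sort -V` is available; the function names are hypothetical helpers, not part of Cabana or ExaMPM.

```shell
#!/usr/bin/env bash
# Sketch: read the version from a CabanaConfigVersion.cmake file and compare
# it with a minimum version. Function names are illustrative only.

# Print the PACKAGE_VERSION value found in the given version file.
# Assumes the standard CMake-generated line: set(PACKAGE_VERSION "x.y.z")
cabana_version() {
    sed -n 's/^set(PACKAGE_VERSION "\([^"]*\)").*/\1/p' "$1" | head -n 1
}

# Succeed when version $1 is at least version $2 (relies on GNU sort -V).
version_at_least() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}
```

With the paths from this thread, this could be used as `version_at_least "$(cabana_version /opt/Cabana/build/CabanaConfigVersion.cmake)" 0.6.1 || echo "Cabana too old"`.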

For the run, I went ahead and updated the wiki page. Exactly as you guessed, you can just add mpirun in front of the command.
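As a rough illustration of "just add mpirun", the sketch below prints a launch line rather than running it, so it can be inspected first. The example arguments are the ones used with DamBreak later in this thread, the rank count of 2 is arbitrary, and `exampm_mpi_command` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: build the launch line for an ExaMPM example under MPI.
# The helper name is illustrative; -np sets the number of MPI ranks.

# Print "mpirun -np <ranks> <program and its arguments>".
exampm_mpi_command() {
    local ranks="$1"
    shift
    printf 'mpirun -np %s %s\n' "$ranks" "$*"
}

# Example arguments taken from the DamBreak run shown in this thread.
exampm_mpi_command 2 ./DamBreak 0.05 2 0 0.001 1.0 50 serial
```

Running the printed line from the directory that contains the example binary launches it on the local machine; multi-host runs additionally need a hostfile.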

vsoch commented 11 months ago

Thank you! A quick follow-up question (I'm not great at debugging MPI). I can confirm that I can ping the other host and ssh into it from my launcher, but mpirun gives me an error. Here are the details.

Here are my hosts:

# cat hostlist.txt 
metricset-sample-l-0-0.ms.default.svc.cluster.local
metricset-sample-w-0-0.ms.default.svc.cluster.local

Ping works to the worker (w) node:

# ping metricset-sample-w-0-0.ms.default.svc.cluster.local
PING metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16) 56(84) bytes of data.
64 bytes from metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16): icmp_seq=1 ttl=63 time=0.097 ms
64 bytes from metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16): icmp_seq=2 ttl=63 time=0.058 ms
64 bytes from metricset-sample-w-0-0.ms.default.svc.cluster.local (10.244.0.16): icmp_seq=3 ttl=63 time=0.050 ms
^C

mpirun spits out this error:

# mpirun --hostfile ./hostlist.txt --allow-run-as-root -N 2 ./DamBreak 0.05 2 0 0.001 1.0 50 serial
ssh: Could not resolve hostname metricset-sample-w-0-0: Temporary failure in name resolution
--------------------------------------------------------------------------
ORTE was unable to reliably start one or more daemons.
This usually is caused by:

* not finding the required libraries and/or binaries on
  one or more nodes. Please check your PATH and LD_LIBRARY_PATH
  settings, or configure OMPI with --enable-orterun-prefix-by-default

* lack of authority to execute on one or more specified nodes.
  Please verify your allocation and authorities.

* the inability to write startup files into /tmp (--tmpdir/orte_tmpdir_base).
  Please check with your sys admin to determine the correct location to use.

*  compilation of the orted with dynamic libraries when static are required
  (e.g., on Cray). Please check your configure cmd line and consider using
  one of the contrib/platform definitions for your system type.

* an inability to create a connection back to mpirun due to a
  lack of common network interfaces and/or no route found between
  them. Please check network connectivity (including firewalls
  and network routing requirements).
--------------------------------------------------------------------------

ssh to the other host works too!

root@metricset-sample-l-0-0:/opt/exaMPM/build/examples# ssh metricset-sample-w-0-0.ms.default.svc.cluster.local
Welcome to Ubuntu 22.04.3 LTS (GNU/Linux 6.2.0-37-generic x86_64)

 * Documentation:  https://help.ubuntu.com
 * Management:     https://landscape.canonical.com
 * Support:        https://ubuntu.com/advantage

This system has been minimized by removing packages and content that are
not required on a system that users do not log into.

To restore this content, you can run the 'unminimize' command.

The programs included with the Ubuntu system are free software;
the exact distribution terms for each program are described in the
individual files in /usr/share/doc/*/copyright.

Ubuntu comes with ABSOLUTELY NO WARRANTY, to the extent permitted by
applicable law.

This is the MPI I'm using (maybe the wrong one or version?):

mpirun (Open MPI) 4.1.2

Usage: mpirun [OPTION]...  [PROGRAM]...
Start the given program using Open RTE

Thanks for your help!
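One detail in the output above: the ssh error names the short hostname `metricset-sample-w-0-0`, while the ping and ssh tests used the fully qualified name. The sketch below checks whether both forms of each hostfile entry resolve. It assumes `getent` is present in the container, and `short_name` and `check_hostfile` are hypothetical helper names.

```shell
#!/usr/bin/env bash
# Sketch: for each host in a hostfile, check whether the fully qualified
# name and the short name (text before the first dot) both resolve.

# Resolver command; getent is an assumption about the container image.
RESOLVE_CMD="${RESOLVE_CMD:-getent hosts}"

# Print the part of a hostname before the first dot.
short_name() {
    printf '%s\n' "${1%%.*}"
}

# Report ok/FAILED for every name derived from the hostfile given as $1.
check_hostfile() {
    local host short names name
    while read -r host _ || [ -n "$host" ]; do
        [ -z "$host" ] && continue
        short="$(short_name "$host")"
        names="$host"
        [ "$short" != "$host" ] && names="$host $short"
        for name in $names; do
            if $RESOLVE_CMD "$name" >/dev/null 2>&1; then
                printf 'ok      %s\n' "$name"
            else
                printf 'FAILED  %s\n' "$name"
            fi
        done
    done < "$1"
}
```

Calling `check_hostfile ./hostlist.txt` on the launcher would show whether only the short names fail. If so, one thing that may be worth trying (not confirmed anywhere in this thread) is Open MPI's `orte_keep_fqdn_hostnames` MCA parameter, for example `mpirun --mca orte_keep_fqdn_hostnames 1 ...`.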

streeve commented 11 months ago

Looks to be an MPI configuration issue (which I don't think I can help with), but I can at least confirm that this version of MPI is one we test against regularly.

vsoch commented 11 months ago

This is probably my stopping point for working on it, then. I'm not sure what the problem above is (and I'm still inexperienced with MPI). For context, I was going to add it to the metrics operator (https://github.com/converged-computing/metrics-operator) and use it for converged computing experiments on Kubernetes, but I'll skip it and move on to the next one. Thanks!

streeve commented 10 months ago

Now that I see you have a LAMMPS case, maybe using exactly the same MPI call as they use would make a difference here? Just a thought.

vsoch commented 10 months ago

Thanks for the suggestion! The LAMMPS metrics container uses MPICH and the one here uses Open MPI, so I don't think we could do that. I did try MPICH too (with the same command), and that did not work either.
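Since both MPICH and Open MPI appear in this thread (the configure log reports `libmpichcxx.so`, while the `mpirun` version output says Open MPI 4.1.2, possibly from different container images), it may help to confirm which launcher is on the PATH in a given container. The sketch below classifies the text of `mpirun --version`. It assumes that Open MPI prints "Open MPI" and that MPICH's Hydra launcher prints "HYDRA"; `mpi_flavor` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: guess the MPI implementation from the text of `mpirun --version`.
# The matching strings are assumptions about typical version banners.

mpi_flavor() {
    case "$1" in
        *"Open MPI"*)    echo "openmpi" ;;
        *HYDRA*|*MPICH*) echo "mpich" ;;
        *)               echo "unknown" ;;
    esac
}
```

A possible use is `mpi_flavor "$(mpirun --version 2>&1)"`, compared against the MPI library that CMake reported when the application was configured.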