KratosMultiphysics / Kratos

Kratos Multiphysics (a.k.a. Kratos) is a framework for building parallel multi-disciplinary simulation software. Modularity, extensibility and HPC are the main objectives. Kratos has a BSD license and is written in C++ with an extensive Python interface.
https://kratosmultiphysics.github.io/Kratos/

[HDF5] Replace Raw MPI Calls #10728

Closed · matekelemen closed this issue 1 year ago

matekelemen commented 1 year ago

Description

The HDF5Application has to do MPI synchronization in its internals (obviously), but the C++ side always invokes raw MPI calls on MPI_COMM_WORLD instead of relying on DataCommunicator. This will lead to hangs and other unwanted behaviour if proper MPI communication (with subcommunicators) ever gets implemented in the core.
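For illustration, here is a minimal sketch (my own, not the actual HDF5Application code) contrasting the two patterns; it assumes the core's DataCommunicator interface (Rank(), Barrier()) and access through the ModelPart's Communicator:

```cpp
// Hypothetical sketch contrasting the current and the desired pattern.
#include <mpi.h>
#include "includes/model_part.h"
#include "includes/data_communicator.h"

void SynchronizeRaw()
{
    // Current pattern: hard-wired to MPI_COMM_WORLD. If only a subset of
    // ranks (a subcommunicator) reaches this collective, the call hangs.
    int rank = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Barrier(MPI_COMM_WORLD);
}

void SynchronizeViaDataCommunicator(Kratos::ModelPart& rModelPart)
{
    // Desired pattern: ask the ModelPart for its DataCommunicator, which wraps
    // whatever communicator the model part is distributed over (possibly a
    // subcommunicator), or a no-op serial implementation in non-MPI builds.
    const auto& r_comm = rModelPart.GetCommunicator().GetDataCommunicator();
    const int rank = r_comm.Rank();
    r_comm.Barrier();
}
```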

This is not an urgent problem and I definitely don't have time to do it because it involves extensive changes to the HDF5Application, but it will have to be taken care of sooner or later.

ToDo

philbucher commented 1 year ago

Subcommunicators have been available for years; they are just rarely used (in fact, I think I have been the only one really using them so far).

Most apps support this already; the HdfApp is one of the last ones that needs quite some work, i.e. it needs to be split into a base app and an MPI extension. Then the usage of the DataCommunicator will come more or less naturally.

For now, the HdfApp is not usable with MPI communicators other than MPI_COMM_WORLD, which is not a common use case (I used it for a huge FSI simulation where the structure was using fewer MPI cores than the fluid).

matekelemen commented 1 year ago

Subcommunicators have been available for years; they are just rarely used.

I had no idea, thanks for letting me know!

the HdfApp is one of the last ones that needs quite some work, i.e. it needs to be split into a base app and an MPI extension.

Why does it need separation? Is it to make MPI libraries optional when compiling?

philbucher commented 1 year ago

Why does it need separation? Is it to make MPI libraries optional when compiling?

Imagine you compile with MPI support, i.e. MPI is now a hard dependency, even if you run in serial. If you then deploy on a system that does not have the MPI libraries available, it will crash.

The Altair guys have had this issue afaik.

That's the reason MPI is handled in the core the way it currently is.
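A rough sketch of how that plays out with the DataCommunicator abstraction (assuming ParallelEnvironment::GetDefaultDataCommunicator() and SumAll() as exposed by the current core): code written against the interface never includes mpi.h, a serial build gets a no-op implementation, and an MPI build swaps in an MPI-backed one, so a deployment without MPI libraries still loads and runs.

```cpp
// Sketch only: application code depends on the DataCommunicator interface,
// never on MPI headers or libraries directly.
#include "includes/data_communicator.h"
#include "includes/parallel_environment.h"

double GlobalSum(const double LocalValue)
{
    // Serial build: SumAll simply returns LocalValue (no-op communicator).
    // MPI build: SumAll performs an allreduce over the active communicator.
    const auto& r_comm = Kratos::ParallelEnvironment::GetDefaultDataCommunicator();
    return r_comm.SumAll(LocalValue);
}
```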

mpentek commented 1 year ago

We talked with @philbucher. One very relevant use case: CoSim, for example CFD + CSD. Typically, the CFD will run on M MPI processes, while the CSD could run on N (< M) processes; that would be a balanced use of parallel computing resources. Correspondingly, one would like to trigger H5 output from the CFD on M cores and H5 output from the CSD on N cores. This is currently not possible, as the simulation hangs (see the sketch below).
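To make the hang concrete, here is a self-contained, MPI-only illustration (no Kratos involved) of what happens when a library hard-codes MPI_COMM_WORLD while the caller works on a subcommunicator; the rank split below is a placeholder:

```cpp
#include <mpi.h>

int main(int argc, char** argv)
{
    MPI_Init(&argc, &argv);
    int world_rank = 0, world_size = 0;
    MPI_Comm_rank(MPI_COMM_WORLD, &world_rank);
    MPI_Comm_size(MPI_COMM_WORLD, &world_size);

    // Split the world into two groups, e.g. "CSD" ranks and "CFD" ranks.
    const int n_csd = world_size / 2;               // placeholder split
    const int color = (world_rank < n_csd) ? 0 : 1;
    MPI_Comm sub_comm;
    MPI_Comm_split(MPI_COMM_WORLD, color, world_rank, &sub_comm);

    if (color == 0) {
        // Correct: a collective over the subcommunicator; only CSD ranks join.
        MPI_Barrier(sub_comm);
        // Buggy pattern: a collective on MPI_COMM_WORLD here would wait for
        // the CFD ranks, which never call it, so the whole run would hang.
        // MPI_Barrier(MPI_COMM_WORLD);
    }

    MPI_Comm_free(&sub_comm);
    MPI_Finalize();
    return 0;
}
```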

@matekelemen @sunethwarna this could be relevant for Multiphysics with CoSim in the future, if not already.

I understand it is linked to #11473

sunethwarna commented 1 year ago

@mpentek As you mentioned, this is planned to be finished with the chain of PRs starting with #11473.

philbucher commented 1 year ago

@mpentek, can you give it a try? From what I can see, the necessary updates to the HDF5Application were made.

thanks @sunethwarna !

sunethwarna commented 1 year ago

But @mpentek has to wait, because the Python side is still using the default communicator. That is yet to be updated. :/