cb-geo / mpm

CB-Geo High-Performance Material Point Method
https://www.cb-geo.com/research/mpm

Refactor MPI serialization #689

Closed kks32 closed 4 years ago

kks32 commented 4 years ago

MPM Particle serialization

Summary

Add functionality to handle particle serialization and deserialization to transfer particles across MPI tasks.

Motivation

The existing design uses a Plain-Old-Data (POD) structure: the Particle class writes all the relevant information to an HDF5Particle struct. This POD is then serialized and sent over MPI using MPI_Type_create_struct. This requires registering all the different particle types and becomes harder to implement when more than one particle type is involved. The serialization/deserialization functions offer a unified interface to transfer particles.

Design Detail

The Particle class will have a serialize and a deserialize function, both using a std::vector<uint8_t> as the buffer. In addition, we need to compute the pack size to initialize the buffer with the correct size; this is cached in a private member variable.
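
The corresponding declarations added to the Particle class might look roughly as follows (a sketch only; the override/virtual qualifiers and exact types are assumptions rather than the final signatures):

  //! Serialize particle data into a buffer
  std::vector<uint8_t> serialize() override;
  //! Deserialize particle data from a buffer and assign materials
  void deserialize(
      const std::vector<uint8_t>& data,
      std::vector<std::shared_ptr<mpm::Material<Tdim>>>& materials) override;
  //! Compute size of the serialized particle in bytes
  virtual int compute_pack_size() const;
  //! Cached pack size, computed on first serialization
  int pack_size_{0};

The serialize implementation then packs each field in order: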

//! Serialize particle data
template <unsigned Tdim>
std::vector<uint8_t> mpm::Particle<Tdim>::serialize() {
  // Compute pack size
  if (pack_size_ == 0) pack_size_ = compute_pack_size();
  // Initialize data buffer
  std::vector<uint8_t> data;
  data.resize(pack_size_);
  uint8_t* data_ptr = &data[0];
  int position = 0;

#ifdef USE_MPI
  // Type
  int type = ParticleType.at(this->type());
  MPI_Pack(&type, 1, MPI_INT, data_ptr, data.size(), &position, MPI_COMM_WORLD);

  // Material id
  unsigned nmaterials = material_id_.size();
  MPI_Pack(&nmaterials, 1, MPI_UNSIGNED, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
  MPI_Pack(&material_id_[0], 1, MPI_UNSIGNED, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);

  // ID
  MPI_Pack(&id_, 1, MPI_UNSIGNED_LONG_LONG, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
  // Mass
  MPI_Pack(&mass_, 1, MPI_DOUBLE, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
  // Volume
  MPI_Pack(&volume_, 1, MPI_DOUBLE, data_ptr, data.size(), &position,
           MPI_COMM_WORLD);
#endif
  // Return the serialized buffer
  return data;
}
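
For reference, a minimal sketch of compute_pack_size(), assuming the same fields as serialize() above, can accumulate the buffer size with MPI_Pack_size:

//! Compute pack size of the serialized particle (fields must mirror serialize())
template <unsigned Tdim>
int mpm::Particle<Tdim>::compute_pack_size() const {
  int total_size = 0;
#ifdef USE_MPI
  int partial_size = 0;
  // Type
  MPI_Pack_size(1, MPI_INT, MPI_COMM_WORLD, &partial_size);
  total_size += partial_size;
  // Number of materials + material id
  MPI_Pack_size(1, MPI_UNSIGNED, MPI_COMM_WORLD, &partial_size);
  total_size += 2 * partial_size;
  // ID
  MPI_Pack_size(1, MPI_UNSIGNED_LONG_LONG, MPI_COMM_WORLD, &partial_size);
  total_size += partial_size;
  // Mass and volume
  MPI_Pack_size(1, MPI_DOUBLE, MPI_COMM_WORLD, &partial_size);
  total_size += 2 * partial_size;
#endif
  return total_size;
}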

The deserialization function will read from the buffer.

//! Deserialize particle data
template <unsigned Tdim>
void mpm::Particle<Tdim>::deserialize(
    const std::vector<uint8_t>& data,
    std::vector<std::shared_ptr<mpm::Material<Tdim>>>& materials) {
  uint8_t* data_ptr = const_cast<uint8_t*>(&data[0]);
  int position = 0;

#ifdef USE_MPI
  // Type
  int type;
  MPI_Unpack(data_ptr, data.size(), &position, &type, 1, MPI_INT,
             MPI_COMM_WORLD);
  // Verify the buffer matches this particle type
  assert(type == ParticleType.at(this->type()));
  // Number of materials
  unsigned nmaterials = 0;
  MPI_Unpack(data_ptr, data.size(), &position, &nmaterials, 1, MPI_UNSIGNED,
             MPI_COMM_WORLD);
  // Material id
  MPI_Unpack(data_ptr, data.size(), &position, &material_id_[0], 1,
             MPI_UNSIGNED, MPI_COMM_WORLD);

  // ID
  MPI_Unpack(data_ptr, data.size(), &position, &id_, 1, MPI_UNSIGNED_LONG_LONG,
             MPI_COMM_WORLD);
  // mass
  MPI_Unpack(data_ptr, data.size(), &position, &mass_, 1, MPI_DOUBLE,
             MPI_COMM_WORLD);
  // volume
  MPI_Unpack(data_ptr, data.size(), &position, &volume_, 1, MPI_DOUBLE,
             MPI_COMM_WORLD);

#endif
}

Important consideration: we expect all future derived particle types to write the particle type in the first few bytes of the buffer, followed by the material information, so that the mesh class can read them to initialize the particle before the rest of the deserialization.

In addition, the particle type is added to the Particle class.

  //! Type of particle
  std::string type() const override { return (Tdim == 2) ? "P2D" : "P3D"; }

This is used to identify the type of a particle and to create it when particles are transferred across MPI tasks. Moreover, we have added ParticleType and ParticleTypeName as global maps that associate an index value (int) with a type string such as "P2D". The reason is that serializing a string is awkward, since its length is not known in advance. Instead, since we will only ever have a few different particle types, it is easier to set up a map for a quick lookup.

particle.cc:
namespace mpm {
// ParticleType
std::map<std::string, int> ParticleType = {{"P2D", 0}, {"P3D", 1}};
std::map<int, std::string> ParticleTypeName = {{0, "P2D"}, {1, "P3D"}};
}  // namespace mpm
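
On the receiving rank, the leading int can be unpacked first to decide which particle type to instantiate before calling deserialize. A rough sketch (the create_particle() factory call, id, and coordinates are illustrative placeholders, not the actual mesh API):

// Peek at the particle type encoded in the first bytes of the buffer
int ptype = 0;
int position = 0;
MPI_Unpack(buffer.data(), buffer.size(), &position, &ptype, 1, MPI_INT,
           MPI_COMM_WORLD);
const std::string particle_type = mpm::ParticleTypeName.at(ptype);
// create_particle() stands in for whatever factory the mesh uses
auto particle = create_particle(particle_type, id, coordinates);
// The particle re-reads the full buffer, including the leading type, itself
particle->deserialize(buffer, materials);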

The MPI transfer_halo_particles function will be altered to send one particle at a time rather than a bulk of particles. This allows sending all the different particle types in a cell at once (sequentially), instead of iterating over each particle type separately.
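
A rough sketch of the per-particle exchange is shown below (ranks, tags, and variable names are illustrative; the actual transfer_halo_particles implementation may differ). The receiver probes for the message size before allocating the buffer:

// Sender: serialize one particle and send the raw bytes
std::vector<uint8_t> buffer = particle->serialize();
MPI_Send(buffer.data(), buffer.size(), MPI_UINT8_T, dest_rank, tag,
         MPI_COMM_WORLD);

// Receiver: probe for the incoming size, then receive and deserialize
MPI_Status status;
MPI_Probe(source_rank, tag, MPI_COMM_WORLD, &status);
int buffer_size = 0;
MPI_Get_count(&status, MPI_UINT8_T, &buffer_size);
std::vector<uint8_t> rbuffer(buffer_size);
MPI_Recv(rbuffer.data(), rbuffer.size(), MPI_UINT8_T, source_rank, tag,
         MPI_COMM_WORLD, MPI_STATUS_IGNORE);
particle->deserialize(rbuffer, materials);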

These changes would remove the need for registering MPI particle types and also get us one more step closer to removing the limit of 20 on the state_vars.

Drawbacks

No potential drawbacks have been identified.

Rationale and Alternatives

Why is this design the best in the space of possible designs?

The relative speed of serialization vs MPI_Type_create_struct is unknown; we may have to run a performance benchmark to see the difference. Using struct data types means we have to register each particle type and are limited to a fixed number of state variables.

What other designs have been considered and what is the rationale for not choosing them?

Different serialization libraries were considered: [Boost Serialization](https://www.boost.org/doc/libs/1_56_0/libs/serialization/doc/tutorial.html), [Cereal](http://uscilab.github.io/cereal/), and [bitsery](https://github.com/fraillt/bitsery). The fastest, bitsery, does not have serialization support for Eigen; we could implement a custom serializer, but it would take some time. MPI Pack/Unpack appears to be one of the fastest options.

[Benchmark figures: serialized data size and serialization time for the candidate libraries]

What is the impact of not doing this?

If not done, this will result in a clunkier interface for handling the MPI transfer of different particle types.

Prior Art

Prior art, both the good and the bad, in relation to this proposal:

https://github.com/STEllAR-GROUP/cpp-serializers

https://github.com/fraillt/bitsery#why-use-bitsery

Unresolved questions

What parts of the design do you expect to resolve through the RFC process before this gets merged?

The MPI transfer_halo_particles function is yet to be implemented. We don't foresee an issue, but it is still TBD.

Related issues

https://github.com/cb-geo/mpm/pull/680 https://github.com/cb-geo/mpm/pull/681

Changelog

bodhinandach commented 4 years ago

@kks32 It would be nice to see a performance comparison of serialize/deserialize vs the normal HDF5 POD approach, just to make sure there is no performance reduction. Also, can we check it for different numbers of MPI ranks?

kks32 commented 4 years ago

We won't have a big difference in the amount of information being sent/received. Furthermore, it would be hard to measure any significant speed difference in the MPI transfer unless we run on hundreds of nodes with millions of particles, and even then I don't think the difference would be large, since the change in data size is very small. However, as mentioned in the RFC, the time to serialize/deserialize particles as PODs vs as a vector of uint8_t has been benchmarked, and the results show that serialization with uint8_t is faster than POD + MPI_Type_create_struct. Serialization/deserialization of a POD in itself is faster; however, registering and deregistering the MPI data types takes more time than serializing/deserializing into a vector of unsigned bytes.

[Benchmark plot: serialize/deserialize timings of uint8_t Pack/Unpack vs POD + MPI_Type_create_struct]

SECTION("Performance benchmarks") {
      // Number of iterations
      unsigned niterations = 1000;

      // Serialization benchmarks
      auto serialize_start = std::chrono::steady_clock::now();
      for (unsigned i = 0; i < niterations; ++i) {
        // Serialize particle
        auto buffer = particle->serialize();
        // Deserialize particle
        std::shared_ptr<mpm::ParticleBase<Dim>> rparticle =
            std::make_shared<mpm::Particle<Dim>>(id, pcoords);

        REQUIRE_NOTHROW(rparticle->deserialize(buffer, materials));
      }
      auto serialize_end = std::chrono::steady_clock::now();

      // HDF5 serialization
      auto hdf5_start = std::chrono::steady_clock::now();
      for (unsigned i = 0; i < niterations; ++i) {
        // Serialize particle as POD
        auto hdf5 = particle->hdf5();
        // Deserialize particle with POD
        std::shared_ptr<mpm::ParticleBase<Dim>> rparticle =
            std::make_shared<mpm::Particle<Dim>>(id, pcoords);
        // Initialize MPI datatypes
        MPI_Datatype particle_type = mpm::register_mpi_particle_type(hdf5);
        REQUIRE_NOTHROW(rparticle->initialise_particle(hdf5, material));
        mpm::deregister_mpi_particle_type(particle_type);
      }
      auto hdf5_end = std::chrono::steady_clock::now();
}
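
The elapsed times can then be reported, e.g. (a minimal sketch of how the two timers might be printed):

auto serialize_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                        serialize_end - serialize_start).count();
auto hdf5_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                   hdf5_end - hdf5_start).count();
std::cout << "Pack/Unpack: " << serialize_ms << " ms, "
          << "POD + MPI_Type_create_struct: " << hdf5_ms << " ms\n";
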
codecov[bot] commented 4 years ago

Codecov Report

Merging #689 into develop will decrease coverage by 0.11%. The diff coverage is 67.70%.


@@             Coverage Diff             @@
##           develop     #689      +/-   ##
===========================================
- Coverage    96.81%   96.69%   -0.11%     
===========================================
  Files          131      130       -1     
  Lines        25811    25822      +11     
===========================================
- Hits         24987    24968      -19     
- Misses         824      854      +30     
| Impacted Files | Coverage Δ |
| --- | --- |
| include/mesh.h | 100.00% <ø> (ø) |
| include/mesh.tcc | 82.65% <0.00%> (-1.48%) |
| include/particles/particle_base.h | 100.00% <ø> (ø) |
| tests/graph_test.cc | 100.00% <ø> (ø) |
| include/particles/particle.tcc | 91.92% <82.69%> (-2.01%) |
| include/particles/particle.h | 100.00% <100.00%> (ø) |
| include/solvers/mpm_explicit.tcc | 95.16% <100.00%> (+0.08%) |
| tests/particle_serialize_deserialize_test.cc | 100.00% <100.00%> (ø) |


kks32 commented 4 years ago

The pack/unpack serialization in this PR is faster than the POD struct implementation for a 2D sliding block with 4 MPI ranks. The results are an average of 5 different runs.

| Scheme | Avg Time (ms) | SD (ms) |
| --- | --- | --- |
| Pack/Unpack | 13201 | 326 |
| POD/Struct | 13815 | 540 |

kks32 commented 4 years ago

@bodhinandach or @tianchiTJ or @jgiven100 would you be able to test the MPI scheme with a material model that has state variables (NorSand or MC)? Check with load balancing or any problem that involves migration of particles.

tianchiTJ commented 4 years ago

> @bodhinandach or @tianchiTJ or @jgiven100 would you be able to test the MPI scheme with a material model that has state variables (NorSand or MC)? Check with load balancing or any problem that involves migration of particles.

I tested it with the MC model, and the results look good.

jgiven100 commented 4 years ago

@kks32 NorSand test looks good

kks32 commented 4 years ago

Thanks @jgiven100 and @tianchiTJ for testing with state vars materials

ezrayst commented 4 years ago

@kks32, I would like to understand the data being presented.

> The pack/unpack serialization in this PR is faster than the POD struct implementation for a 2D sliding block with 4 MPI ranks. The results are an average of 5 different runs.
>
> | Scheme | Avg Time (ms) | SD (ms) |
> | --- | --- | --- |
> | Pack/Unpack | 13201 | 326 |
> | POD/Struct | 13815 | 540 |

What is SD here? Previously you showed POD has a 0.4 to 0.7 speedup compared to Pack/Unpack, so why does this result show that POD takes longer? (I think I am missing something here, sorry.)

kks32 commented 4 years ago

> @kks32, I would like to understand the data being presented.
>
> > The pack/unpack serialization in this PR is faster than the POD struct implementation for a 2D sliding block with 4 MPI ranks. The results are an average of 5 different runs. Pack/Unpack: 13201 ms (SD 326 ms); POD/Struct: 13815 ms (SD 540 ms).
>
> What is SD here? Previously you showed POD has a 0.4 to 0.7 speedup compared to Pack/Unpack, so why does this result show that POD takes longer? (I think I am missing something here, sorry.)

SD is the standard deviation over the five runs. POD alone is insufficient: you also need to register the data with MPI_Type_create_struct, which adds additional run-time. Compared to our current implementation on develop, Pack/Unpack is slightly faster, and it is the best way to handle different particle types.