DARMA-tasking / magistrate

DARMA/magistrate => Serialization and checkpointing library

Serialization for Kokkos containers residing in host-inaccessible memory space with limited cost #196

Open PhilMiller opened 3 years ago

PhilMiller commented 3 years ago

Application codes are moving away from allocating data in memory that's accessible to both host and device code to avoid performance overhead and pitfalls. We still need to be able to serialize instances of device-space containers for checkpoint/restart and messaging/communication.

This will be a concern for any view that doesn't satisfy this predicate:

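#include <Kokkos_Core.hpp>

// True when host code can directly access the view's memory space.
template <typename ViewType>
constexpr bool isHostAccessible(const ViewType&) {
  return Kokkos::SpaceAccessibility<Kokkos::HostSpace, typename ViewType::memory_space>::accessible;
}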

The most expedient implementation would be to use auto host_view = Kokkos::create_mirror_view(view_to_serialize); with appropriate copies (or create_mirror_view_and_copy). The problem with this is that it may allocate a lot of memory, and move a lot of data synchronously all at once if view_to_serialize is large. We may need to do that as a stop-gap measure anyway, to guarantee functionality.

A more thorough implementation would create and copy through limited-size bounce buffers in host memory to limit added memory footprint. To ensure good performance, it would use a streaming approach with an exec_space argument to Kokkos::deep_copy, so that parts of the view's contents can be serialized while other parts are being copied.
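The chunking logic behind the bounce-buffer idea can be sketched roughly as follows. This is a minimal illustration in plain C++, not magistrate or Kokkos code: `copyThroughBounceBuffer` is a hypothetical name, and `std::memcpy` stands in for the device-to-host `Kokkos::deep_copy`. It also runs the two steps back to back, whereas a streaming version would issue the copy on an execution space instance and overlap it with serializing the previous chunk.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: moves `total_bytes` from `src` to `dst` through a
// staging buffer of at most `bounce_bytes`, so the extra host footprint is
// bounded by the staging buffer rather than by the size of the data.
// Returns the number of chunks that were copied.
inline std::size_t copyThroughBounceBuffer(const char* src, char* dst,
                                           std::size_t total_bytes,
                                           std::size_t bounce_bytes) {
  assert(bounce_bytes > 0);
  std::vector<char> bounce(std::min(bounce_bytes, total_bytes));
  std::size_t chunks = 0;
  for (std::size_t offset = 0; offset < total_bytes; offset += bounce.size()) {
    const std::size_t n = std::min(bounce.size(), total_bytes - offset);
    // Stand-in for deep_copy from device memory into the bounce buffer.
    std::memcpy(bounce.data(), src + offset, n);
    // Stand-in for the serializer consuming the bounce buffer contents.
    std::memcpy(dst + offset, bounce.data(), n);
    ++chunks;
  }
  return chunks;
}
```

With a single staging buffer the copy and the serialization alternate. Overlapping them would need at least two buffers, which is one reason the added footprint may have to be substantial.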

PhilMiller commented 3 years ago

Serializing to a buffer inherently doubles the footprint of the active data, since the serialization buffer holds a second copy. Doing better than that requires some form of direct transfer (e.g. RDMA or GPUDirect MPI).

Any approach based on mirror view construction implies a footprint up to triple the active data.

An incremental / streaming bounce buffer can offer 2x+delta footprint, where delta is constant, but may still have to be substantial to obtain good performance.
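As a rough accounting of the three footprints above, write N for the size of the active data and δ for the size of the bounce buffer (both symbols are introduced here for illustration). The mirror case assumes a full host mirror that coexists with the serialization buffer:

```latex
\begin{align*}
\text{serialize to a buffer:} \quad & N_{\text{data}} + N_{\text{buffer}} = 2N \\
\text{via a mirror view:} \quad & N_{\text{data}} + N_{\text{mirror}} + N_{\text{buffer}} = 3N \\
\text{via a bounce buffer:} \quad & N_{\text{data}} + N_{\text{buffer}} + \delta = 2N + \delta
\end{align*}
```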

Since we effectively assume that the underlying type is byte-copyable through use of deep_copy, we could arrange to construct a View<T***, HostSpace, MemoryUnmanaged> of the right size and full contiguity pointing directly at the desired spot in the serialization buffer, and deep_copy directly to/from that. That may be ideal, if there's nothing stopping us from making it work.

PhilMiller commented 3 years ago

Jonathan mentioned an increment-offset-and-get-pointer method on the serializer that should be good for the in-memory buffer use case. We'll pass it the size of the View contents to be serialized, then subtract that size back off to form the base pointer of the unmanaged host view that will be the target of the deep_copy.
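The pointer arithmetic described here might look something like the sketch below. Everything in it is hypothetical: `BufferSerializer`, `advanceAndGetPointer`, and `serializeContiguous` are illustrative names rather than magistrate's actual API, and the sketch assumes the method returns the position after the increment. `std::memcpy` stands in for the `deep_copy` into an unmanaged host view built on the base pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical stand-in for an in-memory serializer (not magistrate's API).
class BufferSerializer {
public:
  explicit BufferSerializer(std::size_t capacity) : buffer_(capacity) {}

  // Advance the write offset by `bytes` and return the new position,
  // i.e. one past the end of the region just reserved.
  char* advanceAndGetPointer(std::size_t bytes) {
    assert(offset_ + bytes <= buffer_.size());
    offset_ += bytes;
    return buffer_.data() + offset_;
  }

  std::size_t offset() const { return offset_; }
  const char* data() const { return buffer_.data(); }

private:
  std::vector<char> buffer_;
  std::size_t offset_ = 0;
};

// Reserve room for `count` elements of T and copy them straight into the
// serialization buffer, with no intermediate host copy.
template <typename T>
char* serializeContiguous(BufferSerializer& s, const T* src, std::size_t count) {
  const std::size_t bytes = count * sizeof(T);
  char* end = s.advanceAndGetPointer(bytes);
  char* base = end - bytes;  // subtract the size back off to get the base
  if (bytes > 0) {
    // In the Kokkos version, `base` would back an unmanaged host view and
    // this copy would be a deep_copy from the device view.
    std::memcpy(base, src, bytes);
  }
  return base;
}
```

Because the data lands directly in the serialization buffer, the footprint stays at the data plus the buffer, with no mirror or bounce buffer on top.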