Open PhilMiller opened 3 years ago
Serialization to a buffer inherently implies the footprint of the active data being doubled for the serialization buffer. Doing better than that requires some form of direct transfer (e.g. RDMA, GPUDirect MPI, etc).
Any approach based on mirror view construction implies a footprint up to triple the active data.
An incremental / streaming bounce buffer can offer a 2x+delta footprint, where delta is constant, but may still have to be substantial to obtain good performance.
Since we effectively assume that the underlying type is byte-copyable through use of `deep_copy`, we could arrange to construct a `View<T***, HostSpace, MemoryUnmanaged>` of the right size and full contiguity pointing directly at the desired spot in the serialization buffer, and `deep_copy` directly to/from that. That may be ideal, if there's nothing stopping us from making it work.
Jonathan mentioned an increment-offset-and-get-pointer method on the serializer that should be good for the in-memory buffer use case. We'll pass it the size of the `View` contents to be serialized, and subtract that off to form the pointer that will be the base of the unmanaged host view which will be the target of the `deep_copy`.
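The pointer arithmetic described above can be sketched in plain C++. This is only an illustration under stated assumptions: `BufferSerializer`, `advance_and_get`, and `serialize_contiguous` are hypothetical names, not the actual serializer API, and `std::memcpy` stands in for the `deep_copy` into an unmanaged host view so that the sketch has no Kokkos dependency.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <type_traits>
#include <vector>

// Hypothetical in-memory serializer; names are illustrative only.
class BufferSerializer {
public:
  explicit BufferSerializer(std::size_t capacity)
      : buffer_(capacity), offset_(0) {}

  // Increment the write offset by `bytes` and return a pointer to the
  // position just past the newly reserved region.
  unsigned char* advance_and_get(std::size_t bytes) {
    if (bytes > buffer_.size() - offset_) {
      throw std::length_error("serialization buffer overflow");
    }
    offset_ += bytes;
    return buffer_.data() + offset_;
  }

  std::size_t offset() const { return offset_; }
  const unsigned char* data() const { return buffer_.data(); }

private:
  std::vector<unsigned char> buffer_;
  std::size_t offset_;
};

// Reserve space for `count` elements and copy them straight into the buffer.
template <typename T>
void serialize_contiguous(BufferSerializer& s, const T* src, std::size_t count) {
  static_assert(std::is_trivially_copyable<T>::value,
                "a byte-wise copy requires a trivially copyable type");
  const std::size_t bytes = count * sizeof(T);
  // Subtract the size back off to recover the base of the reserved region.
  unsigned char* base = s.advance_and_get(bytes) - bytes;
  // Stand-in: with Kokkos, `base` would back an unmanaged host view that is
  // the target of deep_copy, avoiding an intermediate mirror allocation.
  std::memcpy(base, src, bytes);
}
```

Because the data lands directly in the serialization buffer, the only extra footprint in this sketch is the buffer itself.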
Application codes are moving away from allocating data in memory that's accessible to both host and device code to avoid performance overhead and pitfalls. We still need to be able to serialize instances of device-space containers for checkpoint/restart and messaging/communication.
This will be a concern for any view that doesn't satisfy this predicate:
The most expedient implementation would be to use
auto host_view = Kokkos::create_mirror_view(view_to_serialize);
with appropriate copies (or `create_mirror_view_and_copy`). The problem with this is that it may allocate a lot of memory, and move a lot of data synchronously all at once if `view_to_serialize` is large. We may need to do that as a stop-gap measure anyway, to guarantee functionality.
A more thorough implementation would create and copy through limited-size bounce buffers in host memory to limit added memory footprint. To ensure good performance, it would use a streaming approach with an `exec_space` argument to `Kokkos::deep_copy`, so that parts of the view's contents can be serialized while other parts are being copied.
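A minimal sketch of the bounce-buffer idea, again in plain C++ with no Kokkos dependency. The function name `copy_through_bounce_buffer` and its parameters are hypothetical. The first `std::memcpy` stands in for a device-to-host copy of one chunk, and the second for handing that chunk to the serializer. This version is synchronous; a streaming variant would presumably alternate between two such buffers and issue the copies asynchronously on an execution space instance, which is not shown here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <stdexcept>
#include <vector>

// Move `total` bytes from `src` to `dst` through a fixed-size staging buffer,
// so the added host footprint is at most `chunk_bytes` rather than `total`.
// Returns the number of chunks transferred.
inline std::size_t copy_through_bounce_buffer(const unsigned char* src,
                                              unsigned char* dst,
                                              std::size_t total,
                                              std::size_t chunk_bytes) {
  if (chunk_bytes == 0) {
    throw std::invalid_argument("chunk_bytes must be positive");
  }
  // The staging buffer never needs to be larger than the data itself.
  std::vector<unsigned char> bounce(std::min(total, chunk_bytes));
  std::size_t chunks = 0;
  for (std::size_t done = 0; done < total; done += bounce.size()) {
    const std::size_t n = std::min(bounce.size(), total - done);
    // Stand-in for copying one chunk from device memory into host memory.
    std::memcpy(bounce.data(), src + done, n);
    // Stand-in for the serializer consuming the staged chunk.
    std::memcpy(dst + done, bounce.data(), n);
    ++chunks;
  }
  return chunks;
}
```

The chunk size is the "delta" in the footprint: a larger chunk amortizes per-transfer overhead better but adds more host memory.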