rongou opened this issue 3 years ago
I worked on a prototype for speeding up spilling to pageable memory (which is a real issue for us) via both a GPU and a pinned bounce buffer. I need to run some verification and can post numbers here after that.
The approach is kind of expensive and makes some assumptions, so here is the high-level idea: a memcpy to pageable memory is synchronous. So the pageable buffer could either be one large buffer, which means we hold references to it and can't free it until all the host buffers referencing it have been spilled to disk (which may be unacceptable), or individual pageable buffers. I had implemented the large pageable buffer, but want to try the individual-buffer approach.
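To make the staging pattern concrete, here is a minimal host-side sketch of copying through a fixed-size bounce buffer in chunks. All names are hypothetical and plain bytearrays stand in for the device, pinned, and pageable allocations; on a real GPU the first stage would be an asynchronous device-to-pinned copy and the second stage the synchronous pinned-to-pageable memcpy discussed above.

```python
BOUNCE_BUFFER_SIZE = 4  # deliberately tiny for illustration

def spill_via_bounce_buffer(device_buf: bytes,
                            bounce_size: int = BOUNCE_BUFFER_SIZE) -> bytearray:
    """Copy device_buf into pageable host memory through a fixed-size bounce buffer."""
    pageable = bytearray(len(device_buf))   # stands in for a pageable allocation
    bounce = bytearray(bounce_size)         # stands in for a pinned allocation
    offset = 0
    while offset < len(device_buf):
        n = min(bounce_size, len(device_buf) - offset)
        # Stage 1: device -> pinned (would be an async copy on a real GPU)
        bounce[:n] = device_buf[offset:offset + n]
        # Stage 2: pinned -> pageable (a plain, synchronous memcpy)
        pageable[offset:offset + n] = bounce[:n]
        offset += n
    return pageable

spilled = spill_via_bounce_buffer(b"0123456789")
assert bytes(spilled) == b"0123456789"
```

The trade-off the comment describes shows up in how `pageable` is allocated: one large buffer amortizes allocation but pins its lifetime to every buffer staged into it, while per-buffer allocations can be freed independently.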
Is your feature request related to a problem? Please describe.
While working on GDS spilling, I noticed that having a large number of shuffle partitions (or even the default 200) may result in many small shuffle buffers. Spilling these to disk via GDS becomes very slow, while coalescing them into larger buffers with fewer writes greatly improves performance. Spilling to host memory seems to be less affected by the number of shuffle partitions (and thus shuffle buffer size), but it may be worth considering coalescing when spilling to host memory as well.
Describe the solution you'd like
Investigate coalescing small shuffle buffers when spilling them to host memory/disk.
Describe alternatives you've considered
We may also want to consider coalescing at the spilling layer so the spillable buffers are always combined into larger buffers, regardless of the spilling destination.
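As a sketch of what coalescing at the spill layer could look like (all names here are hypothetical, not the plugin's API): pack the small buffers back-to-back into one large buffer and keep an offset table, so the batch goes out in a single large write and each original buffer can still be recovered individually when unspilled.

```python
from typing import List, Tuple

def coalesce(buffers: List[bytes]) -> Tuple[bytearray, List[Tuple[int, int]]]:
    """Pack small buffers contiguously; return the blob and an (offset, length) table."""
    big = bytearray(sum(len(b) for b in buffers))
    table = []
    offset = 0
    for b in buffers:
        big[offset:offset + len(b)] = b
        table.append((offset, len(b)))
        offset += len(b)
    return big, table

def unspill(big: bytearray, entry: Tuple[int, int]) -> bytes:
    """Recover one original buffer from the coalesced blob."""
    offset, length = entry
    return bytes(big[offset:offset + length])

parts = [b"aa", b"bbb", b"c"]
big, table = coalesce(parts)
assert all(unspill(big, table[i]) == parts[i] for i in range(len(parts)))
```

The point of the offset table is that the spill destination (host memory, disk, or GDS) only ever sees one large buffer, while the shuffle layer keeps enough metadata to address each partition's data inside it.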
Additional context
GDS spilling PR: https://github.com/NVIDIA/spark-rapids/pull/2295
@jlowe @abellina @revans2