rongou opened this issue 3 years ago
I worked on a prototype for speeding up spilling to pageable memory (which is a real issue for us) via both a GPU and a pinned bounce buffer. I need to run some verification and can post numbers here after that.
The approach is kind of expensive and makes some assumptions, so here is the high-level idea: a memcpy to pageable memory is synchronous. So the pageable buffer could either be one large buffer, which means we hold references to it and can't free it until all the host buffers referencing it have been spilled to disk (which may be unacceptable), or individual pageable buffers. I had implemented the large pageable buffer, but want to try the individual-buffer approach.
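To make the staging pattern concrete, here is a minimal host-side sketch of copying through a fixed-size bounce buffer in chunks. All names are hypothetical and plain bytearrays stand in for the device, pinned, and pageable allocations; on a real GPU the first stage would be an asynchronous device-to-pinned copy and the second stage the synchronous pinned-to-pageable memcpy discussed above.

```python
BOUNCE_BUFFER_SIZE = 4  # deliberately tiny for illustration

def spill_via_bounce_buffer(device_buf: bytes,
                            bounce_size: int = BOUNCE_BUFFER_SIZE) -> bytearray:
    """Copy device_buf into pageable host memory through a fixed-size bounce buffer."""
    pageable = bytearray(len(device_buf))   # stands in for a pageable allocation
    bounce = bytearray(bounce_size)         # stands in for a pinned allocation
    offset = 0
    while offset < len(device_buf):
        n = min(bounce_size, len(device_buf) - offset)
        # Stage 1: device -> pinned (would be an async copy on a real GPU)
        bounce[:n] = device_buf[offset:offset + n]
        # Stage 2: pinned -> pageable (a plain, synchronous memcpy)
        pageable[offset:offset + n] = bounce[:n]
        offset += n
    return pageable

spilled = spill_via_bounce_buffer(b"0123456789")
assert bytes(spilled) == b"0123456789"
```

The trade-off the comment describes shows up in how `pageable` is allocated: one large buffer amortizes allocation but pins its lifetime to every buffer staged into it, while per-buffer allocations can be freed independently.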
Is your feature request related to a problem? Please describe.
While working on GDS spilling, I noticed that having a large number of shuffle partitions (or even the default 200) may result in many small shuffle buffers. Spilling these to disk via GDS becomes very slow, while coalescing them into larger buffers with fewer writes greatly improves performance. Spilling to host memory seems to be less affected by the number of shuffle partitions (and thus shuffle buffer size), but it may be worth considering coalescing when spilling to host memory as well.
Describe the solution you'd like
Investigate coalescing small shuffle buffers when spilling them to host memory/disk.
Describe alternatives you've considered
We may also want to consider coalescing at the spilling layer so the spillable buffers are always combined into larger buffers, regardless of the spilling destination.
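As a sketch of what coalescing at the spill layer could look like (all names here are hypothetical, not the plugin's API): pack the small buffers back-to-back into one large buffer and keep an offset table, so the batch goes out in a single large write and each original buffer can still be recovered individually when unspilled.

```python
from typing import List, Tuple

def coalesce(buffers: List[bytes]) -> Tuple[bytearray, List[Tuple[int, int]]]:
    """Pack small buffers contiguously; return the blob and an (offset, length) table."""
    big = bytearray(sum(len(b) for b in buffers))
    table = []
    offset = 0
    for b in buffers:
        big[offset:offset + len(b)] = b
        table.append((offset, len(b)))
        offset += len(b)
    return big, table

def unspill(big: bytearray, entry: Tuple[int, int]) -> bytes:
    """Recover one original buffer from the coalesced blob."""
    offset, length = entry
    return bytes(big[offset:offset + length])

parts = [b"aa", b"bbb", b"c"]
big, table = coalesce(parts)
assert all(unspill(big, table[i]) == parts[i] for i in range(len(parts)))
```

The point of the offset table is that the spill destination (host memory, disk, or GDS) only ever sees one large buffer, while the shuffle layer keeps enough metadata to address each partition's data inside it.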
Additional context
GDS spilling PR: https://github.com/NVIDIA/spark-rapids/pull/2295
@jlowe @abellina @revans2