NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[FEA] GpuOutOfCoreSortIterator supports split-and-retry for GPU OOM #11603

Open firestarman opened 1 month ago

firestarman commented 1 month ago

We hit the OOM error below in some customer queries, which led to Spark task retries.

Caused by: com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM: GPU OutOfMemory: could not split inputs and retry
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:458)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:588)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:517)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.drainSingleWithVerification(RmmRapidsRetryIterator.scala:291)
at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.withRetryNoSplit(RmmRapidsRetryIterator.scala:132)
at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.mergeSortEnoughToOutput(GpuSortExec.scala:537)
at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$next$5(GpuSortExec.scala:610)

The function mergeSortEnoughToOutput currently supports retry on OOM, but without splitting. We can implement split-and-retry there to improve its stability.

In mergeSortEnoughToOutput, we first collect small sorted batches until their total size reaches the target size, then merge-sort them into a single sorted table, and finally either return that table or split it for the next round. The OOM can happen right when splitting the big sorted table.

The idea is simple. Instead of splitting the big sorted table directly (which will likely fail again, since the split needs roughly the same amount of GPU memory as the table itself), we retry the whole process with a smaller target size (target_size / 2): collect the small sorted batches up to the smaller target size, do the merge-sort, and split the sorted result. In the retry we are handling a much smaller batch, so it requires less GPU memory.

We can do a similar split-and-retry in concatOutput.
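
For illustration, here is a minimal Scala sketch of the proposed shrink-and-retry loop. The SortedTable type and the helpers collectSortedBatches, mergeSort, and splitForOutput are hypothetical stand-ins for the real logic inside GpuOutOfCoreSortIterator.mergeSortEnoughToOutput, and the exception class is a local stand-in for com.nvidia.spark.rapids.jni.GpuSplitAndRetryOOM; this is a sketch of the idea, not the actual change.

object SplitRetrySketch {
  final case class SortedTable(sizeInBytes: Long)                      // stand-in for a sorted GPU table
  class GpuSplitAndRetryOOM(msg: String) extends RuntimeException(msg) // stand-in for the jni exception

  // Hypothetical stand-ins for the existing steps in mergeSortEnoughToOutput.
  def collectSortedBatches(targetSize: Long): Seq[SortedTable] = Seq(SortedTable(targetSize))
  def mergeSort(batches: Seq[SortedTable]): SortedTable = SortedTable(batches.map(_.sizeInBytes).sum)
  def splitForOutput(merged: SortedTable): Seq[SortedTable] = Seq(merged)

  /** Retry the whole collect/merge/split cycle, halving the target size on each split-and-retry OOM. */
  def mergeSortWithShrinkingTarget(initialTargetSize: Long, minTargetSize: Long): Seq[SortedTable] = {
    var targetSize = initialTargetSize
    while (true) {
      try {
        val batches = collectSortedBatches(targetSize) // total size bounded by targetSize
        val merged = mergeSort(batches)                // one big sorted table
        return splitForOutput(merged)                  // the OOM observed in the field happens here
      } catch {
        case _: GpuSplitAndRetryOOM if targetSize / 2 >= minTargetSize =>
          targetSize /= 2                              // retry the whole cycle on a smaller batch
      }
    }
    throw new IllegalStateException("unreachable")
  }
}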

firestarman commented 1 month ago

Similar to https://github.com/NVIDIA/spark-rapids/issues/11584

revans2 commented 1 month ago

I don't think what you are asking is reasonable. Nor do I think it is going to fix your issue. There is some other problem that is causing memory issues that we need to track down.

When doing an out of core sort the algorithm will:

  1. Read in a batch of data.
  2. Sort it.
  3. Split it into smaller batches (approximately 1/16th the size of a target batch).
  4. Save the start and end rows so we can sort these batches.

Once all of the input has been read in, the batches are ordered by their starting rows and one target batch's worth of data is merge sorted. That data is then split into a few different sections: a section that is fully sorted and can be released, and a section that is not fully sorted yet. The second section is then broken back down into smaller batches and the process repeats (roughly as sketched below).
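
The following is a rough, self-contained Scala sketch of that loop; it is not the real GpuOutOfCoreSortIterator code. Batches are modeled as sorted Vector[Int]s so the shape of the algorithm is visible without any GPU types, and the cutoff logic is simplified relative to the real upper/lower bound calculations.

object OutOfCoreSortSketch {
  type Batch = Vector[Int]

  /** Phase 1: read, sort, and split each input batch into small pending pieces. */
  def buildPending(input: Iterator[Batch], pieceRows: Int): Vector[Batch] =
    input.flatMap(b => b.sorted.grouped(pieceRows)).toVector

  /** Phase 2: repeatedly merge-sort about one target batch of pending data, emit the
   *  fully sorted prefix, and push the rest back into the pending pool as small pieces. */
  def sortAll(input: Iterator[Batch], pieceRows: Int, targetRows: Int): Vector[Int] = {
    var pending = buildPending(input, pieceRows).sortBy(_.head) // order pieces by starting row
    val out = Vector.newBuilder[Int]
    while (pending.nonEmpty) {
      val (picked, rest) = pending.splitAt(math.max(1, targetRows / pieceRows))
      val merged = picked.flatten.sorted                         // stands in for the GPU merge sort
      // Rows at or below the smallest starting row of the remaining pieces are fully sorted.
      val cutoff = rest.headOption.map(_.head).getOrElse(Int.MaxValue)
      val (done, notDone) = merged.partition(_ <= cutoff)
      out ++= done
      pending =
        if (notDone.isEmpty) rest
        else (notDone.grouped(pieceRows).toVector ++ rest).sortBy(_.head)
    }
    out.result()
  }
}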

The retry that is throwing an exception saying it is totally out of memory and cannot do anything more is in the part that runs after the merge sort has finished, where we are trying to split the data into smaller batches and release the output data.

First off, we are already splitting the data, so I don't see how we save memory by splitting that data further just so we can keep track of it and try to split it again.

Second, the amount of memory we should be working with should be very close to one target batch size. The only way that is not going to be the case is if there is extreme skew in the size of the rows in the data and we get that split wrong, or if there is something else going on that is really bad.

Can we please treat this situation as a bug instead of a feature? I would really love to see what size the batches are, and what else is happening on other threads. Are they getting rolled back or what?

firestarman commented 1 month ago

I don't think what you are asking is reasonable. Nor do I think it is going to fix your issue. There is some other problem that is causing memory issues that we need to track down.

Thanks for the info, very helpful. I updated the description to better explain what I am asking for here. I have actually made a local change already and am preparing the formal PR.

In the function mergeSortEnoughToOutput, we first collect the small sorted batches (step 3 in your comment above) until their total size reaches the target size, then merge-sort them into a single sorted table, and finally either return that table or split it for the next round. The OOM happened right when splitting the big sorted table.

My idea is simple. Instead of splitting the big sorted table directly (which will likely fail again, since the split needs roughly the same amount of GPU memory as the table itself), we retry the whole process with a smaller target size (target_size / 2): collect the small sorted batches up to the smaller target size, do the merge-sort, and split the sorted result. In the retry we are handling a much smaller batch, so it requires less GPU memory.

We can also do a similar split-and-retry in concatOutput. This is for stability, not performance.

revans2 commented 1 month ago

I get what you want to do. What I am saying is that I think it is hiding other problems. Let's take an example here.

I have 5 GiB of data to sort and a target batch size of 1 GiB. I read in, sort, and split the 5 GiB and end up with 40 smaller batches to try and merge sort. I pick the 8 batches with the lowest rows in them. Those 8 batches should add up to almost exactly 1 GiB of data. So I merge sort them, which in the best case requires 1 GiB of input working memory, 1 GiB of output working memory, and some amount of memory for the gather maps or something similar, so about 2 GiB in total. After the merge sort we release the original input (at least enough that it can spill again) and we are back to about 1 GiB of memory in use. Then I do some upper/lower bound calculations to figure out where to split the data, and we split it based on those bounds. So again I need 1 GiB for the input, 1 GiB for the output, and a small amount of memory to say where the cutoffs are.

The maximum amount of memory being used here is a little over 2x the target batch size, and it should not be more than the memory needed to merge sort the inputs. If that is all true, then why are we running out of memory in the split?
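
As a sanity check on those numbers, here is a tiny back-of-the-envelope Scala snippet. The 1 GiB target and the ~5% gather-map allowance are assumptions taken from the example above, not plugin measurements.

object SortMemoryEstimate {
  val GiB: Long = 1L << 30
  val targetBatch: Long = 1 * GiB

  // Merge sort: input pieces + merged output + gather maps (rough 5% guess).
  val mergeSortPeak: Long = targetBatch + targetBatch + targetBatch / 20
  // Split: merged input table + output slices + tiny cutoff metadata.
  val splitPeak: Long = targetBatch + targetBatch

  def main(args: Array[String]): Unit = {
    println(f"merge sort peak ~ ${mergeSortPeak.toDouble / GiB}%.2f GiB") // ~2.05 GiB
    println(f"split peak      ~ ${splitPeak.toDouble / GiB}%.2f GiB")     // ~2.00 GiB
  }
}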

There are a few possibilities and I want to understand what is happening.

  1. Another thread came in and grabbed more memory on the GPU in between the merge and the split operations. But if that is true, the spill framework should have paused one of the threads and had it do a retry (not a split and retry). Split and retry should only happen if there is a single thread running and it is totally out of memory (see the sketch after this list). So if that is the case, then we have a bug in the spill framework somewhere, or some kind of memory leak.
  2. The input batch is actually much larger than the 1 GiB we think it is, and for some reason the merge sort is more memory efficient than the split is, so we happen to fail on the split. But we pull in batches without going over the target batch size, except in the case where we have to have at least two batches to merge sort. Even if that is happening, on a 16 GiB GPU we should still be able to get close to 7 GiB before we risk running out of GPU memory, and I find it really hard to believe that we would have enough input skew to cause this to happen.
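
To make the distinction in point 1 concrete, here is an illustrative Scala model of the retry contract. It is not the actual RmmRapidsRetryIterator code: the Attempt type, the processWithRetry helper, and the local exception classes are hypothetical stand-ins (the real GpuRetryOOM / GpuSplitAndRetryOOM live in spark-rapids-jni).

object RetryContractSketch {
  // Stand-ins for the real exception types in com.nvidia.spark.rapids.jni.
  class GpuRetryOOM(msg: String) extends RuntimeException(msg)
  class GpuSplitAndRetryOOM(msg: String) extends RuntimeException(msg)

  final case class Attempt(bytes: Long)

  /** Run `work` over an attempt, modelling the two escalation levels described above. */
  def processWithRetry(first: Attempt, work: Attempt => Unit): Unit = {
    var pending = List(first)
    while (pending.nonEmpty) {
      val current = pending.head
      try {
        work(current)          // e.g. the split that follows the merge sort
        pending = pending.tail
      } catch {
        case _: GpuRetryOOM =>
          // Another thread was paused or rolled back to free memory; retry the
          // same input unchanged.
          ()
        case _: GpuSplitAndRetryOOM =>
          // Only expected when this is the last thread standing and it still cannot
          // allocate: split the input in half and retry the halves. If a single
          // target batch trips this, something else is holding GPU memory.
          val half = current.bytes / 2
          pending = Attempt(half) :: Attempt(current.bytes - half) :: pending.tail
      }
    }
  }
}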

I cannot think of any other situation where we could run out of memory, and arguably all of them are the result of a bug in our code. So I want to understand what is actually happening before we try and work around the issue.

firestarman commented 1 month ago

Got it. @winningsix We need to collect more information when we hit the GPU OOM to understand what is happening.