NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] the setting of gpu memory allocation fraction is counter-intuitive #1533

Closed wjxiz1992 closed 3 years ago

wjxiz1992 commented 3 years ago

**Describe the bug**
I was running an LHA query on my PC in standalone mode. With spark.rapids.memory.gpu.allocFraction=0.9, the query fails with a cudaErrorMemoryAllocation: out of memory error:

21/01/15 18:40:27 WARN TaskSetManager: Lost task 8.0 in stage 3.0 (TID 11, 10.19.183.124, executor 0): ai.rapids.cudf.CudfException: parallel_for failed: cudaErrorMemoryAllocation: out of memory
    at ai.rapids.cudf.ColumnView.matchesRe(Native Method)
    at ai.rapids.cudf.ColumnView.matchesRe(ColumnView.java:2198)
    at com.nvidia.spark.rapids.GpuCast.$anonfun$doColumnar$39(GpuCast.scala:344)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuUnaryExpression.withResource(GpuExpressions.scala:109)
    at com.nvidia.spark.rapids.GpuCast.doColumnar(GpuCast.scala:326)
    at com.nvidia.spark.rapids.GpuUnaryExpression.doItColumnar(GpuExpressions.scala:115)
    at com.nvidia.spark.rapids.GpuUnaryExpression.columnarEval(GpuExpressions.scala:129)
    at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
    at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval(GpuExpressions.scala:162)
    at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval$(GpuExpressions.scala:158)
    at com.nvidia.spark.rapids.CudfBinaryOperator.columnarEval(GpuExpressions.scala:236)
    at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
    at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval(GpuExpressions.scala:162)
    at com.nvidia.spark.rapids.GpuBinaryExpression.columnarEval$(GpuExpressions.scala:158)
    at com.nvidia.spark.rapids.CudfBinaryOperator.columnarEval(GpuExpressions.scala:236)
    at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
    at com.nvidia.spark.rapids.GpuConditionalExpression.computePredicate(conditionalExpressions.scala:31)
    at com.nvidia.spark.rapids.GpuConditionalExpression.computeIfElse(conditionalExpressions.scala:82)
    at com.nvidia.spark.rapids.GpuConditionalExpression.$anonfun$computeIfElse$5(conditionalExpressions.scala:115)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuConditionalExpression.withResource(conditionalExpressions.scala:27)
    at com.nvidia.spark.rapids.GpuConditionalExpression.computeIfElse(conditionalExpressions.scala:114)
    at com.nvidia.spark.rapids.GpuIf.columnarEval(conditionalExpressions.scala:154)
    at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
    at com.nvidia.spark.rapids.GpuAlias.columnarEval(namedExpressions.scala:93)
    at com.nvidia.spark.rapids.RapidsPluginImplicits$ReallyAGpuExpression.columnarEval(implicits.scala:35)
    at com.nvidia.spark.rapids.GpuHashAggregateExec.$anonfun$processIncomingBatch$1(aggregate.scala:596)
...

But with spark.rapids.memory.gpu.allocFraction=0.8, the error goes away.

**Steps/Code to reproduce bug**
It's an LHA query, so I only describe the main steps here:

  1. Put the LHA dummy dataset on my PC.
  2. Get the query content: query-string.
  3. Run spark.sql(query-string).

**Expected behavior**
A query that passes with a lower memory allocation fraction should also run successfully with a higher one.

**Environment details (please complete the following information)**

    # Set path for source table data
    T2_PATH=/home/allxu/tmp/q4/t2
    T3_PATH=/home/allxu/tmp/q4/t3
    T4_PATH=/home/allxu/tmp/q4/t4

    # Set path for your query output.
    # The query result (dataset) will be saved to the 'output' folder under your current working directory.
    OUT=output

    # Set your Spark master IP:Port
    MASTER=*****

    # Remove existing query result data first, or you cannot run this query successfully.
    rm -rf $OUT

    $SPARK_HOME/bin/spark-submit --master $MASTER \
        --driver-memory ${DRIVE_MEMORY}G \
        --executor-memory ${EXECUTOR_MEMORY}G \
        --executor-cores $EXECUTOR_CORES \
        --num-executors $NUM_EXECUTOR \
        --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
        --conf spark.driver.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
        --conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true \
        --conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true -Dai.rapids.cudf.nvtx.enabled=true' \
        --conf spark.task.cpus=1 \
        --conf spark.rapids.sql.explain=ALL \
        --conf spark.locality.wait=0 \
        --conf spark.yarn.maxAppAttempts=1 \
        --conf spark.sql.shuffle.partitions=24 \
        --conf spark.sql.files.maxPartitionBytes=128m \
        --conf spark.sql.warehouse.dir=$OUT \
        --conf spark.plugins=com.nvidia.spark.SQLPlugin \
        --conf spark.rapids.sql.incompatibleOps.enabled=true \
        --conf spark.rapids.sql.variableFloatAgg.enabled=true \
        --conf spark.rapids.sql.concurrentGpuTasks=2 \
        --conf spark.sql.adaptive.enabled=true \
        --conf spark.rapids.memory.gpu.pool=ARENA \
        --conf spark.rapids.memory.pinnedPool.size=${PINNEDMEMORY}G \
        --conf spark.rapids.memory.gpu.allocFraction=0.9 \
        --conf spark.rapids.sql.castStringToInteger.enabled=true \
        --conf spark.rapids.memory.pinnedPool.size=5g \
        --conf spark.task.resource.gpu.amount=0.08 \
        --conf spark.executor.resource.gpu.amount=1 \
        --conf spark.rapids.shuffle.ucx.enabled=false \
        --conf spark.rapids.sql.parquet.debug.dumpPrefix=$PWD/dump/dump \
        --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
        --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh \
        --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR} \
        ../main.py \
        --sqlFile sql.txt \
        --format parquet \
        --appName "GPU Q4" \
        --tableFile sql4_table_2::$T2_PATH --tableFile sql4_table_3::$T3_PATH --tableFile sql4_table_4::$T4_PATH



**Additional context**
I know that the memory allocation fraction only sets the initial size of the RMM pool. My guess was that the failure happens because RMM still tries to allocate more during the query run and there is no spare memory left on my graphics card, whereas at 0.8 there is enough spare memory to allocate. Let me call that additionally allocated memory "second-mem".
So my question is: since 0.8 + "second-mem" fits within my card's memory, why does 0.9 + "second-mem" cause OOM?

One capture during the 0.9 setting run:

![image](https://user-images.githubusercontent.com/20476954/104717022-ad8cd980-5763-11eb-9196-6cee6c8b6eaa.png)

during 0.8:
![image](https://user-images.githubusercontent.com/20476954/104717286-03fa1800-5764-11eb-9f42-4768c10d543b.png)

tgravescs commented 3 years ago

The issue here is that you have more than just RAPIDS using the GPU; you can see Xorg and gnome-shell also using it (normal desktop graphics processes). So with spark.rapids.memory.gpu.allocFraction=0.9 you are over-allocating the GPU memory and you run out: RAPIDS tries to use 90%, but other processes are already using 10+%.

When you change it to 0.8, RAPIDS tries to use less, which leaves room for your normal graphics-related processes, and you don't run out of memory.
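
To make that arithmetic concrete, here is a minimal sketch with purely hypothetical numbers (a 16 GiB card with about 2 GiB already held by desktop processes; nothing here is measured from this issue):

    object AllocFractionFit {
      // Hypothetical numbers for illustration only.
      val totalGpuGiB  = 16.0 // total device memory
      val otherProcGiB = 2.0  // e.g. Xorg + gnome-shell on a desktop

      // The RMM pool is sized as a fraction of *total* memory, so it still has to
      // fit into whatever is left free after the other processes.
      def poolFits(allocFraction: Double): Boolean =
        allocFraction * totalGpuGiB <= totalGpuGiB - otherProcGiB

      def main(args: Array[String]): Unit = {
        println(s"allocFraction=0.9 fits: ${poolFits(0.9)}") // 14.4 GiB needed, 14.0 GiB free -> false
        println(s"allocFraction=0.8 fits: ${poolFits(0.8)}") // 12.8 GiB needed, 14.0 GiB free -> true
      }
    }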

revans2 commented 3 years ago

I think we could do a better job with this. We can query the amount of free memory on the GPU and at least give a good error message explaining that the GPU already has memory allocated and we cannot fulfill the requested allocFraction; we could also give a hint about what to set it to.
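
A minimal sketch of what such a check might look like, assuming the cudf Java API's ai.rapids.cudf.Cuda.memGetInfo() exposes free and total device bytes; the 1 GiB reserve and the message wording are illustrative assumptions, not the plugin's actual logic:

    import ai.rapids.cudf.Cuda

    object PoolSizeCheck {
      // Sketch only: reject an allocFraction that cannot be satisfied and hint at one that can.
      def validateAllocFraction(allocFraction: Double, reserveBytes: Long = 1L << 30): Unit = {
        val info     = Cuda.memGetInfo()                 // assumed to report free/total bytes
        val poolSize = (allocFraction * info.total).toLong
        val usable   = info.free - reserveBytes          // leave some room for CUDA itself
        if (poolSize > usable) {
          val hint = math.max(usable, 0L).toDouble / info.total
          throw new IllegalArgumentException(
            s"Cannot reserve $poolSize bytes (allocFraction=$allocFraction of ${info.total} total): " +
            s"only ${info.free} bytes are free because other processes already hold GPU memory. " +
            f"Try spark.rapids.memory.gpu.allocFraction <= $hint%.2f")
        }
      }
    }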

abellina commented 3 years ago

> We can query the amount of free memory on the GPU and at least give a good error message about how the GPU already has memory allocated and we cannot fulfill the requested allocFraction

We do the checking part, but we are not suggesting which config to tune (maybe the error message could be improved):

https://github.com/NVIDIA/spark-rapids/blob/branch-0.4/sql-plugin/src/main/scala/com/nvidia/spark/rapids/GpuDeviceManager.scala#L166

In this particular scenario, I am confused why the error wasn't "Maximum pool size exceeded" or "tried to grow but couldn't" (https://github.com/rapidsai/rmm/blob/1f398d7736a227820fb6df680f52369a20c93358/include/rmm/mr/device/pool_memory_resource.hpp#L188).

wjxiz1992 commented 3 years ago

Thanks for the comments here! Some additional info I forgot to mention: yes, as @abellina said, RMM had already successfully allocated 90% of my GPU memory and the query was in the middle of real processing.

image

My main question is: why can case B pass while case A gets OOM? I think they should have the same usable space: (TOTAL_GPU_MEM - OTHER_PROCESS).

Maybe I am missing some details about memory usage here; please correct me if so. Thanks!

revans2 commented 3 years ago

@wjxiz1992

There are other problems associated with using free memory/USABLE SPACE instead of total memory. Mainly it boils down to running on dedicated hardware in a data center vs running on shared hardware on a desktop or workstation.

If you are running on a workstation, it is expected that the GPU is going to be shared with other processes that are not likely to use up all of the GPU's memory. If you are running in a production environment, it is typically unexpected for other processes to share the same GPU, and if they do share it, you want the memory used by each process not to depend on which process started using the GPU first.

Now suppose the cluster is misconfigured, or something goes wrong and a process on the GPU is not killed quickly, and two processes end up using the same GPU. In that case Spark would launch and take 90% of what is left. That may result in a crash quickly if there is not enough memory to do anything, or it might just fragment the memory so the crash comes much later, when we do a join or something else that taxes memory more. Or there might be no crash at all and the program simply runs much slower because it has less memory available.

Another example would be if you want to share a GPU for some reason, with two processes that are both going to use it. If we sized the pool relative to free space, you would have to launch the first one (process A) with a 50% limit, wait for it to come up fully, and then launch the other (process B) with a 100% limit. This feels very counter-intuitive.

Because Spark is really designed to run on distributed systems, in environments like Kubernetes and YARN that only ever schedule an entire GPU at once, we thought it best to optimize the settings for that type of environment.
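
A small worked example of that last point, using a hypothetical 16 GiB card and two processes that each want half of it (the numbers are mine, not from this issue):

    object GpuSharingExample {
      val totalGiB = 16.0 // hypothetical card size

      def main(args: Array[String]): Unit = {
        // Pools sized relative to *free* memory: the right setting depends on launch order.
        val aPool = 0.5 * totalGiB           // process A starts first: 50% of 16 GiB free -> 8 GiB
        val bPool = 1.0 * (totalGiB - aPool) // process B starts second: must ask for 100% of what is left -> 8 GiB
        println(s"free-relative:  A sets 0.5, B sets 1.0 -> $aPool GiB and $bPool GiB")

        // Pools sized relative to *total* memory: both simply set 0.5, independent of launch order.
        println(s"total-relative: A and B both set 0.5   -> ${0.5 * totalGiB} GiB each")
      }
    }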

revans2 commented 3 years ago

Sorry I should have read more closely.

I am not 100% sure on this, but I suspect the issue is caused by kernel loading. We reserve about 1GB of memory to load kernels dynamically, and this is again based off of total memory, not free memory. So if you allocated 90% of all memory and had essentially nothing free, you could get CUDA errors when it tried to load/launch a kernel but could not, because there was no free memory on the GPU to do it. Whereas the 80% allocation left enough free space that it could still load/launch the kernel.
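
To put rough, purely hypothetical numbers on that (a 16 GiB card, about 1 GiB used by desktop processes, and about 1 GiB needed outside the pool for kernels and stack space, as described above):

    object KernelHeadroomExample {
      // Hypothetical sizes, for illustration only.
      val totalGiB       = 16.0
      val otherProcsGiB  = 1.0 // desktop processes such as Xorg/gnome-shell
      val kernelSpaceGiB = 1.0 // approximate memory CUDA needs outside the RMM pool

      // Free memory left after the pool (a fraction of total) and the desktop processes.
      def headroom(allocFraction: Double): Double =
        totalGiB - allocFraction * totalGiB - otherProcsGiB

      def main(args: Array[String]): Unit = {
        println(f"allocFraction=0.9 leaves ${headroom(0.9)}%.1f GiB, below the ~$kernelSpaceGiB GiB needed -> OOM")
        println(f"allocFraction=0.8 leaves ${headroom(0.8)}%.1f GiB, comfortably above it -> OK")
      }
    }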

jlowe commented 3 years ago

Yes, the error here has to do with the task stack space. Some of the regular expression kernels require a lot of stack space to run, and this is not part of the RMM pool. Configuring too much RMM pool space can leave insufficient free space to allocate task stack space for some of the extremely stack-space-hungry kernels.
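
In practice, the workaround on a shared desktop GPU is simply to leave the pool more headroom. A minimal sketch of setting that programmatically, equivalent to the --conf flag in the spark-submit above (the 0.8 value is just the one that worked in this report, not a general recommendation):

    import org.apache.spark.sql.SparkSession

    // Sketch: cap the RMM pool lower so kernels and stack space still fit outside it.
    val spark = SparkSession.builder()
      .appName("GPU Q4")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      .config("spark.rapids.memory.gpu.allocFraction", "0.8") // was 0.9 in the failing run
      .getOrCreate()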

wjxiz1992 commented 3 years ago

@revans2 Thanks a lot for the clear explanation! It helps me understand this project more deeply. And yes, as you said, I have only seen this on my local PC, never in any other environment where a resource (allocator) manager provides a clean workspace for the GPU.

sameerz commented 3 years ago

@wjxiz1992 is your question answered? Can this issue be closed?

wjxiz1992 commented 3 years ago

@sameerz Yes, closing this as the question has been answered.