The issue here is that you have more than just RAPIDS using the GPU; you can see Xorg and gnome-shell also using it (normal desktop/terminal graphics processes). So at spark.rapids.memory.gpu.allocFraction=0.9 you over-allocate the GPU memory and you run out: RAPIDS tries to use 90%, but the other processes are already using 10+%.
When you change it to 0.8, RAPIDS tries to use less, which leaves room for your normal graphics-related processes, and you don't run out of memory.
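To make the over-allocation arithmetic concrete, here is a minimal Python sketch with made-up numbers for a hypothetical 16 GiB desktop GPU (the 2 GiB of desktop usage is an assumption, not taken from this report):

```python
GIB = 1024 ** 3

total = 16 * GIB           # hypothetical total GPU memory
desktop_usage = 2 * GIB    # assumed: Xorg, gnome-shell, other graphics processes
free = total - desktop_usage

for alloc_fraction in (0.9, 0.8):
    # allocFraction is a fraction of *total* GPU memory, not of what is free
    requested = alloc_fraction * total
    verdict = "fits" if requested <= free else "out of memory"
    print(f"allocFraction={alloc_fraction}: request {requested / GIB:.1f} GiB, "
          f"free {free / GIB:.1f} GiB -> {verdict}")
```

With these numbers, 0.9 asks for 14.4 GiB when only 14 GiB is free, while 0.8 asks for 12.8 GiB and fits.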
I think we could do a better job with this. We can query the amount of free memory on the GPU and at least give a good error message explaining that the GPU already has memory allocated and that we cannot fulfill the requested allocFraction.
We could also give a hint about what to set it to.
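A rough sketch of the kind of check and hint being suggested, using pynvml; this is not the plugin's actual code, and the 5% safety margin and the rounding are assumptions:

```python
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)      # GPU 0; adjust for your setup
mem = pynvml.nvmlDeviceGetMemoryInfo(handle)       # .total / .free / .used, in bytes
pynvml.nvmlShutdown()

requested_fraction = 0.9                           # spark.rapids.memory.gpu.allocFraction
requested_bytes = requested_fraction * mem.total

if requested_bytes > mem.free:
    # Suggest a fraction that fits into what is actually free, minus a margin.
    suggested = (mem.free * 0.95) / mem.total
    print(f"Cannot allocate {requested_fraction:.0%} of total GPU memory: "
          f"only {mem.free / mem.total:.0%} is free, other processes already hold "
          f"{mem.used / mem.total:.0%}. "
          f"Try spark.rapids.memory.gpu.allocFraction <= {suggested:.2f}")
```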
> We can query the amount of free memory on the GPU and at least give a good error message about how the GPU already has memory allocated and we cannot fulfill the requested allocFraction
We do the checking part, but we do not suggest which config to tune (maybe the error message could be improved):
In this particular scenario, I am confused why the error wasn't "Maximum pool size exceeded" or "tried to grow but couldn't" (https://github.com/rapidsai/rmm/blob/1f398d7736a227820fb6df680f52369a20c93358/include/rmm/mr/device/pool_memory_resource.hpp#L188).
Thanks for the comments here! Some additional info I forgot to mention: yes, as @abellina said, RMM had already successfully allocated 90% of my GPU memory, and the query was in the middle of real processing.
My main question is: why can case B (0.8) pass while case A (0.9) gets OOM? I think they should have the same usable space: (TOTAL_GPU_MEM - OTHER_PROCESS).
Maybe I missed something here about the memory usage details; please correct me. Thanks!
@wjxiz1992
There are other problems associated with using free memory/USABLE SPACE instead of total memory. Mainly it boils down to running on dedicated hardware in a data center vs running on shared hardware on a desktop or workstation.
If you are running on a workstation it is expected that the GPU is going to be shared with other processes, that are not likely to use up all of the GPU's memory. If you are running in a production environment it is typically unexpected for other processes to share the same GPU, and if they do share the same GPU then you want the memory used by each process to not depend on which process started to use the GPU first.
So suppose the cluster is mis-configured, or something happens and a process on the GPU is not killed quickly, and two processes end up using the same GPU. In that case Spark would launch and take 90% of whatever memory is left. This may result in a quick crash if there is not enough memory to do anything, or it might just fragment the memory and the crash comes much later, when we do a join or something else that taxes the memory more. Or there might be no crash at all, and instead the program runs much slower because it has less memory available to run with.
Another example would be if you want to share a GPU for some reason and have two processes that are going to share it. If we sized the pool from relative/free space, you would have to launch the first process (process A) with a 50% limit, wait for it to come up fully, and then launch the other (process B) with a 100% limit. This feels very counter-intuitive.
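To illustrate why sizing from total memory avoids that ordering problem, here is a hedged PySpark sketch; the 0.45 value is illustrative, not a recommendation. Because allocFraction is a fraction of total GPU memory rather than of whatever is currently free, both processes can use the same setting and be started in either order:

```python
from pyspark.sql import SparkSession

def build_session(app_name: str) -> SparkSession:
    # Each sharing process claims the same absolute fraction of *total* GPU
    # memory (45% here, leaving some headroom), so launch order does not matter.
    return (SparkSession.builder
            .appName(app_name)
            .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
            .config("spark.rapids.memory.gpu.allocFraction", "0.45")
            .getOrCreate())

# Process A and process B each run something like:
spark = build_session("shared-gpu-process")
```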
Because Spark is really designed to run on distributed systems, in environments like Kubernetes and YARN that only ever schedule an entire GPU at once, we thought it best to optimize the settings for that type of environment.
Sorry I should have read more closely.
I am not 100% sure on this, but I suspect the issue is caused by kernel loading. We reserve about 1 GB of memory to load kernels dynamically, and this again is based on total memory, not free memory. So if you allocated 90% of all memory and had essentially nothing free, you could get CUDA errors when we tried to load/launch a kernel but could not, because there was no free memory on the GPU to do it. Whereas the 80% allocation left enough free space available that the kernel could still be loaded/launched.
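A small worked example of that failure mode in Python; all numbers are illustrative, and the ~1 GiB kernel reserve figure is taken from the comment above rather than measured:

```python
GIB = 1024 ** 3

total = 16 * GIB              # hypothetical GPU
other_processes = 1 * GIB     # assumed: Xorg, gnome-shell, etc.
kernel_reserve = 1 * GIB      # memory still needed to load/launch kernels

for alloc_fraction in (0.9, 0.8):
    pool = alloc_fraction * total          # RMM pool sized from *total* memory
    left_over = total - other_processes - pool
    verdict = "enough" if left_over >= kernel_reserve else "kernel load/launch can fail"
    print(f"allocFraction={alloc_fraction}: {left_over / GIB:.1f} GiB left outside "
          f"the pool -> {verdict}")
```

With these numbers the 90% pool itself still fits, but it leaves only ~0.6 GiB outside the pool, while 80% leaves ~2.2 GiB.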
Yes, the error here has to do with the task stack space. Some of the regular expression kernels require a lot of stack space to run, and this is not part of the RMM pool. Configuring too much RMM pool space can leave insufficient free space to allocate task stack space for some of the extremely stack-space-hungry kernels.
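To get a feel for why stack space can matter, here is a back-of-the-envelope Python sketch; the per-thread stack size and resident thread count are purely assumed, hypothetical values:

```python
KIB, GIB = 1024, 1024 ** 3

# Assumed values for illustration only: a stack-hungry kernel and a large GPU.
stack_per_thread = 16 * KIB        # hypothetical per-thread stack requirement
resident_threads = 80 * 2048       # hypothetical SM count x max threads per SM

stack_bytes = stack_per_thread * resident_threads
print(f"~{stack_bytes / GIB:.1f} GiB of stack space needed outside the RMM pool")
```

Since this memory is allocated by the CUDA runtime outside the RMM pool, a pool that already claims nearly all of the GPU leaves nothing for it.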
@revans2 Thanks a lot for the clear explanation! It helps me understand this project more deeply. And yes, as you said, I only saw this once, on my local PC, and never in any other environment with a resource (allocator) manager that provides a clean GPU workspace.
@wjxiz1992 is your question answered, can this issue be closed?
@sameerz Yes, close this as question answered.
**Describe the bug**
I was running an LHA query on my PC in standalone mode. When setting `spark.rapids.memory.gpu.allocFraction=0.9`, there is an error of `cudaErrorMemoryAllocation: out of memory`. But after setting `spark.rapids.memory.gpu.allocFraction=0.8`, the error went away.

**Steps/Code to reproduce bug**
It's an LHA query, so I only describe the main steps here:

**Expected behavior**
A query should run successfully with a higher memory allocation fraction if it is able to pass with a lower one.

**Environment details (please complete the following information)**
```shell
# Set path for source table data
T2_PATH=/home/allxu/tmp/q4/t2
T3_PATH=/home/allxu/tmp/q4/t3
T4_PATH=/home/allxu/tmp/q4/t4

# Set path for your query output
# Here your query result (dataset) will be saved to the 'output' folder under your current working directory
OUT=output

# Set your Spark master IP:Port
MASTER=*****

# Remove existing query result data first,
# or you cannot run this query successfully
rm -rf $OUT

$SPARK_HOME/bin/spark-submit --master $MASTER \
  --driver-memory ${DRIVE_MEMORY}G \
  --executor-memory ${EXECUTOR_MEMORY}G \
  --executor-cores $EXECUTOR_CORES \
  --num-executors $NUM_EXECUTOR \
  --conf spark.executor.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
  --conf spark.driver.extraClassPath=${SPARK_CUDF_JAR}:${SPARK_RAPIDS_PLUGIN_JAR} \
  --conf spark.driver.extraJavaOptions=-Dai.rapids.cudf.nvtx.enabled=true \
  --conf spark.executor.extraJavaOptions='-Dai.rapids.cudf.prefer-pinned=true -Dai.rapids.cudf.nvtx.enabled=true' \
  --conf spark.task.cpus=1 \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.locality.wait=0 \
  --conf spark.yarn.maxAppAttempts=1 \
  --conf spark.sql.shuffle.partitions=24 \
  --conf spark.sql.files.maxPartitionBytes=128m \
  --conf spark.sql.warehouse.dir=$OUT \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.incompatibleOps.enabled=true \
  --conf spark.rapids.sql.variableFloatAgg.enabled=true \
  --conf spark.rapids.sql.concurrentGpuTasks=2 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.rapids.memory.gpu.pool=ARENA \
  --conf spark.rapids.memory.pinnedPool.size=${PINNEDMEMORY}G \
  --conf spark.rapids.memory.gpu.allocFraction=0.9 \
  --conf spark.rapids.sql.castStringToInteger.enabled=true \
  --conf spark.rapids.memory.pinnedPool.size=5g \
  --conf spark.task.resource.gpu.amount=0.08 \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.rapids.shuffle.ucx.enabled=false \
  --conf spark.rapids.sql.parquet.debug.dumpPrefix=$PWD/dump/dump \
  --conf spark.executor.resource.gpu.discoveryScript=./getGpusResources.sh \
  --files ${SPARK_HOME}/examples/src/main/scripts/getGpusResources.sh \
  --jars ${SPARK_CUDF_JAR},${SPARK_RAPIDS_PLUGIN_JAR} \
  ../main.py \
  --sqlFile sql.txt \
  --format parquet \
  --appName "GPU Q4" \
  --tableFile sql4_table_2::$T2_PATH --tableFile sql4_table_3::$T3_PATH --tableFile sql4_table_4::$T4_PATH
```