NVIDIA / spark-rapids

Spark RAPIDS plugin - accelerate Apache Spark with GPUs
https://nvidia.github.io/spark-rapids
Apache License 2.0

[BUG] src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit at ai.rapids.cudf.Table.concatenate(Native Method) #8021

Open mtsol opened 1 year ago

mtsol commented 1 year ago

Describe the bug This exception occurs after a certain number of executions:

2023-04-04 08:08:52 WARN DAGScheduler:69 - Broadcasting large task binary with size 1811.4 KiB
2023-04-04 08:11:05 WARN TaskSetManager:69 - Lost task 0.0 in stage 443.0 (TID 2841) (10.84.179.52 executor 2): java.lang.OutOfMemoryError: Could not allocate native memory: std::bad_alloc: out_of_memory: RMM failure at:/home/jenkins/agent/workspace/jenkins-spark-rapids-jni-release-1-cuda11/thirdparty/cudf/cpp/build/_deps/rmm-src/include/rmm/mr/device/limiting_resource_adaptor.hpp:143: Exceeded memory limit
    at ai.rapids.cudf.Table.concatenate(Native Method)
    at ai.rapids.cudf.Table.concatenate(Table.java:1635)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$2(GpuKeyBatchingIterator.scala:138)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:64)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:62)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$concatPending$1(GpuKeyBatchingIterator.scala:123)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.concatPending(GpuKeyBatchingIterator.scala:122)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$3(GpuKeyBatchingIterator.scala:166)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.withResource(GpuKeyBatchingIterator.scala:34)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2(GpuKeyBatchingIterator.scala:165)
    at com.nvidia.spark.rapids.GpuKeyBatchingIterator.$anonfun$next$2$adapted(GpuKeyBatchingIterator.scala:162)
    at com.nvidia.spark.rapids.Arm.withResource(Arm.scala:28)
    at com.nvidia.spark.rapids.Arm.withResource$(Arm.scala:26)

Steps/Code to reproduce bug

/u/bin/spark-3.1.1-bin-hadoop2.7/bin/spark-submit --master k8s://https:/k8s-master:6443 \
  --deploy-mode cluster --name app-name \
  --conf spark.local.dir=/y/mcpdata \
  --conf spark.kubernetes.executor.request.cores=1 \
  --conf spark.executor.cores=1 \
  --conf spark.executor.instances=2 \
  --conf spark.executor.memory=120G \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.scheduler.allocation.file=/opt/spark/bin/fair_example.xml \
  --conf spark.dynamicAllocation.enabled=false \
  --conf spark.executor.heartbeatInterval=3600s \
  --conf spark.network.timeout=36000s \
  --conf spark.sql.broadcastTimeout=36000 \
  --conf spark.driver.memory=70G \
  --conf spark.kubernetes.namespace=default \
  --conf spark.driver.maxResultSize=50g \
  --conf spark.kubernetes.container.image.pullPolicy=Always \
  --conf spark.pyspark.driver.python=python3.8 \
  --conf spark.pyspark.python=python3.8 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=repo/app:tag \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.mount.path=/y/mcpdata \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.mount.path=/u/bin/pipeline_stages \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.mount.path=/u/bin/evaluation_visualizations \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-visualizations-volume.options.claimName=fe-visualizations-volume \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-pipeline-stages-volume.options.claimName=fe-pipeline-stages-volume \
  --conf spark.kubernetes.driver.volumes.persistentVolumeClaim.fe-logs-volume.options.claimName=fe-logs-volume \
  --conf spark.kubernetes.driver.label.driver=driver \
  --conf spark.kubernetes.spec.driver.dnsConfig=default-subdomain \
  --conf spark.kubernetes.driverEnv.ENV_SERVER=QA \
  --conf spark.executorEnv.ENV_SERVER=QA \
  --conf spark.sql.adaptive.enabled=false \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.kubernetes.executor.podTemplateFile=/tmp/templates/gpu-template.yaml \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=1 \
  --conf spark.rapids.sql.rowBasedUDF.enabled=true \
  --conf spark.rapids.sql.concurrentGpuTasks=1 \
  --conf spark.executor.resource.gpu.vendor=nvidia.com \
  --conf spark.rapids.memory.gpu.oomDumpDir=/y/mcpdata \
  --conf spark.rapids.memory.pinnedPool.size=50g \
  --conf spark.executor.memoryOverhead=25g \
  --conf spark.rapids.sql.batchSizeBytes=32m \
  --conf spark.executor.resource.gpu.discoveryScript=/getGpusResources.sh \
  --conf spark.rapids.sql.explain=ALL \
  --conf spark.rapids.memory.host.spillStorageSize=20g \
  --conf spark.sql.shuffle.partitions=50 \
  --conf spark.eventLog.enabled=true \
  --conf spark.eventLog.dir=/y/mcpdata/ \
  --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp \
  --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.mount.readOnly=false \
  --conf spark.kubernetes.driver.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.path=/u/spark-tmp \
  --conf spark.sql.cache.serializer=com.nvidia.spark.ParquetCachedBatchSerializer \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.mount.readOnly=false \
  --conf spark.kubernetes.executor.volumes.hostPath.spark-local-dir-1.options.path=/u/spark-tmp \
  local:///code.py &
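For readers skimming the long spark-submit command above, the RAPIDS-specific settings are the ones most relevant to this failure. Below is a minimal PySpark sketch that sets the same subset of options programmatically; the application name is illustrative and the values are simply copied from the command above, so they would need to be adapted to your own cluster.

```python
from pyspark.sql import SparkSession

# Minimal sketch: only the plugin- and GPU-memory-related options from the
# spark-submit command above. Values mirror that command; adjust as needed.
spark = (
    SparkSession.builder
    .appName("app-name")  # illustrative
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
    .config("spark.executor.resource.gpu.amount", "1")
    .config("spark.task.resource.gpu.amount", "1")
    .config("spark.rapids.sql.concurrentGpuTasks", "1")
    .config("spark.rapids.sql.batchSizeBytes", "32m")
    .config("spark.rapids.memory.pinnedPool.size", "50g")
    .config("spark.rapids.memory.host.spillStorageSize", "20g")
    .config("spark.sql.adaptive.enabled", "false")
    .getOrCreate()
)
```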

Expected behavior The query should complete successfully; instead, a cudf.Table concatenation ran out of GPU memory.

Environment details (please complete the following information)

revans2 commented 1 year ago

@mtsol

When discussing this we were a little confused: does it fail randomly, as if GPU memory is near the limit of what it can support so that it sometimes works and sometimes fails, or does it look more like a memory leak, where running it X times always works but the X+1th run crashes?

We are working on a way to mitigate situations like this (#7778). The goal is to have this in the 23.06 release. If you want to try and test it sooner, I can see if we can come up with a version you could try out.

mtsol commented 1 year ago

I would appreciate it if you could provide me something to test sooner.

mtsol commented 1 year ago

"if this fails randomly like the GPU memory is near the limit on what it can support and some times it works, while other times it fails, or if this looks more like a memory leak where running in X times always works, but X+1 times crashes?"

Ans: In my case it always crashes at X+1. When I have 1.2 million rows in my dataset everything works fine, but when I increase the data it crashes with this error; it also crashes at 1.5 million and 2 million rows, and at any size in between. I cannot say whether it is related to a memory leak, but from what I observed the error occurs once the data grows beyond a certain limit.

PS: I would appreciate it if you could provide a jar prior to the release, so I can test whether that one works fine with our data.

mtsol commented 1 year ago

After debugging and analysis, I found that this statement in my code:

df = df.withColumn(self.output_col_name, concat_ws(col_sep, array(self.input_col_name_list)))

was causing the error on larger data on GPUs. I think there is some bug in the GPU optimization of the concat function, which needs to be addressed.

revans2 commented 1 year ago

@mtsol thanks for the updated info. concat_ws can be a memory hog, especially if you are not also dropping the input columns after concatenating them together. We are aware that we have some problems with batch sizes when doing a ProjectExec that adds more rows. We have plans to work on this; https://github.com/NVIDIA/spark-rapids/issues/7257 is the epic tracking it.
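A hedged illustration of the advice above: if only the concatenated column is needed downstream, dropping the input columns right after the concat_ws keeps later batches from carrying both the original columns and the new, wider string column. The column names and separator below are hypothetical stand-ins for the poster's self.input_col_name_list, col_sep, and self.output_col_name.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import array, concat_ws

spark = SparkSession.builder.appName("concat-ws-example").getOrCreate()

# Hypothetical stand-ins for the poster's self.input_col_name_list,
# col_sep, and self.output_col_name.
input_cols = ["col_a", "col_b", "col_c"]
col_sep = "|"
output_col = "combined"

df = spark.createDataFrame([("x", "y", "z"), ("1", "2", "3")], input_cols)

# Build the concatenated column, then drop the inputs so downstream batches
# do not keep both the originals and the new wide string column.
df = (
    df.withColumn(output_col, concat_ws(col_sep, array(*input_cols)))
      .drop(*input_cols)
)
df.show()
```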

I am guessing that you simply removed that line from your query, which reduced the total memory pressure at that point in time and for the data processed after it.

revans2 commented 1 year ago

@mtsol I have a snapshot jar that you can try.

https://drive.google.com/file/d/15RyaI5OyeSJNEj5G-W4MnN8JeyQPq4ff/view?usp=sharing

Be aware that there are some known bugs in it, specifically https://github.com/NVIDIA/spark-rapids/issues/8147, which is caused by https://github.com/rapidsai/cudf/issues/13173. It should go without saying, but don't use this in production, and avoid the substring command if you can.

If you want a better version I can upload another one once the issue is fixed.