FelixYBW opened this issue 1 month ago
It's caused by the config spark.gluten.memory.overAcquiredMemoryRatio.
The config was introduced when Velox's spill wasn't mature enough. On every request, Gluten reserves 30% more memory, so Velox can only use about 70% of the off-heap memory size.
Now that Velox's spill is much more mature, we may decrease the ratio to 10% or 0 and see whether any bugs show up.
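For anyone who wants to try it, a minimal sketch of the setup (the off-heap size and the 0.1 value below are placeholders for illustration, not recommendations; Gluten plugin configs are omitted):

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: lower the over-acquire ratio to 0.1 (or 0) and watch for
// regressions. The off-heap size here is a placeholder, not a recommendation.
val spark = SparkSession.builder()
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8g")
  .config("spark.gluten.memory.overAcquiredMemoryRatio", "0.1") // instead of the 30% described above
  .getOrCreate()
```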
In this case there are still "killed by YARN" errors, which means a lot of memory allocation is still not tracked.
@Yohahaha, @ulysses-you @zhli1142015 @jackylee-ch @kecookier @surnaik @WangGuangxin in case you haven't noticed it.
I think it's safer to remove that option now, as long as we can run enough tests to prove our assumption. The daily TPC test we are using internally is far from covering real-world cases.
I've filed PR https://github.com/apache/incubator-gluten/pull/7384. We can proceed once we are confident.
I agree, let's remove the config for now. If there are bugs in the future, we can fix the underlying issue.
Decreasing the config to 0 will cause more "killed by YARN" errors, but "killed by YARN" is usually caused by a Velox bug.
Let's run some tests, and if that's the case, we can increase the default memory overhead to address it.
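As a hedged illustration of that mitigation (all sizes below are placeholders): on YARN the container limit is roughly heap + off-heap + overhead, so raising the executor memory overhead gives untracked native allocations more headroom before the container is killed.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative only: raise the executor memory overhead so untracked native
// allocations have more headroom under the YARN container limit.
val spark = SparkSession.builder()
  .config("spark.executor.memory", "8g")
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8g")
  .config("spark.executor.memoryOverhead", "2g") // container limit ≈ heap + off-heap + overhead
  .getOrCreate()
```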
Thanks for the information. I always set it to 0 in our jobs.
Once I set overAcquiredMemoryRatio, more "killed by YARN" errors happen. I use 1 GB per task thread now.
Yes, setting it to zero may cause fewer Velox spills, so it's possible that the real RSS of the process increases and triggers the YARN kill.
Can the over-acquired memory be used for overhead memory allocations? I'd expect that if I decrease the ratio, I should see an OOM instead of a kill by YARN.
Backend
VL (Velox)
Bug description
Off-heap is 8.5 GB, no fallback. When an operator reserves ~6.5 GB the spill is triggered: ~75% of the off-heap size.
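A back-of-the-envelope check under my reading of the 30% over-acquire behavior (the assumption here is that a reservation of X bytes acquires 1.3 × X from the off-heap pool; this is an interpretation, not a statement about Gluten internals):

```scala
// Assumption: a reservation of X bytes acquires 1.3 * X from the off-heap pool.
object SpillTriggerEstimate {
  def main(args: Array[String]): Unit = {
    val offHeapGiB = 8.5
    val overAcquiredRatio = 0.3
    val spillTriggerGiB = offHeapGiB / (1.0 + overAcquiredRatio)
    val fraction = spillTriggerGiB / offHeapGiB * 100
    // Prints roughly 6.54 GiB (~77% of off-heap), close to the observed ~6.5 GB trigger point.
    println(f"estimated spill trigger: $spillTriggerGiB%.2f GiB (~$fraction%.0f%% of off-heap)")
  }
}
```

Under that assumption the early spill at ~6.5 GB lines up with the over-acquire ratio rather than with a spill bug.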
@zhztheplayer