apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/

[VL] Spill starts when reserved size is less than offheap size #7380

Open FelixYBW opened 1 month ago

FelixYBW commented 1 month ago

Backend

VL (Velox)

Bug description

Offheap is 8.5 GB, no fallback. When an operator reserves ~6.5 GB the spill is triggered, at about 75% of the offheap size.

@zhztheplayer

FelixYBW commented 1 month ago

It's caused by the config spark.gluten.memory.overAcquiredMemoryRatio. The config was introduced when Velox's spill wasn't mature enough. On every request, Gluten reserves 30% more memory, so Velox can only use about 70% of the offheap memory size.

Now that Velox's spill is much more mature, we may decrease the ratio to 10% or 0 and see if there are any bugs.
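
As a rough illustration (a sketch only; the sizes are taken from this report and the 0.3 ratio is the default described above), over-acquiring means each reservation books roughly (1 + ratio) times the requested size from the off-heap pool, so spill kicks in near offheap / (1 + ratio):

```scala
// Back-of-the-envelope sketch: memory usable before spill under over-acquire.
// Sizes are from this report; the 0.3 ratio is the default described above.
val offHeapGiB        = 8.5   // spark.memory.offHeap.size
val overAcquiredRatio = 0.3   // spark.gluten.memory.overAcquiredMemoryRatio
val usableBeforeSpillGiB = offHeapGiB / (1 + overAcquiredRatio)
println(f"$usableBeforeSpillGiB%.1f GiB")  // ~6.5 GiB, i.e. ~75% of offheap, matching the observation
```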

In this case there are still "killed by YARN" errors, which means there is still a lot of memory allocation that isn't tracked.

@Yohahaha @ulysses-you @zhli1142015 @jackylee-ch @kecookier @surnaik @WangGuangxin, in case you haven't noticed it.

zhztheplayer commented 1 month ago

I think it's safer to remove that option now, as long as we can run enough tests to prove our assumption. The daily TPC test we use internally is far from covering real-world cases.

I've filed PR https://github.com/apache/incubator-gluten/pull/7384. We can proceed once we are confident.

surnaik commented 1 month ago

+1

> I think it's safer to remove that option now, as long as we can run enough tests to prove our assumption. The daily TPC test we use internally is far from covering real-world cases.
>
> I've filed PR #7384. We can proceed once we are confident.

I agree, let's remove the config for now. If there are bugs in the future, we can fix the underlying issue.

FelixYBW commented 1 month ago

Decreasing the config to 0 will cause more "killed by YARN" errors. But "killed by YARN" is usually caused by a Velox bug.

zhztheplayer commented 1 month ago

> Decreasing the config to 0 will cause more "killed by YARN" errors. But "killed by YARN" is usually caused by a Velox bug.

Let's run some tests, and if that's true, we can increase the default memory overhead to address it.
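
For example (a sketch only; the values are placeholders, not recommendations), the overhead can be raised per job while keeping the off-heap pool unchanged:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: give the YARN container extra headroom for native allocations that the
// Spark memory manager doesn't track. The concrete sizes here are placeholders.
val spark = SparkSession.builder()
  .config("spark.executor.memoryOverhead", "2g")   // extra container room beyond heap/off-heap
  .config("spark.memory.offHeap.enabled", "true")
  .config("spark.memory.offHeap.size", "8g")       // pool Gluten/Velox allocates from
  .getOrCreate()
```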

Yohahaha commented 1 month ago

> It's caused by the config spark.gluten.memory.overAcquiredMemoryRatio. The config was introduced when Velox's spill wasn't mature enough. On every request, Gluten reserves 30% more memory, so Velox can only use about 70% of the offheap memory size.
>
> Now that Velox's spill is much more mature, we may decrease the ratio to 10% or 0 and see if there are any bugs.
>
> In this case there are still "killed by YARN" errors, which means there is still a lot of memory allocation that isn't tracked.
>
> @Yohahaha @ulysses-you @zhli1142015 @jackylee-ch @kecookier @surnaik @WangGuangxin, in case you haven't noticed it.

Thanks for the information, I always set it to 0 in our jobs.

FelixYBW commented 1 week ago

Once I set overAcquiredMemoryRatio to 0, more "killed by YARN" errors happen. I use 1 GB per task thread now.
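
(For reference, a minimal sizing sketch under the assumption that the off-heap pool is shared roughly evenly across concurrently running task threads; the core count below is a placeholder:)

```scala
// Sketch: derive the off-heap pool size from a per-task-thread budget, assuming the
// pool is split roughly evenly across task slots. The core count is a placeholder.
val executorCores    = 8   // spark.executor.cores (placeholder)
val perTaskBudgetGiB = 1   // "1 GB per task thread" from the comment above
val offHeapSizeGiB   = executorCores * perTaskBudgetGiB
// -> set spark.memory.offHeap.size to s"${offHeapSizeGiB}g"
```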

zhztheplayer commented 1 week ago

> Once I set overAcquiredMemoryRatio to 0, more "killed by YARN" errors happen. I use 1 GB per task thread now.

Yes, setting it to zero may cause fewer Velox spills, so it's possible that the real RSS of the process increases and triggers the YARN kill.
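
(For context, a sketch of why this shows up as a YARN kill rather than a Spark OOM: the container limit is roughly heap plus memoryOverhead plus the off-heap pool, and native allocations that bypass the Spark memory manager only show up in the process RSS:)

```scala
// Sketch: approximate YARN container budget for an executor (ignores smaller terms
// such as spark.executor.pyspark.memory and resource rounding).
def approxContainerGiB(heapGiB: Double, memoryOverheadGiB: Double, offHeapGiB: Double): Double =
  heapGiB + memoryOverheadGiB + offHeapGiB

// Untracked native allocations don't count against the off-heap pool (so no Spark OOM),
// but they do raise RSS; once RSS exceeds the container budget, YARN kills the executor.
```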

FelixYBW commented 1 week ago

> > Once I set overAcquiredMemoryRatio to 0, more "killed by YARN" errors happen. I use 1 GB per task thread now.
>
> Yes, setting it to zero may cause fewer Velox spills, so it's possible that the real RSS of the process increases and triggers the YARN kill.

Can the over-acquired memory be used for overhead memory allocation? I'd expect that if I decrease the ratio, I should see an OOM instead of "killed by YARN".