apache / incubator-gluten

Gluten is a middle layer responsible for offloading JVM-based SQL engines' execution to native engines.
https://gluten.apache.org/
Apache License 2.0
1.13k stars 409 forks source link

[VL] Spill related issues #3030

Open zhztheplayer opened 11 months ago

zhztheplayer commented 11 months ago

Description

Mirror issue in facebookincubator/velox https://github.com/facebookincubator/velox/issues/6414

This is to list the large memory occupations that are not spillable so far, which means, that are not able to be spilled to disk, in Velox backend's query execution.

Technically the listed items should be all finally fixed ("fix" means to make then spillable), to ensure the memory stability of Gluten. Otherwise there would be chance that OOM error raises during execution that would fail the user query.

The list of non-spillable large occupations (attach PR following each item once fixing):

winningsix commented 11 months ago

Before we got spill support from Velox, what's our current plan?

From initial PR, it seems we want to introduce a mem cap for each task attempt?

Yohahaha commented 11 months ago

I have same question with @winningsix , we introduce over-acquire concept before to hold more memory reservations from Spark as buffer to try avoid OOM, and #3101 seems introduce memory limit for each Spark task to try avoid OOM, does these two feature are exclusive or not?

If these Velox operators are still non-spillable, and all available bytes are fixed, operator's used bytes are fixed in specific query, I doubt how much benefits could gains from above features, is there a case that OOM before but success after enable these features?

Yohahaha commented 11 months ago

CC @liujiayi771

FelixYBW commented 10 months ago

update:

  1. Buffered inputs from Velox's window operator
  2. Buffered inputs from Velox's hash-aggregate operator, when aggregate is distinct aggregate Velox PR created: https://github.com/facebookincubator/velox/issues/3263
  3. Buffered inputs from Velox's hash-aggregate operator, when aggregate is partial aggregate (needs confirmation) Currently Velox flush the partial agg once OOM
  4. Buffered input in Velox's hash-aggregate/hash-join(build)/sort operator, after all input is added hashagg support in Velox is done: https://github.com/facebookincubator/velox/pull/6903
  5. Pre-allocate split buffers from Gluten's Velox shuffle writer couple of shuffle writer modifications listed here: • Merged:

    2982 Dynamically adjust split buffer size.

    3036 Get avaliable off-heap memory for split buffer calculation everytime split() is called

    3199 Continuation of #2982. Bug fix & add UT.

    3159 Track memory allocation of split buffer and cached payload separately.

    3091 Remove preferSpill=True.

    3265 Shrink minimum partition buffer size and add spill support for partition buffers.

    3177 Refactor split buffer allocation. (Only code refactor, no functional change)

    • WIP:

    3265 Shrink min sized partition buffers and spill

FelixYBW commented 10 months ago

Another fix related:

  1. release all previous operator's memory when shufflewriter's stop is called. Because when shuffle write's stop is called all batches are processed. So the shuffle writer can compress the cached batches and write to page cache. PRS: https://github.com/oap-project/gluten/pull/3526
FelixYBW commented 9 months ago

update:

  1. Buffered inputs from Velox's window operator
  2. Buffered inputs from Velox's hash-aggregate operator, when aggregate is distinct aggregate Supported
  3. Buffered inputs from Velox's hash-aggregate operator, when aggregate is partial aggregate (needs confirmation) Currently Velox flush the partial agg once OOM
  4. Buffered input in Velox's hash-aggregate/hash-join(build)/sort operator, after all input is added hashagg supported
  5. Pre-allocate split buffers from Gluten's Velox shuffle writer
  6. release all previous operator's memory when shufflewriter's stop is called. Because when shuffle write's stop is called all batches are processed. So the shuffle writer can compress the cached batches and write to page cache. Supported
XinShuoWang commented 9 months ago

@FelixYBW @zhztheplayer @zhouyuan Hi, can you give me more details about 5. Pre-allocate split buffers from Gluten's Velox shuffle writer? Like minimal reproducible example or related documents?

FelixYBW commented 7 months ago

Added a feature that a task can take use whole executor's memory, or it use other tasks' memory if other tasks are idle. It's what Vanilla spark does today. In Gluten we disabled the feature since OOM issue isn't solved. @zhztheplayer

FelixYBW commented 7 months ago

@zhztheplayer buffered input for sort is already supported, right?

Buffered input in Velox's hash-aggregate/hash-join(build)/sort operator, after all input is added Sort

FelixYBW commented 7 months ago

we removed Velox's partial Agg in Velox, use final agg instead because Vanilla spark's partial agg have the same behavior of Velox's final agg. In Vanilla Spark it doesn't support early eviction, instead it spill the data and does full agg in partial agg, it can make sure the agg key is unique in partial agg. On the otherside Velox's partial agg use earl eviction once memory is not enough, it doesn't make sure the agg key is unique in partial agg.

Since we use both final agg, spill is fully supported now

Yohahaha commented 7 months ago

we removed Velox's partial Agg in Velox, use final agg instead because Vanilla spark's partial agg have the same behavior of Velox's final agg. In Vanilla Spark it doesn't support early eviction, instead it spill the data and does full agg in partial agg, it can make sure the agg key is unique in partial agg. On the otherside Velox's partial agg use earl eviction once memory is not enough, it doesn't make sure the agg key is unique in partial agg.

Since we use both final agg, spill is fully supported now

So, flushable agg will be removed, right?

FelixYBW commented 7 months ago

So, flushable agg will be removed, right?

Yes, it's already removed in main branch. @zhztheplayer

FelixYBW commented 7 months ago

Added a feature that a task can take use whole executor's memory, or it use other tasks' memory if other tasks are idle. It's what Vanilla spark does today. In Gluten we disabled the feature since OOM issue isn't solved. @zhztheplayer

Talked with @zhztheplayer offline, we have a flag to control it. by default it's disabled which means the task can use full executor's memory. It may cause issues on some queries though. Once post hashjoin spill is supported, the risk will go down

zhztheplayer commented 7 months ago

Thanks for helping updating this @FelixYBW .

So, flushable agg will be removed, right?

Yes, it's already removed in main branch. @zhztheplayer

We actually have a sub-task to improve FlusahbleAggregateRule to convert eligible aggregations to Velox's flushable aggregate. That rule is currently not working in some corner cases though. @Yohahaha had helped figure out one case https://github.com/oap-project/gluten/issues/4421.

WangGuangxin commented 5 months ago

Added a feature that a task can take use whole executor's memory, or it use other tasks' memory if other tasks are idle. It's what Vanilla spark does today. In Gluten we disabled the feature since OOM issue isn't solved. @zhztheplayer

Talked with @zhztheplayer offline, we have a flag to control it. by default it's disabled which means the task can use full executor's memory. It may cause issues on some queries though. Once post hashjoin spill is supported, the risk will go down

@zhztheplayer @FelixYBW Hi, can you give me more information about the feature "a task can take use whole executor's memory, or it use other tasks' memory if other tasks are idle"? like the config name or related PR in vanilla spark or gluten

FelixYBW commented 3 months ago

Hash-join(build) Velox community is working on this now update: PR merged. https://github.com/facebookincubator/velox/pull/8894 Enabling in Gluten

FelixYBW commented 3 months ago

Streaming window supported function: RANK | ROW_NUMBER Aggregate Functions Syntax: MAX | MIN | COUNT | SUM | AVG | ...

Not supported yet: DENSE_RANK | PERCENT_RANK | NTILE CUME_DIST | LAG | LEAD | NTH_VALUE | FIRST_VALUE | LAST_VALUE