[VL] Spill related issues

zhztheplayer commented 11 months ago

Description

Mirror issue in facebookincubator/velox https://github.com/facebookincubator/velox/issues/6414

This issue lists the large memory consumers in the Velox backend's query execution that are not yet spillable, that is, cannot be spilled to disk.

Ideally, every listed item should eventually be fixed ("fix" means making it spillable) to ensure the memory stability of Gluten. Otherwise, an OOM error may be raised during execution and fail the user's query.
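To make "spillable" concrete, here is a minimal sketch of a buffer that writes its rows to a temporary file once a byte budget is exceeded. It is only an illustration of the idea, not how Velox implements spilling; the class name, the `budget_bytes` parameter, and the use of `pickle` as the on-disk format are all assumptions made for the example.

```python
import os
import pickle
import tempfile


class SpillableBuffer:
    """Hypothetical buffer that spills rows to disk when a byte budget is exceeded."""

    def __init__(self, budget_bytes):
        self.budget_bytes = budget_bytes
        self.rows = []          # rows currently held in memory
        self.used_bytes = 0     # estimated size of the in-memory rows
        self.spill_files = []   # paths of spill files, in write order

    def add(self, row):
        size = len(pickle.dumps(row))
        # Spill first if adding this row would go over the budget.
        if self.rows and self.used_bytes + size > self.budget_bytes:
            self.spill()
        self.rows.append(row)
        self.used_bytes += size

    def spill(self):
        """Write the in-memory rows to a temp file; return the bytes freed."""
        if not self.rows:
            return 0
        fd, path = tempfile.mkstemp(suffix=".spill")
        with os.fdopen(fd, "wb") as f:
            pickle.dump(self.rows, f)
        freed = self.used_bytes
        self.spill_files.append(path)
        self.rows = []
        self.used_bytes = 0
        return freed

    def read_all(self):
        """Yield spilled rows first (oldest file first), then in-memory rows."""
        for path in self.spill_files:
            with open(path, "rb") as f:
                yield from pickle.load(f)
        yield from self.rows

    def close(self):
        """Delete all spill files."""
        for path in self.spill_files:
            os.remove(path)
        self.spill_files = []
```

A non-spillable operator is one that has no equivalent of `spill()`: once its buffered input outgrows the budget, the only possible outcome is an OOM failure.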

The list of large non-spillable memory consumers (attach the PR after each item once it is fixed):

[x] Buffered inputs from Velox's window operator
- [x] Streaming window
- [x] Streaming window build
- [ ] Streaming window functions without build, not planned in Velox yet
- [x] Spillable sort window
[x] Buffered inputs from Velox's hash-aggregate operator, when the aggregate is a distinct aggregate
[x] Buffered inputs from Velox's hash-aggregate operator, when the aggregate is a partial aggregate (needs confirmation)
[x] Buffered input in Velox's hash-aggregate/hash-join(build)/sort operator, after all input is added
- [x] Hash-aggregate
- [ ] Hash-join(build): the Velox community is working on this now.
- [x] Sort
[x] Pre-allocate split buffers from Gluten's Velox shuffle writer
[x] A task can use the whole executor's memory if no other task is running in the executor (TPCDS Q67). Vanilla Spark does this.
[ ] External sort in fallback partition write (can't be triggered by Gluten)
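For the item about a single task using the executor's memory: as far as I understand vanilla Spark's execution memory pool, each of N active tasks may hold between 1/(2N) and 1/N of the pool, so a task running alone can take the whole pool. The sketch below only illustrates that arithmetic; the function name and its parameters are hypothetical and not a Spark or Gluten API.

```python
def per_task_bounds(pool_bytes, active_tasks):
    """Return (min_bytes, max_bytes) a task may hold, assuming the
    1/(2N) .. 1/N sharing rule with N active tasks."""
    if active_tasks < 1:
        raise ValueError("active_tasks must be >= 1")
    min_bytes = pool_bytes // (2 * active_tasks)
    max_bytes = pool_bytes // active_tasks
    return min_bytes, max_bytes


GIB = 1024 ** 3
alone = per_task_bounds(8 * GIB, 1)    # a single task may use the full pool
shared = per_task_bounds(8 * GIB, 4)   # four tasks: at most a quarter each
```

Under this rule a fixed per-task cap of `pool / cores` would leave memory unused whenever fewer tasks than cores are running, which is the case the item refers to.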

winningsix commented 11 months ago

Until we get spill support from Velox, what is our current plan?

From the initial PR, it seems we want to introduce a memory cap for each task attempt?

Yohahaha commented 11 months ago

I have the same question as @winningsix. We previously introduced the over-acquire concept, which holds extra memory reservations from Spark as a buffer to try to avoid OOM, and #3101 seems to introduce a memory limit for each Spark task, also to try to avoid OOM. Are these two features mutually exclusive or not?

If these Velox operators are still non-spillable, the total available bytes are fixed, and an operator's used bytes are fixed for a specific query, I doubt how much benefit these features can bring. Is there a case that hit OOM before but succeeds after enabling them?

Yohahaha commented 11 months ago

CC @liujiayi771

FelixYBW commented 10 months ago

update:

1. Buffered inputs from Velox's window operator
2. Buffered inputs from Velox's hash-aggregate operator, when aggregate is distinct aggregate: Velox PR created: https://github.com/facebookincubator/velox/issues/3263
3. Buffered inputs from Velox's hash-aggregate operator, when aggregate is partial aggregate (needs confirmation): currently Velox flushes the partial aggregation on OOM
4. Buffered input in Velox's hash-aggregate/hash-join(build)/sort operator, after all input is added: hash-aggregate support in Velox is done: https://github.com/facebookincubator/velox/pull/6903
5. Pre-allocate split buffers from Gluten's Velox shuffle writer: a couple of shuffle writer modifications are listed here:
   - Merged:
     - #2982 Dynamically adjust split buffer size.
     - #3036 Get available off-heap memory for split buffer calculation every time split() is called
     - #3199 Continuation of #2982. Bug fix & add UT.
     - #3159 Track memory allocation of split buffer and cached payload separately.
     - #3091 Remove preferSpill=True.
     - #3265 Shrink minimum partition buffer size and add spill support for partition buffers.
     - #3177 Refactor split buffer allocation. (Only code refactor, no functional change)
   - WIP:
     - #3265 Shrink min sized partition buffers and spill
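The shuffle writer changes about dynamically adjusting the split buffer size and re-reading the available off-heap memory on every split() call can be pictured roughly as follows. This is a sketch under assumptions, not the Gluten implementation: the function name, the `min_rows`/`max_rows` bounds, and the even division across partitions are all hypothetical.

```python
def partition_buffer_rows(available_bytes, num_partitions, bytes_per_row,
                          min_rows=32, max_rows=4096):
    """Pick a per-partition buffer size (in rows) from the memory that is
    available right now, clamped to hypothetical [min_rows, max_rows] bounds."""
    if num_partitions < 1 or bytes_per_row < 1:
        raise ValueError("num_partitions and bytes_per_row must be >= 1")
    rows = available_bytes // (num_partitions * bytes_per_row)
    return max(min_rows, min(max_rows, rows))


MIB = 1024 ** 2
# Re-evaluated on each split() call, so the size follows memory pressure.
roomy = partition_buffer_rows(1024 * MIB, num_partitions=1000, bytes_per_row=64)
normal = partition_buffer_rows(64 * MIB, num_partitions=1000, bytes_per_row=64)
tight = partition_buffer_rows(1 * MIB, num_partitions=1000, bytes_per_row=64)
```

Because the lower bound still costs `min_rows * bytes_per_row * num_partitions` bytes, shrinking that minimum and making the partition buffers themselves spillable matters when there are many partitions.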

FelixYBW commented 10 months ago

Another related fix:

Release all previous operators' memory when the shuffle writer's stop is called. When stop is called, all batches have already been processed, so the shuffle writer can compress the cached batches and write them to the page cache. PR: https://github.com/oap-project/gluten/pull/3526
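A rough sketch of that stop-time behavior, assuming the writer keeps uncompressed payloads cached in memory until stop is called. The class, the length-prefixed block layout, and the choice of `zlib` are hypothetical and only illustrate "compress the cached batches and write them out, dropping the in-memory copies".

```python
import io
import zlib


class CachingShuffleWriter:
    """Hypothetical writer that caches payloads and drains them on stop()."""

    def __init__(self):
        self.cached = []  # uncompressed payloads kept in memory

    def write(self, payload):
        self.cached.append(payload)

    def cached_bytes(self):
        return sum(len(p) for p in self.cached)

    def stop(self, sink):
        """Compress each cached payload into sink, releasing it from memory
        as soon as it is written. Returns the number of bytes written."""
        written = 0
        while self.cached:
            payload = self.cached.pop(0)
            block = zlib.compress(payload)
            sink.write(len(block).to_bytes(4, "little"))  # length prefix
            sink.write(block)
            written += 4 + len(block)
        return written


writer = CachingShuffleWriter()
writer.write(b"a" * 1000)
writer.write(b"b" * 1000)
sink = io.BytesIO()
written = writer.stop(sink)
```

After `stop` returns, the writer holds no cached payloads, so only the compressed output remains.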

FelixYBW commented 9 months ago

update:

1. Buffered inputs from Velox's window operator
2. Buffered inputs from Velox's hash-aggregate operator, when aggregate is distinct aggregate: Supported
3. Buffered inputs from Velox's hash-aggregate operator, when aggregate is partial aggregate (needs confirmation): Currently Velox flushes the partial aggregation on OOM
4. Buffered input in Velox's hash-aggregate/hash-join(build)/sort operator, after all input is added: hash-aggregate supported
5. Pre-allocate split buffers from Gluten's Velox shuffle writer
6. Release all previous operators' memory when the shuffle writer's stop is called. When stop is called, all batches have already been processed, so the shuffle writer can compress the cached batches and write them to the page cache: Supported
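On the partial-aggregate item: flushing instead of spilling is safe because partial results are merged again by the final aggregation, so emitting a group early only produces more partial rows for the same key. A minimal sketch of that idea follows; `max_groups` is a stand-in for a real memory limit, and none of these names come from Velox.

```python
class PartialSumAggregator:
    """Hypothetical partial SUM-by-key that flushes its table downstream
    when it is full, rather than failing or spilling."""

    def __init__(self, max_groups, emit):
        self.max_groups = max_groups  # stand-in for a memory limit
        self.emit = emit              # receives flushed (key, partial_sum) pairs
        self.table = {}

    def add(self, key, value):
        # A new key that does not fit triggers a flush of the whole table.
        if key not in self.table and len(self.table) >= self.max_groups:
            self.flush()
        self.table[key] = self.table.get(key, 0) + value

    def flush(self):
        for item in self.table.items():
            self.emit(item)
        self.table.clear()


def merge_partials(partials):
    """Final aggregation: sum all partial results per key."""
    out = {}
    for key, value in partials:
        out[key] = out.get(key, 0) + value
    return out


partials = []
agg = PartialSumAggregator(max_groups=2, emit=partials.append)
for key, value in [("a", 1), ("b", 2), ("c", 3), ("a", 4), ("b", 5)]:
    agg.add(key, value)
agg.flush()  # emit whatever is left at end of input
totals = merge_partials(partials)
```

The cost of an early flush is a larger shuffle (here five partial rows for three keys), not a wrong result.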

XinShuoWang commented 9 months ago

@FelixYBW @zhztheplayer @zhouyuan Hi, can you give me more details about item 5, "Pre-allocate split buffers from Gluten's Velox shuffle writer"? For example, a minimal reproducible example or related documents?