Closed: leventov closed this 1 year ago
The new approach could be combined with experiments on off-heap space pre-allocation for dimension values. Currently, off-heap space is allocated on demand, right during query processing, which plagues the processing loop with branches. Even more branches are required with the new approach.
Also, off-heap space could be pre-allocated based on the popularity of dimension values, from most popular to least popular, which would improve memory and page cache usage.
Also, this approach may not work well if numBytesPerRecord is comparable with the chunk size, e.g. for sketches. In this case, a contiguous memory area as large as possible could be allocated in the new-style pool, and the processing then uses the old approach with multiple runs over the data.
What problems it solves:
- Only a single config is needed, which is much easier to get right: the total amount of memory that the user dedicates to off-heap processing on the Druid instance.
- On heterogeneous query loads, resources are used more effectively.
I agree these are problems worth solving. The current memory allocation system does have too many configs and it's difficult to size them optimally (and, like you pointed out, it can be impossible to achieve really optimal behavior for heterogeneous workloads).
Despite the weaknesses of the current system, it does have one good feature: once a query starts, it never needs to block waiting to allocate more memory. The memory is all reserved up front, and then the query proceeds until complete. If queries allocated memory more dynamically, we'd need some way of dealing with a situation where two queries both want to allocate the "last chunk" of available memory.
@gianm this feature is still present in the proposed approach, and the logic of "multiple runs" is not changed. To prevent a situation where one or several expensive queries take all available off-heap memory, dynamic constraints could be added, e.g. no query may receive more memory than the current amount of free off-heap memory divided by the current number of idle processing threads. In the worst case (when all queries are equally expensive) it divides resources as fairly as the current approach, but when cheaper queries run in parallel with more expensive ones, the expensive queries can allocate more memory while the cheaper queries are not hurt.
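A minimal sketch of this dynamic constraint (class and method names are illustrative, not actual Druid code):

```java
// Illustrative sketch of the proposed dynamic constraint:
// no query may reserve more than freeMemory / idleProcessingThreads.
public class MemoryGrantPolicy {
    /** Returns the maximum number of bytes a single query may reserve right now. */
    public static long maxGrantBytes(long freeOffHeapBytes, int idleProcessingThreads) {
        if (idleProcessingThreads <= 0) {
            return 0; // no idle processing threads: a new query must wait
        }
        return freeOffHeapBytes / idleProcessingThreads;
    }

    public static void main(String[] args) {
        // 1 GB free and 4 idle threads: each new query may take up to 256 MB,
        // so in the worst case resources are divided as fairly as the
        // current fixed-buffer approach.
        System.out.println(maxGrantBytes(1L << 30, 4));
    }
}
```

If cheaper queries finish quickly, the freed memory flows back into the numerator, so a long-running expensive query can grow beyond the initial fair share.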
Dropping a comment to follow the thread.
> To prevent a situation where one or several expensive queries take all available off-heap memory, dynamic constraints could be added, e.g. no query may receive more memory than the current amount of free off-heap memory divided by the current number of idle processing threads.
That's the situation I was concerned about (an expensive query takes all available off-heap memory).
Also, with this idea implemented, it would be possible to set different constraints for different queries, e.g. not allowing low-priority queries to take a lot of memory.
The problems mentioned here would be great to solve. My understanding is: Druid would allocate a big block of memory (of configured size) on startup, and queries would get chunks off of it. If the configured size is more than 2G, the "big block of memory" could be a list of ByteBuffer objects of 2G each, or a big Memory object. This would all be hidden inside an Allocator abstraction from which the rest of Druid (queries, for example) would request memory.
The biggest problem in such a scenario is dealing with fragmentation. One option is that the Allocator would only give out chunks of two possible sizes (say, 4MB and 64K; these sizes could be made configurable if necessary). That way there is no fragmentation problem, as you keep giving out 64K chunks from the head of the big memory block and 4MB chunks from the tail.
Druid query implementations would need to be adjusted to work within this constraint. For example, the GroupByQuery hash table implementation shouldn't assume that a big 1-2 GB block of memory is available, but rather that a list of 4MB memory chunks (not contiguous in address space) can be requested from the Allocator; the same goes for the TopN query, etc. This might also require putting limits on how big a single row could be, e.g. it can't get bigger than 4MB.
64K (smaller) chunks would be used for temporary decompression buffers etc.
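The head/tail idea can be sketched as follows (a toy illustration with hypothetical names, tracking offsets only; the two regions grow toward each other, so neither size fragments the other):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of the two-size, head/tail allocation idea described above.
 * Small 64K chunks are handed out from the head of one big memory block,
 * large 4MB chunks from the tail; freed chunks are reused via free lists.
 */
public class TwoSizeAllocator {
    static final long SMALL = 64 * 1024;
    static final long LARGE = 4 * 1024 * 1024;

    private long head = 0;  // next free offset for small chunks
    private long tail;      // one past the last free offset for large chunks
    private final Deque<Long> freedSmall = new ArrayDeque<>();
    private final Deque<Long> freedLarge = new ArrayDeque<>();

    public TwoSizeAllocator(long capacity) {
        this.tail = capacity;
    }

    /** Returns the offset of a 64K chunk, or -1 if the block is exhausted. */
    public long allocateSmall() {
        if (!freedSmall.isEmpty()) return freedSmall.pop();
        if (head + SMALL > tail) return -1;
        long off = head;
        head += SMALL;
        return off;
    }

    /** Returns the offset of a 4MB chunk, or -1 if the block is exhausted. */
    public long allocateLarge() {
        if (!freedLarge.isEmpty()) return freedLarge.pop();
        if (tail - LARGE < head) return -1;
        tail -= LARGE;
        return tail;
    }

    public void freeSmall(long off) { freedSmall.push(off); }
    public void freeLarge(long off) { freedLarge.push(off); }
}
```

A real version would need thread safety; this only shows why two fixed sizes growing from opposite ends cannot fragment.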
> "big block of memory" could be a list of ByteBuffer objects of 2G each or a big Memory object.
No, the problem is that it must be just one contiguous block of memory, not a list of blocks. With a list of blocks, the logic in the tight loop (e.g. in topN) becomes considerably more expensive.
> The biggest problem in such a scenario is dealing with fragmentation. One option is that the Allocator would only give out chunks of two possible sizes (say, 4MB and 64K; these sizes could be made configurable if necessary).
There is no need for blocks of different sizes. All blocks could be of the same size, hence no fragmentation.
> There is no need for blocks of different sizes. All blocks could be of the same size, hence no fragmentation.
What size ByteBuffer would groupBy query processing, topN query processing, and temporary decompression buffers ask for?
GroupBy, TopN, and decompression already use different sets of buffers. Also, in the long run decompression buffers should not be needed, see https://github.com/druid-io/druid/issues/4080
> GroupBy and decompression use different sets of buffers already.
My goal would be that the Druid process does not allocate any more memory than configured for the "big memory block", so decompression buffers should come from the same big allocation if they exist.
However, what size buffer would be requested by groupBy and topN query processing? Would they request multiple chunks of xx MB? In that case, do they need to be contiguous in address space?
> My goal would be that the Druid process does not allocate any more memory than configured for the "big memory block", so decompression buffers should come from the same big allocation if they exist.
Yes, that would be better. Then the compression block size in the segment format should be coordinated with the virtual chunk size in this "big memory block", so that decompression could claim exactly one chunk. Currently the compression block size in the segment format is 64K.
> However, what size buffer would be requested by groupBy and topN query processing?
As much as they need, but not more than remaining free memory / # of idle processing threads.
> Will they request multiple chunks of xx MB?
Yes, or multiple chunks of 64K size.
> In that case, do they need to be contiguous in address space?
No, but they all need to be part of the larger "big memory block", which should be contiguous as a whole.
> No, but they all need to be part of the larger "big memory block", which should be contiguous as a whole.
However, this is not strictly needed, because we could virtually convert multiple direct memory allocations of < 2 GB each into one "giant" block. The whole point of this requirement is the ability to operate on a single long offset in certain hot loops, instead of a ByteBuffer/Memory + offset pair.
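A sketch of what "virtually one giant block" could mean (illustrative names; the 1 GB backing size is an assumption chosen so the split is pure bit arithmetic): hot loops carry only a single long, and the backing-buffer index and local offset are recovered with cheap shifts at the access site.

```java
/**
 * Sketch of virtually joining several direct allocations of up to 2 GB
 * into one address space indexed by a single long offset.
 */
public class VirtualGiantBuffer {
    // Each backing allocation is 1 GB so the local offset fits in 30 bits.
    static final int SHIFT = 30;
    static final long LOCAL_MASK = (1L << SHIFT) - 1;

    /** Which backing buffer a global offset falls into. */
    public static int bufferIndex(long globalOffset) {
        return (int) (globalOffset >>> SHIFT);
    }

    /** The offset within that backing buffer. */
    public static int localOffset(long globalOffset) {
        return (int) (globalOffset & LOCAL_MASK);
    }

    public static void main(String[] args) {
        long off = (2L << SHIFT) + 12345; // byte 12345 of the third 1 GB buffer
        System.out.println(bufferIndex(off));
        System.out.println(localOffset(off));
    }
}
```

The cost relative to a truly contiguous allocation is the extra shift/mask plus an array lookup per access, which is why a single real block (#3892) is still preferable in the hottest loops.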
This proposal looks good. Regarding the situation when an expensive query blocks all other queries, @leventov's solution (https://github.com/druid-io/druid/issues/4422#issuecomment-309556971) sounds good, but, in the end, I think it should be addressed as a separate issue like resource preemption.
BTW there is a relatively simple (though hacky) workaround for ByteBuffer's 2GB limitation without implementing #3892 in full: we can manage a single contiguous block of off-heap memory as a Memory object and convert slices of it to ByteBuffer for further use in Druid's core query processing logic using the unsafeByteBufferView() method.
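The same zero-copy "view" idea can be shown with plain java.nio (the thread proposes doing this with a Memory object and unsafeByteBufferView(); this sketch only demonstrates that a slice shares storage with the big backing allocation):

```java
import java.nio.ByteBuffer;

/**
 * Zero-copy slicing of one big direct allocation into smaller
 * ByteBuffer views that share the same underlying memory.
 */
public class SliceView {
    /** Returns a view over [offset, offset + length) of the backing buffer. */
    public static ByteBuffer view(ByteBuffer backing, int offset, int length) {
        ByteBuffer dup = backing.duplicate(); // independent position/limit
        dup.position(offset);
        dup.limit(offset + length);
        return dup.slice(); // no copy: same storage, new zero-based indexing
    }

    public static void main(String[] args) {
        ByteBuffer big = ByteBuffer.allocateDirect(1 << 20); // one big allocation
        ByteBuffer chunk = view(big, 64 * 1024, 64 * 1024);  // a 64K view
        chunk.putInt(0, 42);
        System.out.println(big.getInt(64 * 1024)); // same underlying memory
    }
}
```

With Memory the view could exceed the 2 GB limit of a single ByteBuffer, which is the point of the workaround; plain java.nio caps the backing buffer at 2 GB.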
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Currently we have a pool of large buffers (they are usually configured so); for each query that needs off-heap memory, one buffer is allocated to the query. If the query requires more off-heap memory than the buffer size, the data is processed in several runs, each run on a subset of dimension values, clearing the buffer between the runs.
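The cost of the current scheme can be made concrete with a hedged sketch (names illustrative): the number of full passes over the data grows with how far the query's memory need exceeds one pooled buffer.

```java
/**
 * Sketch of the "multiple runs" behavior described above: a query whose
 * aggregation state does not fit in one pooled buffer is processed in
 * several passes, each over a subset of dimension values.
 */
public class MultipleRuns {
    /** Number of passes over the data, i.e. ceil(bytesNeeded / bufferSize). */
    public static int numRuns(long bytesNeeded, long bufferSize) {
        return (int) ((bytesNeeded + bufferSize - 1) / bufferSize);
    }

    public static void main(String[] args) {
        // A query needing 2.5 GB of state with 1 GB pooled buffers runs
        // 3 times, scanning the segments three times in total.
        System.out.println(numRuns(2_684_354_560L, 1_073_741_824L));
    }
}
```

Each extra run is a full scan of the input, which is why right-sizing the buffers matters so much under the current approach.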
Problems:
The different approach
Off-heap memory is allocated in relatively small, fixed-size chunks, e.g. 64K. A fixed chunk size is needed to avoid fragmentation and to allow simple, lock-free allocation logic.
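One way such lock-free allocation could look (an illustrative sketch, not Druid code; it ignores the ABA problem that a production version would have to handle): free chunks form a Treiber stack of chunk indexes, so allocate and free are each a single CAS, and since every chunk is the same size there is nothing to fragment.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Lock-free pool of equal-size chunks, identified by index. */
public class ChunkPool {
    private final AtomicIntegerArray next; // next[i] = next free chunk after i
    private final AtomicInteger head = new AtomicInteger();

    public ChunkPool(int numChunks) {
        next = new AtomicIntegerArray(numChunks);
        for (int i = 0; i < numChunks; i++) {
            next.set(i, i + 1); // initially all chunks are free, linked in order
        }
        head.set(0);
    }

    /** Pops a free chunk index, or returns -1 if the pool is exhausted. */
    public int allocate() {
        while (true) {
            int h = head.get();
            if (h == next.length()) return -1; // sentinel: free list is empty
            if (head.compareAndSet(h, next.get(h))) return h;
        }
    }

    /** Pushes a chunk index back onto the free list. */
    public void free(int chunk) {
        while (true) {
            int h = head.get();
            next.set(chunk, h);
            if (head.compareAndSet(h, chunk)) return;
        }
    }
}
```

The chunk index maps to a byte offset in the giant buffer as `index * chunkSize`, so the pool never deals in pointers at all.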
When a query needs more off-heap memory, it is provided with a list of allocated chunks. But it is not a list of separate ByteBuffer objects (that would kill query processing performance); it is a list of offsets within a single giant ByteBuffer.

Currently, space allocation for a dimension value looks like this:
Will be just a little more complex:
Otherwise the query processing speed shouldn't be affected.
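The two snippets referenced above are not preserved in this thread; the following is only a hedged guess at the shape of the comparison (all names illustrative). The "little more complex" part is one extra branch: when the current chunk fills up, continue at the next pre-allocated chunk's base offset within the giant buffer instead of aborting the run.

```java
/**
 * Illustrative old-vs-new allocation for a dimension value's state.
 * Old: a bump pointer bounded by one per-query buffer.
 * New: the same bump pointer, plus a hop to the next chunk when full.
 */
public class DimensionValueAllocation {
    /** Old style: returns the write offset, or -1 meaning "buffer full, start another run". */
    public static long allocateOld(long position, int numBytesPerRecord, long bufferCapacity) {
        if (position + numBytesPerRecord > bufferCapacity) {
            return -1; // fall back to another pass over the data
        }
        return position; // caller advances position by numBytesPerRecord
    }

    /** New style: one extra branch hops to the next chunk of the giant buffer. */
    public static long allocateNew(long position, int numBytesPerRecord,
                                   long chunkEnd, long nextChunkBase) {
        if (position + numBytesPerRecord > chunkEnd) {
            return nextChunkBase; // continue in the next allocated chunk
        }
        return position;
    }
}
```

In the common case both paths are a compare and a bump; the new path merely replaces "give up and rerun" with "jump to the next chunk".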
What problems it solves:
Problem with this approach
Currently, ByteBuffer size is limited to 2 GB, which complicates resource sharing. It would be better to allocate a single giant memory area for all off-heap memory. #3892 will solve this.