Closed: leventov closed this 1 year ago
The new approach could be combined with experiments on off-heap space pre-allocation for dimension values. Currently, off-heap space is allocated on demand, right during query processing, which plagues the processing loop with branches. Even more branches are required with the new approach.
Also, off-heap space could be pre-allocated based on the popularity of dimension values, from most popular to least popular, which would improve memory and page cache usage.
Also, this approach may not work well if numBytesPerRecord is comparable with the chunk size, e.g. for sketches. In this case, a contiguous memory area as large as possible could be allocated in the new-style pool, and the processing then uses the old approach with multiple runs over the data.
What problems it solves:
- Only a single config is needed, which is much easier to get right: the total amount of memory that the user dedicates to off-heap processing on the Druid instance.
- On heterogeneous query loads, resources are used more effectively.
I agree these are problems worth solving. The current memory allocation system does have too many configs and it's difficult to size them optimally (and, like you pointed out, it can be impossible to achieve really optimal behavior for heterogeneous workloads).
Despite the weaknesses of the current system, it does have one good feature: once a query starts, it never needs to block waiting to allocate more memory. The memory is all reserved up front, and then the query proceeds until complete. If queries allocated memory more dynamically, we'd need some way of dealing with a situation where two queries both want to allocate the "last chunk" of available memory.
@gianm this feature is still present in the proposed approach, and the logic of "multiple runs" is not changed. To prevent a situation where one or several expensive queries take all available off-heap memory, dynamic constraints could be added, e.g. no query may receive more memory than the current amount of free off-heap memory divided by the current number of idle processing threads. In the worst case (when all queries are equally expensive) it divides resources as fairly as the current approach, but when cheaper queries run in parallel with more expensive ones, the expensive queries can allocate more memory while the cheaper queries are not hurt.
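A minimal sketch of this dynamic constraint (class and method names are illustrative, not actual Druid code):

```java
// Illustrative sketch of the proposed dynamic constraint:
// no query may reserve more than freeMemory / idleProcessingThreads.
public class MemoryGrantPolicy {
    /** Returns the maximum number of bytes a single query may reserve right now. */
    public static long maxGrantBytes(long freeOffHeapBytes, int idleProcessingThreads) {
        if (idleProcessingThreads <= 0) {
            return 0; // no idle processing threads: a new query must wait
        }
        return freeOffHeapBytes / idleProcessingThreads;
    }

    public static void main(String[] args) {
        // 1 GB free and 4 idle threads: each new query may take up to 256 MB,
        // so in the worst case resources are divided as fairly as the
        // current fixed-buffer approach.
        System.out.println(maxGrantBytes(1L << 30, 4));
    }
}
```

If cheaper queries finish quickly, the freed memory flows back into the numerator, so a long-running expensive query can grow beyond the initial fair share.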
Dropping a comment to follow the thread.
> To prevent a situation where one or several expensive queries take all available off-heap memory, dynamic constraints could be added, e.g. no query may receive more memory than the current amount of free off-heap memory divided by the current number of idle processing threads.
That's the situation I was concerned about (an expensive query takes all available off-heap memory).
Also, with this idea implemented, it would be possible to set different constraints for different queries, e.g. not allowing low-priority queries to take a lot of memory.
The problems mentioned here would be great to solve. My understanding is: Druid would allocate a big block of memory (of configured size) on startup, and queries would get chunks off of it. If the configured size is more than 2G, the "big block of memory" could be a list of ByteBuffer objects of 2G each, or a big Memory object. This would all be hidden inside an Allocator abstraction from which the rest of Druid (queries, for example) would request memory.
The biggest problem in such a scenario is dealing with fragmentation. One option is that the Allocator would only give out chunks of two possible sizes (say, 4MB and 64K; these sizes could be made configurable if necessary). That way there is no fragmentation problem, as you keep giving out 64K chunks from the head of the big memory block and 4MB chunks from the tail.
Druid query implementations would need to be adjusted to work within this constraint. For example, the GroupByQuery hash table implementation shouldn't assume that a big 1-2 GB block of memory is available, but rather that a list of 4MB memory chunks (not contiguous in address space) can be requested from the Allocator; the same goes for the TopN query, etc. This might also require putting limits on how big a single row could be, e.g. it can't get bigger than 4MB.
64K (smaller) chunks would be used for temporary decompression buffers etc.
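The head/tail idea can be sketched as follows (a toy illustration with hypothetical names, tracking offsets only; the two regions grow toward each other, so neither size fragments the other):

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Sketch of the two-size, head/tail allocation idea described above.
 * Small 64K chunks are handed out from the head of one big memory block,
 * large 4MB chunks from the tail; freed chunks are reused via free lists.
 */
public class TwoSizeAllocator {
    static final long SMALL = 64 * 1024;
    static final long LARGE = 4 * 1024 * 1024;

    private long head = 0;  // next free offset for small chunks
    private long tail;      // one past the last free offset for large chunks
    private final Deque<Long> freedSmall = new ArrayDeque<>();
    private final Deque<Long> freedLarge = new ArrayDeque<>();

    public TwoSizeAllocator(long capacity) {
        this.tail = capacity;
    }

    /** Returns the offset of a 64K chunk, or -1 if the block is exhausted. */
    public long allocateSmall() {
        if (!freedSmall.isEmpty()) return freedSmall.pop();
        if (head + SMALL > tail) return -1;
        long off = head;
        head += SMALL;
        return off;
    }

    /** Returns the offset of a 4MB chunk, or -1 if the block is exhausted. */
    public long allocateLarge() {
        if (!freedLarge.isEmpty()) return freedLarge.pop();
        if (tail - LARGE < head) return -1;
        tail -= LARGE;
        return tail;
    }

    public void freeSmall(long off) { freedSmall.push(off); }
    public void freeLarge(long off) { freedLarge.push(off); }
}
```

A real version would need thread safety; this only shows why two fixed sizes growing from opposite ends cannot fragment.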
> "big block of memory" could be a list of ByteBuffer objects of 2G each or a big Memory object.
No, the problem is that it must be just one contiguous block of memory, not a list of blocks. With a list of blocks, the logic in the tight loop (e.g. in topN) becomes considerably more expensive.
> The biggest problem in such a scenario is dealing with fragmentation. One option is that the Allocator would only give out chunks of two possible sizes (say, 4MB and 64K; these sizes could be made configurable if necessary).
There is no need for blocks of different sizes. All blocks could be of the same size, hence no fragmentation.
> There is no need for blocks of different sizes. All blocks could be of the same size, hence no fragmentation.
What size ByteBuffer would groupBy query processing, topN query processing, and temporary decompression buffers ask for?
GroupBy, TopN, and decompression already use different sets of buffers. Also, in the long run decompression buffers should not be needed, see https://github.com/druid-io/druid/issues/4080
> GroupBy and decompression use different sets of buffers already.
My goal would be that the Druid process does not allocate any more memory than configured for the "big memory block", so decompression buffers should come from the same big allocation if they exist.
However, what size buffer would be requested by groupBy and topN query processing? Would they request multiple chunks of xx MB? In that case, do they need to be contiguous in address space?
> My goal would be that the Druid process does not allocate any more memory than configured for the "big memory block", so decompression buffers should come from the same big allocation if they exist.
Yes, that would be better. Then the compression block size in the segment format should be coordinated with the virtual chunk size in this "big memory block", so that decompression could claim exactly one chunk. Currently the compression block size in the segment format is 64K.
> However, what size buffer would be requested by groupBy and topN query processing?
As much as they need, but not more than remaining free memory / # of idle processing threads.
> Will they request multiple chunks of xx MB?
Yes, or multiple chunks of 64K size.
> In that case, do they need to be contiguous in address space?
No, but they all need to be part of the larger "big memory block", which should be contiguous as a whole.
> No, but they all need to be part of the larger "big memory block", which should be contiguous as a whole.
However, this is not strictly needed, because we could virtually convert multiple direct memory allocations of < 2 GB each into one "giant" block. The whole point of this requirement is the ability to operate on a single long offset in certain hot loops, instead of a ByteBuffer/Memory + offset pair.
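A sketch of what "virtually one giant block" could mean (illustrative names; the 1 GB backing size is an assumption chosen so the split is pure bit arithmetic): hot loops carry only a single long, and the backing-buffer index and local offset are recovered with cheap shifts at the access site.

```java
/**
 * Sketch of virtually joining several direct allocations of up to 2 GB
 * into one address space indexed by a single long offset.
 */
public class VirtualGiantBuffer {
    // Each backing allocation is 1 GB so the local offset fits in 30 bits.
    static final int SHIFT = 30;
    static final long LOCAL_MASK = (1L << SHIFT) - 1;

    /** Which backing buffer a global offset falls into. */
    public static int bufferIndex(long globalOffset) {
        return (int) (globalOffset >>> SHIFT);
    }

    /** The offset within that backing buffer. */
    public static int localOffset(long globalOffset) {
        return (int) (globalOffset & LOCAL_MASK);
    }

    public static void main(String[] args) {
        long off = (2L << SHIFT) + 12345; // byte 12345 of the third 1 GB buffer
        System.out.println(bufferIndex(off));
        System.out.println(localOffset(off));
    }
}
```

The cost relative to a truly contiguous allocation is the extra shift/mask plus an array lookup per access, which is why a single real block (#3892) is still preferable in the hottest loops.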
This proposal looks good. Regarding the situation when an expensive query blocks all other queries, @leventov's solution (https://github.com/druid-io/druid/issues/4422#issuecomment-309556971) sounds good, but, in the end, I think it should be addressed as a separate issue like resource preemption.
BTW there is a relatively simple (though hacky) workaround for ByteBuffer's 2GB limitation without implementing #3892 in full: we can manage a single contiguous block of off-heap memory as a Memory object and convert slices of it to ByteBuffer for further use in Druid's core query processing logic using the unsafeByteBufferView() method.
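The same zero-copy "view" idea can be shown with plain java.nio (the thread proposes doing this with a Memory object and unsafeByteBufferView(); this sketch only demonstrates that a slice shares storage with the big backing allocation):

```java
import java.nio.ByteBuffer;

/**
 * Zero-copy slicing of one big direct allocation into smaller
 * ByteBuffer views that share the same underlying memory.
 */
public class SliceView {
    /** Returns a view over [offset, offset + length) of the backing buffer. */
    public static ByteBuffer view(ByteBuffer backing, int offset, int length) {
        ByteBuffer dup = backing.duplicate(); // independent position/limit
        dup.position(offset);
        dup.limit(offset + length);
        return dup.slice(); // no copy: same storage, new zero-based indexing
    }

    public static void main(String[] args) {
        ByteBuffer big = ByteBuffer.allocateDirect(1 << 20); // one big allocation
        ByteBuffer chunk = view(big, 64 * 1024, 64 * 1024);  // a 64K view
        chunk.putInt(0, 42);
        System.out.println(big.getInt(64 * 1024)); // same underlying memory
    }
}
```

With Memory the view could exceed the 2 GB limit of a single ByteBuffer, which is the point of the workaround; plain java.nio caps the backing buffer at 2 GB.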
This issue has been marked as stale due to 280 days of inactivity. It will be closed in 4 weeks if no further activity occurs. If this issue is still relevant, please simply write any comment. Even if closed, you can still revive the issue at any time or discuss it on the dev@druid.apache.org list. Thank you for your contributions.
This issue has been closed due to lack of activity. If you think that is incorrect, or the issue requires additional review, you can revive the issue at any time.
Currently we have a pool of large buffers (they are usually configured so); for each query that needs off-heap memory, one buffer is allocated to the query. If the query requires more off-heap memory than the buffer size, the data is processed in several runs, each run on a subset of dimension values, clearing the buffer between the runs.
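The cost of the current scheme can be made concrete with a hedged sketch (names illustrative): the number of full passes over the data grows with how far the query's memory need exceeds one pooled buffer.

```java
/**
 * Sketch of the "multiple runs" behavior described above: a query whose
 * aggregation state does not fit in one pooled buffer is processed in
 * several passes, each over a subset of dimension values.
 */
public class MultipleRuns {
    /** Number of passes over the data, i.e. ceil(bytesNeeded / bufferSize). */
    public static int numRuns(long bytesNeeded, long bufferSize) {
        return (int) ((bytesNeeded + bufferSize - 1) / bufferSize);
    }

    public static void main(String[] args) {
        // A query needing 2.5 GB of state with 1 GB pooled buffers runs
        // 3 times, scanning the segments three times in total.
        System.out.println(numRuns(2_684_354_560L, 1_073_741_824L));
    }
}
```

Each extra run is a full scan of the input, which is why right-sizing the buffers matters so much under the current approach.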
Problems:
The different approach
Off-heap memory is allocated in relatively small, fixed-size chunks, e.g. 64K. A fixed chunk size is needed to avoid fragmentation and to allow simple, lock-free allocation logic.
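One way such lock-free allocation could look (an illustrative sketch, not Druid code; it ignores the ABA problem that a production version would have to handle): free chunks form a Treiber stack of chunk indexes, so allocate and free are each a single CAS, and since every chunk is the same size there is nothing to fragment.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.atomic.AtomicIntegerArray;

/** Lock-free pool of equal-size chunks, identified by index. */
public class ChunkPool {
    private final AtomicIntegerArray next; // next[i] = next free chunk after i
    private final AtomicInteger head = new AtomicInteger();

    public ChunkPool(int numChunks) {
        next = new AtomicIntegerArray(numChunks);
        for (int i = 0; i < numChunks; i++) {
            next.set(i, i + 1); // initially all chunks are free, linked in order
        }
        head.set(0);
    }

    /** Pops a free chunk index, or returns -1 if the pool is exhausted. */
    public int allocate() {
        while (true) {
            int h = head.get();
            if (h == next.length()) return -1; // sentinel: free list is empty
            if (head.compareAndSet(h, next.get(h))) return h;
        }
    }

    /** Pushes a chunk index back onto the free list. */
    public void free(int chunk) {
        while (true) {
            int h = head.get();
            next.set(chunk, h);
            if (head.compareAndSet(h, chunk)) return;
        }
    }
}
```

The chunk index maps to a byte offset in the giant buffer as `index * chunkSize`, so the pool never deals in pointers at all.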
When a query needs more off-heap memory, it is provided with a list of allocated chunks. But it is not a list of separate ByteBuffer objects (that would kill query processing performance); it is a list of offsets within a single giant ByteBuffer.

Currently, space allocation for a dimension value looks like this:
Will be just a little more complex:
Otherwise the query processing speed shouldn't be affected.
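The two snippets referenced above are not preserved in this thread; the following is only a hedged guess at the shape of the comparison (all names illustrative). The "little more complex" part is one extra branch: when the current chunk fills up, continue at the next pre-allocated chunk's base offset within the giant buffer instead of aborting the run.

```java
/**
 * Illustrative old-vs-new allocation for a dimension value's state.
 * Old: a bump pointer bounded by one per-query buffer.
 * New: the same bump pointer, plus a hop to the next chunk when full.
 */
public class DimensionValueAllocation {
    /** Old style: returns the write offset, or -1 meaning "buffer full, start another run". */
    public static long allocateOld(long position, int numBytesPerRecord, long bufferCapacity) {
        if (position + numBytesPerRecord > bufferCapacity) {
            return -1; // fall back to another pass over the data
        }
        return position; // caller advances position by numBytesPerRecord
    }

    /** New style: one extra branch hops to the next chunk of the giant buffer. */
    public static long allocateNew(long position, int numBytesPerRecord,
                                   long chunkEnd, long nextChunkBase) {
        if (position + numBytesPerRecord > chunkEnd) {
            return nextChunkBase; // continue in the next allocated chunk
        }
        return position;
    }
}
```

In the common case both paths are a compare and a bump; the new path merely replaces "give up and rerun" with "jump to the next chunk".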
What problems it solves:
Problem with this approach
Currently, ByteBuffer size is limited to 2 GB, which complicates resource sharing. It would be better to allocate a single giant memory area for all off-heap memory. #3892 will solve this.