Open alamb opened 2 years ago
This got me interested, so I started looking into it, but I'm not sure how we aim to tackle it.
My first idea was to write a decorator that implements ExecutionPlan and replaces the allocator in some way, but that seems infeasible: it requires replacing the GlobalAllocator (which is far more intrusive than I imagine DataFusion ever wanting to be), and it will not work well given concurrency (we can't know which thread allocated that memory). An example of a GlobalAllocator decorator can be found at: https://github.com/cuviper/alloc_geiger
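To make the allocator-decorator idea concrete, here is a minimal sketch using only the standard library's GlobalAlloc trait. The names CountingAllocator and ALLOCATED are made up for illustration and are not part of DataFusion or alloc_geiger. It also shows the limitation mentioned above: the counter is process-wide, so nothing in it says which operator an allocation belongs to.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical decorator around the system allocator (illustration only).
pub struct CountingAllocator;

/// Bytes currently allocated, summed over the whole process.
pub static ALLOCATED: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        // Delegate to the system allocator, then record the size on success.
        let ptr = unsafe { System.alloc(layout) };
        if !ptr.is_null() {
            ALLOCATED.fetch_add(layout.size(), Ordering::Relaxed);
        }
        ptr
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) };
        ALLOCATED.fetch_sub(layout.size(), Ordering::Relaxed);
    }
}

// Replaces the allocator for the entire binary, which is the intrusive part:
// only one global allocator can be registered per program.
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;
```

Reading ALLOCATED before and after running a plan would give a rough total, but with several operators running concurrently there is no obvious way to split that total per ExecutionPlan.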
The other approach I found (from the article below) is implementing something similar to Servo's malloc_size_of, which can be found at: https://github.com/servo/servo/blob/faf3a183f3755a9986ec4379abadf3523bd8b3c0/components/malloc_size_of/lib.rs
This solution is quite intrusive (from what I can see) and requires manually "registering" every memory allocation so that it adds up to a per-ExecutionPlan sum.
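A rough sketch of what a malloc_size_of-style trait could look like is below. HeapSize and JoinState are hypothetical names, not Servo's or DataFusion's API, and this version approximates heap usage from capacity() instead of asking the allocator for the real block size.

```rust
use std::mem::size_of;

/// Hypothetical trait: bytes owned on the heap by a value,
/// not counting the size of the value itself.
pub trait HeapSize {
    fn heap_size(&self) -> usize;
}

impl HeapSize for u64 {
    fn heap_size(&self) -> usize {
        0 // plain integers own no heap memory
    }
}

impl HeapSize for String {
    fn heap_size(&self) -> usize {
        self.capacity()
    }
}

impl<T: HeapSize> HeapSize for Vec<T> {
    fn heap_size(&self) -> usize {
        // The buffer itself plus whatever each element owns on the heap.
        self.capacity() * size_of::<T>()
            + self.iter().map(HeapSize::heap_size).sum::<usize>()
    }
}

/// Made-up operator state, standing in for something like a join's build side.
pub struct JoinState {
    pub keys: Vec<u64>,
    pub names: Vec<String>,
}

impl HeapSize for JoinState {
    fn heap_size(&self) -> usize {
        self.keys.heap_size() + self.names.heap_size()
    }
}
```

The intrusive part is visible here: every type that holds memory needs its own impl, and a field that is forgotten is silently left out of the sum.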
Not sure where to go from here, would love to hear some feedback.
Some references: https://rust-analyzer.github.io/blog/2020/12/04/measuring-memory-usage-in-rust.html
I was kind of imagining we would have to do something like manually registering memory allocations. The malloc_size_of trait is a cool idea.
While it would likely be crazy complicated to do this for all allocations, I think all the built-in DataFusion operators use most of their memory in intermediate RecordBatches and potentially a single large structure (e.g. the hash tables in hash_join and hash_aggregate). If we captured these large sources, I think that would get us most of the value.
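One way the "register only the big things" idea could look is sketched below. OperatorMemory and Registration are hypothetical types, not existing DataFusion APIs, and tracking a peak value is my own assumption. An operator would register the byte size of each intermediate batch or hash table and hold the returned guard for as long as that memory is alive.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical per-operator memory counter (not a DataFusion type).
#[derive(Clone, Default)]
pub struct OperatorMemory {
    current: Arc<AtomicUsize>,
    peak: Arc<AtomicUsize>,
}

impl OperatorMemory {
    /// Record `bytes` as in use; the count drops again when the guard is dropped.
    pub fn register(&self, bytes: usize) -> Registration {
        let now = self.current.fetch_add(bytes, Ordering::Relaxed) + bytes;
        self.peak.fetch_max(now, Ordering::Relaxed);
        Registration { owner: self.clone(), bytes }
    }

    /// Bytes currently registered by this operator.
    pub fn current(&self) -> usize {
        self.current.load(Ordering::Relaxed)
    }

    /// Highest value `current` has reached so far.
    pub fn peak(&self) -> usize {
        self.peak.load(Ordering::Relaxed)
    }
}

/// RAII guard tied to one large allocation (a batch, a hash table, ...).
pub struct Registration {
    owner: OperatorMemory,
    bytes: usize,
}

impl Drop for Registration {
    fn drop(&mut self) {
        self.owner.current.fetch_sub(self.bytes, Ordering::Relaxed);
    }
}
```

For example, an operator that registers a 1024-byte batch and a 4096-byte hash table would report 5120 bytes in use until either guard is dropped. Small allocations are ignored entirely, which is the trade-off suggested above.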
Cool, so I dug through the code a bit, and this seems to be a bit out of my league (needs high familiarity with way too many things). Thank you for the response!
Is your feature request related to a problem or challenge? Please describe what you are trying to do. When reviewing a plan, it would be nice to know the amount of memory each individual ExecutionPlan allocated during its execution.
Describe the solution you'd like Add two new metrics to all operators:
"Allocated" should include both memory in created record batches as well as any internal memory (as described in #898 -- hopefully this code would just use the same underlying allocation measurement)
Describe alternatives you've considered A clear and concise description of any alternative solutions or features you've considered.
Additional context Probably could follow the same model as https://github.com/apache/arrow-datafusion/issues/866 (baseline metrics for all operators) once that is implemented
https://github.com/apache/arrow-datafusion/issues/898 is for tracking overall memory allocations across all operators in a plan. This issue is for tracking the allocations for each individual operator