Open haohuaijin opened 2 months ago
https://github.com/apache/arrow-rs/issues/3960 is likely related.
Tagging @jhorstmann @waynexia @alamb
I personally prefer the allocator approach which gives the most accurate stats in theory. Given we in some aspects encourage sharing underlying buffer and arrays, only changing the behavior of .capacity()
would still cause problems for end users when they want to track how many physical memories are used somewhere (e.g., in one query session).
I personally prefer the allocator approach which gives the most accurate stats in theory.
I agree with @waynexia -- if the usecase is "accurately track the total memory used across some number of Arrays which (potentially) share underlying Buffers
", I think this needs a hook into the underlying allocator. There is proposed PR https://github.com/apache/arrow-rs/pull/6336 which might be worth a look (I haven't had a chance to really understand it yet myself)
I don't know if this is the right place to post this, but if this is being reworked, it would also be very useful to have a total memory usage method available within builders, especially for nested stuff. I'm writing a parser which outputs record batches at a static number of rows, but that doesn't work very well when the goal is to keep memory usage roughly constant. It would be nice to be able to get a roughly accurate measure of memory usage in builders in addition to arrays/buffers.
Builders are an interesting point, but come with two potential challenges to be aware of:
This means any "eager" memory computation is likely to work poorly, however, the lazy approach as currently implemented for arrays could actually work quite well.
I could see a world where we remove lazy memory tracking for arrats in favour of eager allocation tracking, and add lazy memory tracking for builders
That would be super useful for my use case
As this appears to have stalled slightly I have created a POC in https://github.com/apache/arrow-rs/pull/6590 for what I think eager memory tracking for arrays could look like.
I don't really have the capacity to drive this initiative, but I think the approach in the PR should be relatively straightforward for someone to pick up and run with.
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
See this comment https://github.com/apache/arrow-rs/issues/6363#issuecomment-2366788331, when the
Buffer
's data is shared between manyBuffer
, thecapacity()
method will always return the total buffer memory usage, which causes the issue #6363, we need discussion the better memory track for shared buffers. https://github.com/apache/arrow-rs/blob/d05cf6d5e74e79ddcacaa4a68bddaba230b0f163/arrow-buffer/src/buffer/immutable.rs#L166-L168Describe the solution you'd like
The easy way to do this is to return the length and the unused capacity of the
Buffer
like pr #6438,Originally posted by @tustvold in https://github.com/apache/arrow-rs/issues/6438#issuecomment-2367426442
But as @tustvold said, we need more discussion.
Describe alternatives you've considered
Additional context