apache / arrow-rs

Official Rust implementation of Apache Arrow
https://arrow.apache.org/
Apache License 2.0
2.57k stars 779 forks source link

[DISSCUSSION] the memory tracking approach that better handles shared buffers #6439

Open haohuaijin opened 1 month ago

haohuaijin commented 1 month ago

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

See this comment https://github.com/apache/arrow-rs/issues/6363#issuecomment-2366788331, when the Buffer's data is shared between many Buffer, the capacity() method will always return the total buffer memory usage, which causes the issue #6363, we need discussion the better memory track for shared buffers. https://github.com/apache/arrow-rs/blob/d05cf6d5e74e79ddcacaa4a68bddaba230b0f163/arrow-buffer/src/buffer/immutable.rs#L166-L168

Describe the solution you'd like

The easy way to do this is to return the length and the unused capacity of the Buffer like pr #6438,

    pub fn capacity(&self) -> usize {
        self.length + self.data.capacity() - self.data.len()
    }
          This is changing what these methods report to something else that I'm not sure is correct. Whilst this formulation may benefit your use-case, there are common situations where it will now under report... Perhaps you might file a ticket to discuss a memory tracking approach that better handles shared buffers, as that is the actual issue here. This is not a bug

Originally posted by @tustvold in https://github.com/apache/arrow-rs/issues/6438#issuecomment-2367426442

But as @tustvold said, we need more discussion.

Describe alternatives you've considered

Additional context

tustvold commented 1 month ago

https://github.com/apache/arrow-rs/issues/3960 is likely related.

Tagging @jhorstmann @waynexia @alamb

waynexia commented 1 month ago

I personally prefer the allocator approach which gives the most accurate stats in theory. Given we in some aspects encourage sharing underlying buffer and arrays, only changing the behavior of .capacity() would still cause problems for end users when they want to track how many physical memories are used somewhere (e.g., in one query session).

alamb commented 1 month ago

I personally prefer the allocator approach which gives the most accurate stats in theory.

I agree with @waynexia -- if the usecase is "accurately track the total memory used across some number of Arrays which (potentially) share underlying Buffers", I think this needs a hook into the underlying allocator. There is proposed PR https://github.com/apache/arrow-rs/pull/6336 which might be worth a look (I haven't had a chance to really understand it yet myself)

kszlim commented 1 week ago

I don't know if this is the right place to post this, but if this is being reworked, it would also be very useful to have a total memory usage method available within builders, especially for nested stuff. I'm writing a parser which outputs record batches at a static number of rows, but that doesn't work very well when the goal is to keep memory usage roughly constant. It would be nice to be able to get a roughly accurate measure of memory usage in builders in addition to arrays/buffers.

tustvold commented 1 week ago

Builders are an interesting point, but come with two potential challenges to be aware of:

This means any "eager" memory computation is likely to work poorly, however, the lazy approach as currently implemented for arrays could actually work quite well.

I could see a world where we remove lazy memory tracking for arrats in favour of eager allocation tracking, and add lazy memory tracking for builders

kszlim commented 1 week ago

That would be super useful for my use case

tustvold commented 1 week ago

As this appears to have stalled slightly I have created a POC in https://github.com/apache/arrow-rs/pull/6590 for what I think eager memory tracking for arrays could look like.

I don't really have the capacity to drive this initiative, but I think the approach in the PR should be relatively straightforward for someone to pick up and run with.