apache / arrow

Apache Arrow is the universal columnar format and multi-language toolbox for fast data interchange and in-memory analytics
https://arrow.apache.org/
Apache License 2.0

[C++] ReadRangeCache should not retain data after read #32846

Open asfimport opened 2 years ago

asfimport commented 2 years ago

I've added a unit test of the issue here: https://github.com/westonpace/arrow/tree/experiment/read-range-cache-retention

We use the ReadRangeCache for pre-buffering IPC and parquet files. Sometimes those files are quite large (gigabytes). The usage is roughly:

```
for X in num_row_groups:
    CacheAllThePiecesWeNeedForRowGroupX
    WaitForPiecesToArriveForRowGroupX
    ReadThePiecesWeNeedForRowGroupX
```

However, once we've read in row group X and passed it on to Acero, etc., we do not release the data for row group X. The read range cache's entries vector still holds a pointer to the buffer. The data is not released until the file reader itself is destroyed, which only happens when we have finished processing an entire file.

This leads to excessive memory usage when pre-buffering is enabled.

This could potentially be a little difficult to implement because a single cache entry could be shared by multiple read ranges, so we will need some kind of reference counting to know when we have fully finished with an entry and can release it.

Reporter: Weston Pace / @westonpace
Assignee: Percy Camilo Triveño Aucahuasi / @aucahuasi
Watchers: Rok Mihevc / @rok

Related issues:

Note: This issue was originally created as ARROW-17599. Please see the migration documentation for further details.

asfimport commented 2 years ago

David Li / @lidavidm: You could perhaps store each entry as a custom Buffer subclass (which delegates to the actual buffer) and hand out shared pointers of that.
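
A minimal sketch of that idea, assuming a hypothetical delegating class (CacheEntryBuffer and its release hook are illustrative, not existing Arrow APIs): the cache hands out shared pointers to a wrapper, and the wrapper's destructor tells the cache when the last reader reference is gone.

```cpp
#include <functional>
#include <memory>
#include <utility>

#include "arrow/buffer.h"

// Hypothetical delegating buffer: exposes the cached buffer's memory, keeps
// the real buffer alive while any reader holds the wrapper, and notifies
// the cache on destruction so the entry can be released.
class CacheEntryBuffer : public arrow::Buffer {
 public:
  CacheEntryBuffer(std::shared_ptr<arrow::Buffer> wrapped,
                   std::function<void()> on_release)
      : arrow::Buffer(wrapped->data(), wrapped->size()),
        wrapped_(std::move(wrapped)),
        on_release_(std::move(on_release)) {}

  ~CacheEntryBuffer() override {
    if (on_release_) on_release_();  // e.g. decrement the entry's refcount
  }

 private:
  std::shared_ptr<arrow::Buffer> wrapped_;
  std::function<void()> on_release_;
};
```

This would also provide the reference counting mentioned above: the cache can erase its entry once every handed-out wrapper has been destroyed.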

asfimport commented 2 years ago

Percy Camilo Triveño Aucahuasi / @aucahuasi: Should ReadRangeCache::read remove the cache entry after performing the read?

From the documentation, it is not clear whether that method should remove the cache entry; I did a simple experiment (removing the range) and the unit test provided by @westonpace passes:

```cpp
if (it != entries.end() && it->range.Contains(range)) {
  ...
  this->entries.erase(it);
  ...
}
```

This is just an experiment to understand the issue better.

Also, I tried to explore @lidavidm's idea, but I think I need more hints about how we can store each cache entry as a custom buffer; so far, what I understand is that the data is being wrapped/eaten by the RandomAccessFile, and that is the reason why the release won't happen until the file reader is destroyed (there is no way to access the internal data buffer held by RandomAccessFile).

Weston, it would be great to know the full use case you were running; right now I'm using the unit test, but it would help to replicate the issue with the full use case locally (maybe the use case needs an override of ReadRangeCache::read that can delete the range at the end).

asfimport commented 2 years ago

Weston Pace / @westonpace:

Should ReadRangeCache::read remove the cache entry after performing the read?

Yes. I don't think this is mentioned in the documentation. It may not have been a concern at the time. I think we should also update the documentation so that we are very clear that this happens.

Also, I tried to explore David Li's idea, but I think I need more hints about how we can store each cache entry as a custom buffer; so far, what I understand is that the data is being wrapped/eaten by the RandomAccessFile, and that is the reason why the release won't happen until the file reader is destroyed (there is no way to access the internal data buffer held by RandomAccessFile).

The FileReader owns a single instance of ReadRangeCache. That instance won't be deleted until the FileReader is deleted. The ReadRangeCache has a vector of RangeCacheEntry. Currently, nothing removes items from that vector. A RangeCacheEntry has a Future<std::shared_ptr<Buffer>>. Once that future has been filled it will hold the result (in case callbacks are added later) and so this will keep the buffer alive (because there is still a shared_ptr referencing it).
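
Paraphrasing that chain as a simplified structure (the real definitions live in arrow/io/caching.h; this is approximate, not verbatim):

```cpp
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"  // arrow::io::ReadRange
#include "arrow/util/future.h"

// Approximate shape of the types described above.
struct RangeCacheEntry {
  arrow::io::ReadRange range;
  // Once filled, the future retains its result (in case callbacks are added
  // later), so it keeps the buffer alive via this shared_ptr.
  arrow::Future<std::shared_ptr<arrow::Buffer>> future;
};

class ReadRangeCache {
  // Nothing currently erases from this vector, so every cached buffer stays
  // reachable until the FileReader (which owns the cache) is destroyed.
  std::vector<RangeCacheEntry> entries;
};
```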

Weston, it would be great to know the full use case you were running; right now I'm using the unit test, but it would help to replicate the issue with the full use case locally (maybe the use case needs an override of ReadRangeCache::read that can delete the range at the end).

The use case is described in more detail in https://issues.apache.org/jira/browse/ARROW-17590 (which has a reproducing script) but a slightly more involved test would be:

Create a 4GiB parquet file with 20 row groups, each about 200MiB. Scan the file with pyarrow's to_batches (just count the rows or something). The scanner should only read at most 2 row groups at a time, so I'd expect to see around 0.5GiB peak RAM. However, in practice, you will see 4GiB peak RAM.
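
For reference, a rough C++ equivalent of that scenario (the description above uses pyarrow's to_batches; the path and reading pattern here are illustrative, not the exact reproducer):

```cpp
#include <memory>
#include <string>

#include "arrow/io/file.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/table.h"
#include "parquet/arrow/reader.h"
#include "parquet/properties.h"

// Scan a large Parquet file row group by row group with pre-buffering on.
// Without a fix, peak RSS approaches the whole file size because the
// ReadRangeCache keeps every already-read row group's buffers alive.
arrow::Status ScanWithPreBuffer(const std::string& path) {
  ARROW_ASSIGN_OR_RAISE(auto file, arrow::io::ReadableFile::Open(path));

  parquet::ArrowReaderProperties props;
  props.set_pre_buffer(true);  // route reads through the ReadRangeCache

  parquet::arrow::FileReaderBuilder builder;
  ARROW_RETURN_NOT_OK(builder.Open(file));
  builder.properties(props);
  std::unique_ptr<parquet::arrow::FileReader> reader;
  ARROW_RETURN_NOT_OK(builder.Build(&reader));

  for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->RowGroup(rg)->ReadTable(&table));
    // ... count rows, etc.; `table` is dropped here, but the cache still
    // holds the underlying buffers until `reader` is destroyed ...
  }
  return arrow::Status::OK();
}
```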

asfimport commented 2 years ago

Percy Camilo Triveño Aucahuasi / @aucahuasi: Thanks Weston,

Should ReadRangeCache::read remove the cache entry after performing the read?

Yes. I don't think this is mentioned in the documentation. It may not have been a concern at the time. I think we should also update the documentation so that we are very clear that this happens.

It seems that ParquetFileReader::PreBuffer was implemented under a different assumption. From the API docs:

"After calling this, creating readers for row groups/column indices that were not buffered may fail. {}Creating multiple readers for the a subset of the buffered regions is acceptable{}. This may be called again to buffer a different set of row groups/columns."

I did run the script provided in ARROW-17590 and was able to reproduce the issue. Also, I was able to confirm that we read the same cache entry multiple times, and that removing the entry after ReadRangeCache::read breaks the contract required by ParquetFileReader::PreBuffer (see the sketch below).
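
To make that contract concrete, a hedged sketch (assuming a parquet::arrow::FileReader plus an IOContext and CacheOptions in scope; the function name and argument choices are illustrative):

```cpp
// Per the quoted contract, multiple reads over a buffered subset must work.
arrow::Status ReadTwice(parquet::arrow::FileReader* reader,
                        const arrow::io::IOContext& io_context,
                        const arrow::io::CacheOptions& cache_options) {
  reader->parquet_reader()->PreBuffer(/*row_groups=*/{0}, /*column_indices=*/{0},
                                      io_context, cache_options);
  std::shared_ptr<arrow::Table> first, second;
  ARROW_RETURN_NOT_OK(reader->RowGroup(0)->ReadTable(&first));
  // If ReadRangeCache::read erased entries, this second read would miss the
  // cache, violating PreBuffer's documented guarantee:
  ARROW_RETURN_NOT_OK(reader->RowGroup(0)->ReadTable(&second));
  return arrow::Status::OK();
}
```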

I'll keep investigating, any other ideas are more than welcome!

asfimport commented 2 years ago

Weston Pace / @westonpace: Ack. I was not really aware that was how the parquet reader operated. That comment is very helpful. Hmm, in that case maybe a better fix is to improve how we scan parquet files. Currently we get an async generator from a parquet reader for the entire file. The code for it is here.

This prebuffers the entire range of row groups before we even start reading. In practice I think we only want to prebuffer a row group right before we're ready to actually read that row group.
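
A sketch of that alternative (illustrative only; reuses the reader from the earlier snippets and assumes the column indices and I/O options are in scope):

```cpp
// Pre-buffer each row group just before reading it, rather than
// pre-buffering every row group of the file up front.
arrow::Status ScanLazily(parquet::arrow::FileReader* reader,
                         const std::vector<int>& column_indices,
                         const arrow::io::IOContext& io_context,
                         const arrow::io::CacheOptions& cache_options) {
  for (int rg = 0; rg < reader->num_row_groups(); ++rg) {
    reader->parquet_reader()->PreBuffer({rg}, column_indices, io_context,
                                        cache_options);
    std::shared_ptr<arrow::Table> table;
    ARROW_RETURN_NOT_OK(reader->RowGroup(rg)->ReadTable(&table));
    // ... hand `table` to Acero, then move on to the next row group ...
  }
  return arrow::Status::OK();
}
```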

asfimport commented 2 years ago

Weston Pace / @westonpace: Although the more I think about it, the less sure I am that the parquet reader API makes sense. Why would someone want to prebuffer a chunk of data and then read from it multiple times?

@lidavidm any thoughts on which approach we should take?

asfimport commented 2 years ago

David Li / @lidavidm: I think a lot of this was just because of how it was historically added: originally, the cache was added without adding an iterator interface, so the cache would necessarily have to preserve the input data. I think now that we're changing things here, we should perhaps consider adding an explicit cache-based, iterator/generator-based API so that the API contract is clear.

However, I think we still do want to pre-buffer all row groups, because that way the read coalescing can do the best job possible. That said, it probably needs benchmarking to determine what makes sense. It could work to only start reads for a row group when we read it (with the understanding that some I/O may 'spill over' into the next row group for optimal I/O patterns).

asfimport commented 2 years ago

Percy Camilo Triveño Aucahuasi / @aucahuasi: Follow-up ticket:

https://issues.apache.org/jira/browse/ARROW-18065

asfimport commented 2 years ago

Percy Camilo Triveño Aucahuasi / @aucahuasi: Another follow-up/related ticket:

https://issues.apache.org/jira/browse/ARROW-18113

asfimport commented 1 year ago

Percy Camilo Triveño Aucahuasi / @aucahuasi: Now that we have RandomAccessFile::ReadManyAsync, I would like to start gathering some ideas about how to implement the fix for this ticket by using the new capability.

In the previous attempt, we discovered that the pre-buffering process doesn't handle concurrent use and that a better long-term solution would be to separate caching and coalescing in ReadRangeCache.

Currently, I think we still need to wait for this PR (and maybe another one too) before starting work on this ticket, but it would be great to start discussing how to use this new API to solve this issue here; a rough sketch of the call surface follows.
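
To seed that discussion, a hedged sketch (ReadManyAsync is the real arrow::io::RandomAccessFile method; the surrounding structure, including how the ranges are planned and the buffers consumed, is speculative):

```cpp
#include <memory>
#include <vector>

#include "arrow/buffer.h"
#include "arrow/io/interfaces.h"
#include "arrow/result.h"
#include "arrow/status.h"
#include "arrow/util/future.h"

// Speculative usage: issue all reads for one batch of ranges at once, then
// consume and drop each buffer as soon as it is processed, so no long-lived
// cache keeps a second reference to the data.
arrow::Status ReadRanges(const std::shared_ptr<arrow::io::RandomAccessFile>& file,
                         const arrow::io::IOContext& io_context,
                         const std::vector<arrow::io::ReadRange>& ranges) {
  auto futures = file->ReadManyAsync(io_context, ranges);
  for (auto& fut : futures) {
    ARROW_ASSIGN_OR_RAISE(std::shared_ptr<arrow::Buffer> buf, fut.result());
    // ... decode/consume `buf`; once it goes out of scope the memory is
    // released, because nothing else retains a reference ...
  }
  return arrow::Status::OK();
}
```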