apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0
2.63k stars 1.41k forks source link

ColumnChunkPageWriter uses only heap memory. #2060

Open asfimport opened 7 years ago

asfimport commented 7 years ago

After PARQUET-160 was resolved, ColumnChunkPageWriter started using ConcatenatingByteArrayCollector. There are all data is collected in the List of byte[], before writing the page. No way to use direct memory for allocating buffers. ByteBufferAllocator is present in the ColumnChunkPageWriter class, but never used.

Using of java heap space in some cases can cause OOM exceptions or GC's overhead. ByteBufferAllocator should be used in the ConcatenatingByteArrayCollector or OutputStream classes.

Reporter: Vitalii Diravka / @vdiravka Assignee: Vitalii Diravka / @vdiravka

Related issues:

Note: This issue was originally created as PARQUET-1006. Please see the migration documentation for further details.

asfimport commented 3 years ago

Vitalii Diravka / @vdiravka: How to assign the task?

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: Added you to the list of contributors so parquet jiras can be assigned to you.

You've set affected version to 1.8.0. It is fine but keep in mind that it is unlikely that the potential fix for this jira will be released in the 1.8.x branch since it is a feature and not a bugfix. It might be released in the next minor release.

asfimport commented 3 years ago

Vitalii Diravka / @vdiravka: 1.8.0 is the version of Parquet, when the ticket is created. I've updated it to 1.12.0. Also I have asked in parquet dev list how to solve it in better way: Could you advise me the right way of CapacityByteArrayOutputStream usage? I am going to use it in ColumnChunkPageWriteStore instead of ByteArrayOutputStream tempOutputStream and ConcatenatingByteArrayCollector buf. The purpose of it that CapacityByteArrayOutputStream can use custom ByteBufferAllocator allocator with direct memory usage instead of bytes[] in heap (DirectByteBufferAllocator or similar other one). The question is that fine way for ColumnChunkPageWriteStore? And is it fine to provide(add) an API for ColumnChunkPageWriteStore to use CapacityByteArrayOutputStream?

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: @vdiravka, it is fine to have 1.8.0 in the affected versions field. I've just wanted to highlight that we won't release a potential fix in that branch but only in the next release based on master.

I've written my answer to the dev list. Let it be here as well so one doesn't have to search in the archives later.

CapacityByteArrayOutputStream is not only about the selectable allocator but the growing mechanism as well. Based on the its documentation you will need a good maxCapacityHint which I am not sure you have in case of a column chunk. We have size limits/hints for pages and row groups but don't have such things for column chunks. If you set the hint too high you may end up allocating too much space however, it should not be worse than the existing ByteArrayOutputStream. However, if you set it too low then you might end up too many allocations at growing which could hit performance. If you can come up with good maxCapacityHint and prove with performance tests that the change is not slower than the original, I am fine with this update. About the API. ColumnChunkPageWriteStore is not part of the public API of parquet-mr. I know it is public from java point of view but it has never meant to be used directly. It is neither a pro nor a con to add a new public method just good to know what we are extending. I think, if performance tests approve, it would be cleaner to simply change the ByteArrayOutputStream to CapacityByteArrayOutputStream without any new API.

asfimport commented 3 years ago

Vitalii Diravka / @vdiravka: Sounds reasonable and we don't need in 1.8. Updated Fix Version/s: to 1.13.0

Thanks for reposting it here, because I don't see the answer in mail, possibly I am absent in CC and I'm not subscribed to dev@parquet.apache.org Please add me in CC Agree about issue with maxCapacityHint. I will try compare performance for both cases. Could you clarify me, for heap memory we are fine to do buffering or byte collecting without maxCapacityHint but for direct memory we should specify it. Is that correct? Or it is just difference between CapacityByteArrayOutputStream and ConcatenatingByteArrayCollector?

asfimport commented 3 years ago

Gabor Szadovszky / @gszadovszky: @vdiravka,

If you are interested in the thread on the dev list you may either subscribe to the list or check the list or the thread itself on ponymail.

CapacityByteArrayOutputStream (name is misleading) requires an initial slab size, a max capacity hint and an allocator. If you want to use it you have to come up with a good hint independently from which allocator would you use. I think it is a good idea to use CapacityByteArrayOutputStream instead of the currently used ByteBufferOutputStream but you need to implement it in a way that it does not hurt performance neither for direct nor for heap allocations. ConcatenatingByteArrayCollector is for a different purpose. It is not even an OutputStream.