yossibm opened 2 years ago
I'm not an expert in the Java side of things, but I am pretty familiar with how the dataset scanner (which is in C++) works. The dataset scanner will try to read multiple files at the same time. In fact, with Parquet, it will even try to concurrently read multiple batches within a file.
In addition, the dataset scanner will read ahead a certain amount. For example, even if you only ask for one batch, it will read more than one batch. It tries to accumulate enough of a "buffer" that an I/O slowdown won't cause a hitch in processing (this is very similar, for example, to the kind of buffering that happens when you watch a YouTube video).
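As a rough illustration (not Arrow's actual implementation; all names and numbers below are made up), the readahead behaves like a bounded producer/consumer buffer: the reader can get at most N batches ahead of the consumer, so buffered memory is capped at N batches rather than at the whole file.

```python
import queue
import threading

READAHEAD = 4        # max batches buffered ahead of the consumer (illustrative)
TOTAL_BATCHES = 10

# A bounded queue caps how far the reader can get ahead of the consumer,
# so buffered memory is at most READAHEAD batches regardless of file size.
buffer = queue.Queue(maxsize=READAHEAD)

def reader():
    for i in range(TOTAL_BATCHES):
        buffer.put(f"batch-{i}")   # blocks once READAHEAD batches are queued
    buffer.put(None)               # sentinel: end of stream

threading.Thread(target=reader, daemon=True).start()

consumed = []
high_water = 0
while True:
    high_water = max(high_water, buffer.qsize())
    item = buffer.get()
    if item is None:
        break
    consumed.append(item)
```

The point of the bounded queue is that the reader stalls instead of buffering without limit, which is why the readahead settings (not the file size) determine the steady-state memory use.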
I expect Arrow to use the amount of memory that corresponds to a single batch multiplied by the number of files, but in reality the memory used is much more than the entire files.
This is not quite accurate because of the above readahead.
The files were created with the pandas default config (using pyarrow), and reading them in Java gives the correct values.
How many record batches are in each file? Do you know roughly how large each record batch is?
Reading 20 uncompressed parquet files with a total size of 3.2GB takes more than 12GB of RAM when reading them "concurrently".
Is this 3.2GB per parquet file? Or 3.2GB across all parquet files?
In 9.0.0 I think the default readahead configuration will read ahead up to 4 files and aims for about 2Mi rows per file. However, the Arrow datasets parquet reader will only read entire row groups. So, for example, if your file is one row group with 20Mi rows then it will be forced to read all 20Mi rows. I believe the pandas/pyarrow default will create 64Mi rows per row group.
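To illustrate the row-group granularity with hypothetical numbers (a sketch, not Arrow code): since the reader can only load whole row groups, the rows actually materialized round up to a row-group boundary, regardless of how few rows were requested.

```python
import math

def rows_materialized(rows_requested, rows_per_row_group):
    """Rows a reader must load when it can only read whole row groups."""
    groups_needed = math.ceil(rows_requested / rows_per_row_group)
    return groups_needed * rows_per_row_group

MI = 1024 ** 2  # "Mi" as used above: 1Mi = 1,048,576 rows

# With one 20Mi-row row group, asking for even a single row loads everything:
one_big_group = rows_materialized(1, 20 * MI)

# With 1Mi-row row groups, a 2Mi-row readahead target loads just two groups:
small_groups = rows_materialized(2 * MI, 1 * MI)
```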
So if each file is 3.2GB and each file is a single row group then I would expect to see about 4 files worth of data in memory as part of the readahead which is pretty close to 12GB of RAM.
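Plugging the numbers from this estimate into the arithmetic (assuming, hypothetically, that each file is 3.2GB and a single row group):

```python
file_readahead = 4      # default file readahead mentioned above
file_size_gb = 3.2      # assumption: each file is ~3.2 GB and one row group

readahead_estimate_gb = file_readahead * file_size_gb
# 12.8 GB, close to the ~12 GB of RAM observed
```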
You can tune the readahead (at least in C++) and the row group size (during the write) to try and find something workable with 9.0.0. Ultimately though I think we will want to someday support partial row group reads from parquet (we should be able to aim for page level resolution). This is tracked by https://issues.apache.org/jira/browse/ARROW-15759 but I'm not aware of anyone working on this at the moment so for the current time I think you are stuck with controlling the size of row groups that you are writing.
3.2GB is across all the parquet files, but when created without dictionary encoding it was around 11GB, so I suspected it loaded the entire files. I noticed the 64Mi row groups and tried much smaller sizes, such as 128, but it had the same effect. Anyway, I couldn't afford to invest more time in this, so yesterday I converted all of my files (which number many more than 20) to Feather, and it works fine. Thanks
Has anyone taken a further look at this? We are also running into an issue from Java when using the Dataset scanner, where it seems that the reader is pulling the entire file into memory, causing large memory pressure.
No one is working on this as far as I know. It is something (on the C++ side) that is on my personal roadmap. I'm hoping to get some time to poke around at this in the next release (13.0.0)
Reading 20 uncompressed parquet files with a total size of 3.2GB takes more than 12GB of RAM when reading them "concurrently".
"concurrently" means that I need to read the second file before closing the first file, not multithreading.
The data is time series, so my program needs to read all the files up to some time, and then proceed.
I expect Arrow to use the amount of memory that corresponds to a single batch multiplied by the number of files, but in reality the memory used is much more than the entire files.
The files were created with the pandas default config (using pyarrow), and reading them in Java gives the correct values.
When reading each file to the fullest and then closing it, the amount of RAM used is OK.
I have tried switching between the Netty and Unsafe memory jars, but they have the same results.
-Darrow.memory.debug.allocator=true
did not produce any errors. Trying to limit the amount of direct memory (the excess memory is outside the JVM), I have tried to replace
NativeMemoryPool.getDefault()
with NativeMemoryPool.createListenable(DirectReservationListener.instance())
or NativeMemoryPool.createListenable(.. some custom listener ..)
but the result is an exception:
Using
-XX:MaxDirectMemorySize=1g
and
-Xmx4g
had no effect anyway. The runtime is using the env variable:
_JAVA_OPTIONS="--add-opens=java.base/java.nio=ALL-UNNAMED"
on JDK 17.0.2 with Arrow 9.0.0. The code is extracted to this simple example, taken from the official documentation:
The question is: why is Arrow using so much memory in a straightforward example, and why does replacing the NativeMemoryPool result in a crash?
I guess that the excessive memory is because of extracting the dictionary, and that the JNI part of the code is reading the files fully. Maybe this would be solved if the NativeMemoryPool part was working?
Thanks