apache / arrow

Apache Arrow is a multi-language toolbox for accelerated data interchange and in-memory processing
https://arrow.apache.org/
Apache License 2.0

[C++][Parquet] Reserve Memory when Reading Binary Data Types #41224

Open WillAyd opened 3 months ago

WillAyd commented 3 months ago

Describe the enhancement requested

The performance behavior for reading binary data types through parquet is very different from that of, say, integral types. While of course the two aren't expected to be identical, I was surprised to see so many Append calls in a performance trace of the parquet reader with strings.

To illustrate, I have created the following data:

import pyarrow as pa
import pyarrow.parquet as pq

tbl1 = pa.Table.from_pydict({"col": range(10_000_000)})
pq.write_table(tbl1, "ints.parquet")

tbl2 = pa.Table.from_pydict({"col": ["foo", "bar"] * 5_000_000})
pq.write_table(tbl2, "strings.parquet")

And written two simple benchmarks against these files. read_ints.py:

import pyarrow.parquet as pq

for _ in range(10):
    pq.read_table("ints.parquet")

and read_strings.py

import pyarrow.parquet as pq

for _ in range(10):
    pq.read_table("strings.parquet")

When executing these under callgrind, here is what I see for the integer benchmark:

10,978,640,541 (55.79%)  ???:std::pair<unsigned char const*, long> snappy::DecompressBranchless<char*>(unsigned char const*, unsigned char const*, long, char*, long) [/home/willayd/mambaforge/envs/scratchpad/lib/libsnappy.so.1.2.0]
 2,594,207,330 (13.18%)  ???:snappy::MemCopy64(char*, void const*, unsigned long) [/home/willayd/mambaforge/envs/scratchpad/lib/libsnappy.so.1.2.0]
 2,395,482,666 (12.17%)  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:__memcpy_avx_unaligned_erms [/usr/lib/x86_64-linux-gnu/libc.so.6]
   810,937,500 ( 4.12%)  ???:parquet::internal::GreaterThanBitmapAvx2(short const*, long, short) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
   598,618,440 ( 3.04%)  ???:snappy::DeferMemCopy(void const**, unsigned long*, void const*, unsigned long) [/home/willayd/mambaforge/envs/scratchpad/lib/libsnappy.so.1.2.0]
   198,977,226 ( 1.01%)  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:memcpy@GLIBC_2.2.5 [/usr/lib/x86_64-linux-gnu/libc.so.6]
   177,894,713 ( 0.90%)  /usr/local/src/conda/python-3.12.1/Modules/_sre/sre_lib.h:sre_ucs1_match [/home/willayd/mambaforge/envs/scratchpad/bin/python3.12]
   133,685,200 ( 0.68%)  ???:int arrow::util::RleDecoder::GetBatchWithDict<long>(long const*, int, long*, int) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
   117,187,500 ( 0.60%)  ???:long parquet::internal::standard::DefLevelsBatchToBitmap<false>(short const*, long, long, parquet::internal::LevelInfo, arrow::internal::FirstTimeBitmapWriter*) [clone .isra.0] [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
    77,081,177 ( 0.39%)  ./elf/./elf/dl-lookup.c:do_lookup_x [/usr/lib/x86_64-linux-gnu/ld-linux-x86-64.so.2]

versus with strings

7,100,000,000 (33.57%)  ???:arrow::BaseBinaryBuilder<arrow::BinaryType>::Append(unsigned char const*, int) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
3,300,018,590 (15.60%)  ???:arrow::BufferBuilder::Append(void const*, long) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
3,300,004,000 (15.60%)  ???:arrow::ArrayBuilder::Reserve(long) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
2,601,226,650 (12.30%)  ???:parquet::(anonymous namespace)::DictByteArrayDecoderImpl::DecodeArrowDenseNonNull(int, parquet::EncodingTraits<parquet::PhysicalType<(parquet::Type::type)6> >::Accumulator*, int*) [clone .constprop.0] [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
1,671,172,008 ( 7.90%)  ./string/../sysdeps/x86_64/multiarch/memmove-vec-unaligned-erms.S:memcpy@GLIBC_2.2.5 [/usr/lib/x86_64-linux-gnu/libc.so.6]
  810,937,500 ( 3.83%)  ???:parquet::internal::GreaterThanBitmapAvx2(short const*, long, short) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
  200,000,204 ( 0.95%)  ???:arrow::ArrayBuilder::length() const [/home/willayd/mambaforge/envs/scratchpad/lib/libarrow.so.1500.2.0]
  177,894,713 ( 0.84%)  /usr/local/src/conda/python-3.12.1/Modules/_sre/sre_lib.h:sre_ucs1_match [/home/willayd/mambaforge/envs/scratchpad/bin/python3.12]
  140,038,870 ( 0.66%)  ???:int arrow::bit_util::BitReader::GetBatch<int>(int, int*, int) [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]
  117,187,500 ( 0.55%)  ???:long parquet::internal::standard::DefLevelsBatchToBitmap<false>(short const*, long, long, parquet::internal::LevelInfo, arrow::internal::FirstTimeBitmapWriter*) [clone .isra.0] [/home/willayd/mambaforge/envs/scratchpad/lib/libparquet.so.1500.2.0]

Is the string reader expected to be so heavy on Append? I am by no means an expert in the parquet format, but I believe there is a total_uncompressed_size in the column metadata that might be usable to pre-allocate the buffer for binary data, so that we don't have to spend as much time in Append calls.

The above Ir (instruction read) counts are from running the benchmarks on pyarrow 15.0.2.
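
For reference, that column chunk metadata is already visible from Python; here is a quick sketch against the strings.parquet file generated above:

import pyarrow.parquet as pq

# Column chunk metadata for the binary column in the first row group;
# total_uncompressed_size is what could drive a pre-allocation.
meta = pq.ParquetFile("strings.parquet").metadata
chunk = meta.row_group(0).column(0)
print(chunk.total_uncompressed_size, chunk.total_compressed_size)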

Component(s)

C++

WillAyd commented 3 months ago

IIUC we have the required metadata at this point in the parquet reader:

https://github.com/apache/arrow/blob/e0f31aa1d4007ebce01cd4bca369d12c4d083162/cpp/src/parquet/file_reader.cc#L117

The metadata, however, is not forwarded along to the factory function that creates the RecordReader(s). For primitive types, I think the readers work around this by doing things like PARQUET_THROW_NOT_OK(data_builder_.Reserve(num_decoded * byte_width_));, but that doesn't help much for binary types, which have no defined byte width.

I still have to research this more, but I am hopeful that forwarding that metadata can help performance and streamline some of the RecordReader code.

mapleFU commented 3 months ago

Wouldn't this depend on the read pattern and the read batch size? The cases can be broken down into the issues below:

1. When dictionary encoding is enabled:
1.1. If you read by dictionary, an implementation detail could be to reuse the same dictionary for a row group, but when multiple row groups are chosen it becomes more difficult (see the sketch after this list).
1.2. If you do not read by dictionary, Append still has to be called.
2. When PLAIN or DELTA_LENGTH_BYTE_ARRAY is used, reserving for a row group or a whole file is possible; however, a single chunk might be limited to 2GB.
3. When DELTA_BYTE_ARRAY is used, reserving based on total_uncompressed_size might be impossible.
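
For example, from the Python side (a rough sketch, assuming the col column from the files above), reading by dictionary can be requested explicitly:

import pyarrow.parquet as pq

# Keep the column dictionary-encoded instead of materializing every
# value through Append.
tbl = pq.read_table("strings.parquet", read_dictionary=["col"])
print(tbl.schema)  # col: dictionary<values=string, indices=int32>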

Besides, shouldn't the user use a smaller batch size rather than reading the whole table? I think this interface change would be harder than expected...
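
For instance (a minimal sketch, reusing strings.parquet from above), the file can be streamed in smaller record batches instead of one big read:

import pyarrow.parquet as pq

# Stream smaller record batches rather than materializing the whole table.
pf = pq.ParquetFile("strings.parquet")
for batch in pf.iter_batches(batch_size=64 * 1024):
    ...  # process each batch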

mapleFU commented 3 months ago

But anyway, if you're sure you want to read the whole file, this could be done when you want to read it as a "LargeString/LargeBinary" (considering the 2GB limit).
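
Something like this is what I have in mind (an untested sketch; I'm assuming the dataset-based read_table accepts a schema override here):

import pyarrow as pa
import pyarrow.parquet as pq

# Request large_string up front; its 64-bit offsets mean a single array
# is not capped at 2GB of character data.
schema = pa.schema([("col", pa.large_string())])
tbl = pq.read_table("strings.parquet", schema=schema)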

WillAyd commented 3 months ago

Wouldn't this depend on the read pattern and the read batch size? The cases can be broken down into the issues below:

These are great callouts. I see in the implementation that there is a ByteArrayChunkedRecordReader and a ByteArrayDictionaryRecordReader, and I assume this would only affect the former.

Besides, shouldn't the user use a smaller batch size rather than reading the whole table? I think this interface change would be harder than expected...

I am (possibly mistakenly) under the impression that the metadata for each column in a parquet file is stored by chunk. But I do have a lot to learn here - will keep this in mind as I look at the code more closely. Thanks for the notes!
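
At least from the Python side, each row group does expose its own column chunk metadata (quick sketch against strings.parquet from above):

import pyarrow.parquet as pq

# Sizes and encodings are reported per row group / column chunk.
meta = pq.ParquetFile("strings.parquet").metadata
for rg in range(meta.num_row_groups):
    chunk = meta.row_group(rg).column(0)
    print(rg, chunk.total_uncompressed_size, chunk.encodings)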

mapleFU commented 3 months ago

Yeah, I mean we can do it for some scenarios if we can detect the scenario (when we want to read the whole file, or a whole row group, into a table), but maybe that's not the common case currently.

WillAyd commented 3 months ago

Makes sense. Where I originally found this was profiling some TPC-H benchmarks using pandas + pyarrow. There could definitely be other ways of handling it as an end user, but some changes here would at least help in the "benchmarks game".