apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0

Default parquet reader to reading 64K footer #4459

Open alamb opened 1 year ago

alamb commented 1 year ago

As of https://github.com/apache/arrow-datafusion/pull/4427 it is easier to see that the DataFusion parquet reader still defaults to reading the last 8 bytes of a parquet file (which contain the 4-byte metadata length followed by the `PAR1` magic) and then does a second read to fetch the footer.

Doing two IO operations is likely not ideal, especially for object storage, where the cost of an additional read is high relative to reading a bit more data in the first read.

The suggestion is to default to reading the last 64k of a parquet file to try to capture the entire footer in a single read.

_Originally posted by @thinkharderdev in https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1032437474_
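
The trade-off described above can be sketched in a few lines. This is only an illustration in Python of the "read a tail hint first, fall back to a second read" idea, not DataFusion's Rust implementation; the names `read_footer`, `prefetch`, and `fake_parquet` are hypothetical, and the only thing taken from the Parquet format is the file ending in a 4-byte little-endian metadata length followed by the `PAR1` magic.

```python
import io
import struct

MAGIC = b"PAR1"
SUFFIX_LEN = 8  # 4-byte little-endian metadata length + 4-byte magic


def read_footer(f, file_size, prefetch=64 * 1024):
    """Return (metadata_bytes, number_of_reads) for a seekable binary file.

    `prefetch` is a hypothetical size hint: how many trailing bytes to
    fetch in the first read.
    """
    if file_size < SUFFIX_LEN:
        raise ValueError("file too small to be Parquet")

    # First read: the last `prefetch` bytes, but at least the 8-byte suffix.
    tail_len = min(file_size, max(prefetch, SUFFIX_LEN))
    f.seek(file_size - tail_len)
    tail = f.read(tail_len)

    if tail[-4:] != MAGIC:
        raise ValueError("missing PAR1 magic")
    (meta_len,) = struct.unpack("<I", tail[-8:-4])
    if meta_len + SUFFIX_LEN > file_size:
        raise ValueError("footer length exceeds file size")

    if meta_len + SUFFIX_LEN <= tail_len:
        # The whole footer was captured by the first read.
        start = tail_len - SUFFIX_LEN - meta_len
        return tail[start:-SUFFIX_LEN], 1

    # Footer is larger than the hint: a second read is needed.
    f.seek(file_size - SUFFIX_LEN - meta_len)
    return f.read(meta_len), 2


def fake_parquet(metadata, body=b""):
    """Build bytes with a Parquet-shaped tail (test helper, not real Parquet)."""
    return MAGIC + body + metadata + struct.pack("<I", len(metadata)) + MAGIC


data = fake_parquet(b"m" * 100, body=b"x" * 1000)
_, reads_with_hint = read_footer(io.BytesIO(data), len(data))            # 64k hint
_, reads_without = read_footer(io.BytesIO(data), len(data), prefetch=8)  # suffix only
print(reads_with_hint, reads_without)
```

With the 64k hint the 100-byte footer arrives in one read; with an 8-byte hint (the current default behaviour described above) the same file needs two. A footer larger than the hint still works, it just pays for the second read.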

alamb commented 1 year ago

Any thoughts @tustvold or @Ted-Jiang ?

Ted-Jiang commented 1 year ago

Makes sense to me. I think we should document, as a best practice, that users keep the footer size under 64k.
And I cannot find a tool to read the parquet footer size 😂

thinkharderdev commented 1 year ago
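#!/bin/bash
# Print the footer (metadata) length of the parquet file given as $1.
# The file ends with a 4-byte little-endian length followed by the 4-byte magic.

le=$(xxd -p -s -8 -l 4 "$1")               # length bytes as little-endian hex
be=${le:6:2}${le:4:2}${le:2:2}${le:0:2}    # swap byte order to big-endian hex
printf 'Footer has size %s=%d\n' "$le" "$((16#$be))"

./parquet_size.sh file.parquet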

😄

alamb commented 1 year ago

> And I cannot find a tool to read the parquet footer size 😂

While the bash script is quite compelling from a dependencies point of view, I have been dreaming of contributing (though I haven't found the time) to @manojkarthick's https://github.com/manojkarthick/pqrs. I think with some more contributions that tool could become "the parquet-tools I actually want to use".