Open alamb opened 1 year ago
Any thoughts @tustvold or @Ted-Jiang ?
Makes sense to me. I think we should document the best practice for users: keep the footer size under 64k.
And I can't find a tool to read the parquet footer size 😂
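#!/bin/bash
# Print the parquet footer (metadata) length: a 4-byte little-endian integer
# stored just before the trailing "PAR1" magic.
le=$(xxd -p -s -8 -l 4 "$1")
# Reverse the byte order of the hex string so it can be parsed as a number.
be=${le:6:2}${le:4:2}${le:2:2}${le:0:2}
printf 'Footer has size %s=%d\n' "$le" "$((16#$be))"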
./parquet_size.sh file.parquet
😄
> And I can't find a tool to read the parquet footer size 😂
While the bash script is quite compelling from a dependencies point of view, I have been dreaming of contributing (though I haven't found the time) to @manojkarthick's https://github.com/manojkarthick/pqrs. I think with some more contributions that tool could become "the parquet-tools I actually want to use".
As of https://github.com/apache/arrow-datafusion/pull/4427 it is easier to see that the DataFusion parquet reader still defaults to reading the last 8 bytes of a parquet file (the 4-byte metadata length followed by the `PAR1` magic) and then does a second read to fetch the footer.
Doing two IO operations is likely not ideal, especially for object storage, where the cost of an additional read is high relative to reading a bit more data in the first read.
The suggestion is to default to reading the last 64k of a parquet file to try to capture the entire footer in a single read.
_Originally posted by @thinkharderdev in https://github.com/apache/arrow-datafusion/pull/3885#discussion_r1032437474_
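
To make the suggestion concrete, here is a rough sketch in bash (not DataFusion's actual implementation) of the "read the tail once, fall back only if needed" idea. The function name `footer_reads` and the default of 65536 bytes are assumptions made for illustration. It uses `tail` and `od` on a local file to stand in for a ranged read against object storage.

```bash
#!/bin/bash
# Hypothetical helper: report how many reads are needed to fetch the footer
# when the first read speculatively grabs the last PREFETCH bytes of the file.
# Usage: footer_reads FILE [PREFETCH_BYTES]   (default 65536, an assumption)
footer_reads() {
  local file=$1 prefetch=${2:-65536} tmp have len
  tmp=$(mktemp) || return 1

  # The single speculative read: the last $prefetch bytes
  # (or the whole file, if it is smaller than that).
  tail -c "$prefetch" "$file" > "$tmp"
  have=$(wc -c < "$tmp")

  # A parquet file ends with <4-byte little-endian metadata length>"PAR1".
  if (( have < 8 )) || [ "$(tail -c 4 "$tmp")" != "PAR1" ]; then
    rm -f "$tmp"
    echo "not a parquet file: $file" >&2
    return 1
  fi

  # Decode the length from the buffer we already have; no extra IO needed.
  set -- $(tail -c 8 "$tmp" | head -c 4 | od -An -tu1)
  len=$(( $1 + ($2 << 8) + ($3 << 16) + ($4 << 24) ))
  rm -f "$tmp"

  # If metadata plus the 8-byte trailer fits in the prefetched bytes,
  # one read was enough; otherwise a second, larger read is required.
  if (( len + 8 <= have )); then
    echo "metadata_len=$len reads=1"
  else
    echo "metadata_len=$len reads=2"
  fi
}
```

Calling `footer_reads file.parquet` with the default prefetch reports `reads=1` whenever the footer is under roughly 64k, which is the case the suggestion is trying to optimize for. Passing a smaller second argument shows when the fallback read would kick in.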