turboFei closed this 3 weeks ago
On many file systems, a seek backwards to read the data after reading the footer results in slower reads because the fs switches from a sequential read to a random read (which typically turns off pre-fetching and other optimizations enabled in sequential reads). It might be worth considering if reusing the stream is worth it.
Thanks @parthchandra for the comments.
For our company's internally managed Spark, we reuse the input stream for Parquet files.
Before that change, a Spark task opened the file multiple times to read the footer and the data; when the HDFS NameNode is under high pressure, these extra opens cost time. After the change, a task opens the Parquet file only once.
This was tested about 3 years ago on Spark 2.3: reusing the stream cut roughly 2/3 of the HDFS RPC requests to the NameNode (a task went from three opens, two footer reads plus one data read, down to one). After this community Spark patch, [[SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader](https://github.com/apache/spark/pull/39950), the footer is read only once, so the solution might cut about 1/2 of the RPC requests.
It looks reasonable to me, and users can choose whichever approach best fits their setup.
cc @gszadovszky @steveloughran
@wgtmac @gszadovszky Could we merge this PR?
Thank you all.
Rationale for this change
Support passing an already-open input stream when building the ParquetFileReader, so that the existing stream can be reused and the number of open-file RPCs is reduced.
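The idea can be sketched as a constructor overload. This is an illustrative, self-contained model, not the actual `ParquetFileReader` API: `StreamReuseSketch`, `SimpleReader`, and `openFile` are hypothetical names, and the counter stands in for the NameNode RPC that each HDFS open costs.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch of the change: a reader that normally opens its own
// stream, plus a new overload that accepts an already-open stream, so the
// stream used to read the footer can be reused for the data read.
public class StreamReuseSketch {

    // Stand-in for HDFS open() calls; each one costs a NameNode RPC.
    static final AtomicInteger openRpcs = new AtomicInteger();

    static InputStream openFile() {
        openRpcs.incrementAndGet();
        return new ByteArrayInputStream(new byte[]{1, 2, 3});
    }

    static class SimpleReader {
        final InputStream in;

        // Original behavior: the reader opens the file itself.
        SimpleReader() {
            this.in = openFile();
        }

        // The proposed addition: reuse a stream the caller already opened.
        SimpleReader(InputStream existing) {
            this.in = existing;
        }
    }

    // Footer read opens the file, then the reader opens it again: 2 opens.
    static int opensWithoutReuse() {
        openRpcs.set(0);
        InputStream footerStream = openFile();
        SimpleReader reader = new SimpleReader();
        return openRpcs.get();
    }

    // The footer stream is handed to the reader: 1 open in total.
    static int opensWithReuse() {
        openRpcs.set(0);
        InputStream footerStream = openFile();
        SimpleReader reader = new SimpleReader(footerStream);
        return openRpcs.get();
    }

    public static void main(String[] args) {
        System.out.println("opens without reuse: " + opensWithoutReuse()); // 2
        System.out.println("opens with reuse:    " + opensWithReuse());    // 1
    }
}
```

Because the overload only adds a way to inject the stream, callers that keep using the no-argument path are unaffected, which is why the change is backward compatible.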
What changes are included in this PR?
As title.
Are these changes tested?
Covered by existing unit tests; this change only adds a new constructor.
Are there any user-facing changes?
No breaking changes.
Closes #3031