apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

PARQUET-3031: Support to transfer input stream when building ParquetFileReader #3030

Closed turboFei closed 3 weeks ago

turboFei commented 1 month ago

Rationale for this change

Support passing the Parquet file input stream when building the ParquetFileReader, so that we can reuse an existing input stream and reduce the number of open-file RPCs.

What changes are included in this PR?

As title.

Are these changes tested?

Covered by existing unit tests. It only adds new constructors.

Are there any user-facing changes?

No breaking changes.

Closes #3031

parthchandra commented 1 month ago

On many file systems, a seek backwards to read the data after reading the footer results in slower reads because the fs switches from a sequential read to a random read (which typically turns off pre-fetching and other optimizations enabled in sequential reads). It might be worth considering if reusing the stream is worth it.

turboFei commented 1 month ago

Thanks parthchandra for the comments.

In our company's internally managed Spark, we reuse the input stream for Parquet files.

Before that change:

A Spark task opens the file multiple times to read the footer and the data.

When the HDFS NameNode is under high pressure, these opens cost time.

After that change, it opens the Parquet file only once.
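
To make the open-count difference concrete, here is a minimal, JDK-only sketch. It does not use the parquet-java API and is not the code in this PR: `StreamReuseSketch`, `open`, and `countOpens` are hypothetical names, the toy file only imitates the Parquet tail layout (footer bytes, a 4-byte little-endian footer length, then the `PAR1` magic), and the "before" case is assumed to be one open for the footer plus one for the data.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.SeekableByteChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.util.Arrays;
import java.util.concurrent.atomic.AtomicInteger;

class StreamReuseSketch {
    static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);
    // Hypothetical counter: each open() stands in for one open-file RPC to the NameNode.
    static final AtomicInteger OPEN_CALLS = new AtomicInteger();

    static SeekableByteChannel open(Path file) throws IOException {
        OPEN_CALLS.incrementAndGet();
        return Files.newByteChannel(file, StandardOpenOption.READ);
    }

    static byte[] readFully(SeekableByteChannel ch, long pos, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        ch.position(pos);
        while (buf.hasRemaining()) {
            if (ch.read(buf) < 0) throw new IOException("unexpected end of file");
        }
        return buf.array();
    }

    // The file ends with [footer bytes][4-byte little-endian footer length]["PAR1"].
    static byte[] readFooter(SeekableByteChannel ch) throws IOException {
        long size = ch.size();
        byte[] tail = readFully(ch, size - 8, 8);
        if (!Arrays.equals(Arrays.copyOfRange(tail, 4, 8), MAGIC)) {
            throw new IOException("missing trailing magic");
        }
        int footerLen = ByteBuffer.wrap(tail, 0, 4).order(ByteOrder.LITTLE_ENDIAN).getInt();
        return readFully(ch, size - 8 - footerLen, footerLen);
    }

    // The data sits right after the leading magic, so after a footer read this seeks backwards.
    static byte[] readData(SeekableByteChannel ch, int len) throws IOException {
        return readFully(ch, MAGIC.length, len);
    }

    static void check(byte[] actual, byte[] expected) {
        if (!Arrays.equals(actual, expected)) throw new IllegalStateException("unexpected bytes");
    }

    // Writes a toy file, reads footer then data, and returns how many opens that took.
    static int countOpens(boolean reuseStream) {
        byte[] data = "row-group-bytes".getBytes(StandardCharsets.US_ASCII);
        byte[] footer = "footer-metadata".getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(MAGIC.length + data.length + footer.length + 8)
                .order(ByteOrder.LITTLE_ENDIAN);
        buf.put(MAGIC).put(data).put(footer).putInt(footer.length).put(MAGIC);
        try {
            Path file = Files.createTempFile("toy", ".parquet");
            try {
                Files.write(file, buf.array());
                int before = OPEN_CALLS.get();
                if (reuseStream) {
                    // One open: the same stream serves the footer read and the data read.
                    try (SeekableByteChannel ch = open(file)) {
                        check(readFooter(ch), footer);
                        check(readData(ch, data.length), data);
                    }
                } else {
                    // Assumed "before" behaviour: one open per read.
                    try (SeekableByteChannel ch = open(file)) {
                        check(readFooter(ch), footer);
                    }
                    try (SeekableByteChannel ch = open(file)) {
                        check(readData(ch, data.length), data);
                    }
                }
                return OPEN_CALLS.get() - before;
            } finally {
                Files.deleteIfExists(file);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("separate streams: " + countOpens(false) + " opens");
        System.out.println("shared stream:    " + countOpens(true) + " open");
    }
}
```

The sketch only counts opens. It does not model the read-ahead penalty of the backward seek raised earlier in this thread, so it says nothing about which approach is faster on a given file system.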

turboFei commented 1 month ago

This is from testing done 3 years ago on Spark 2.3.

image

It reduces 3/2 HDFS RPC requests to the NameNode.

And after this Spark community patch, [SPARK-42388][SQL] Avoid parquet footer reads twice in vectorized reader (https://github.com/apache/spark/pull/39950), the solution might reduce 1/2 HDFS RPC requests.

wgtmac commented 1 month ago

It looks reasonable to me and users can choose their best fit.

cc @gszadovszky @steveloughran

wangyum commented 3 weeks ago

@wgtmac @gszadovszky Could we merge this PR?

wangyum commented 3 weeks ago

Thank you all.