apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Improve Parquet IO Performance within cloud datalakes #2912

Open asfimport opened 1 month ago

asfimport commented 1 month ago

Parquet list/open/read/commit performance can be improved by reducing the number of storage IO operations issued and by making the IO that remains more efficient.

PARQUET-2171 is the first "cloud-first" performance enhancement for Parquet, but many more are possible.

Use Hadoop 3.3+ filesystem APIs when available.

All recent Hadoop FS APIs have been cloud-friendly. For example, the openFile() call lets the caller pass in the file status or length, which saves a HEAD request against the object store, and set random IO as the read policy.
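As a rough illustration of "when available", the sketch below probes for `FileSystem.openFile(Path)` by reflection, so the same code still loads against older Hadoop releases. It uses only the JDK. The class and helper names (`OpenFileSupport`, `hasMethod`, `randomReadOptions`) are hypothetical and not part of Parquet or Hadoop. The two option keys are the standard openFile() option names as I understand them from Hadoop 3.3.5 and later; treat them as an assumption and check them against the Hadoop version in use.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class OpenFileSupport {
  // Assumed standard openFile() option keys (Hadoop 3.3.5+); verify for your release.
  static final String READ_POLICY = "fs.option.openfile.read.policy";
  static final String FILE_LENGTH = "fs.option.openfile.length";

  private OpenFileSupport() {}

  /** True if the class loads and has a public method with this name and parameter types. */
  static boolean hasMethod(String className, String methodName, String... paramClassNames) {
    try {
      Class<?>[] params = new Class<?>[paramClassNames.length];
      for (int i = 0; i < params.length; i++) {
        params[i] = Class.forName(paramClassNames[i]);
      }
      Class.forName(className).getMethod(methodName, params);
      return true;
    } catch (ClassNotFoundException | NoSuchMethodException | LinkageError e) {
      // Class or method missing: the newer API is not on the classpath.
      return false;
    }
  }

  /** Probe for FileSystem.openFile(Path), which was added in Hadoop 3.3. */
  static boolean openFileAvailable() {
    return hasMethod("org.apache.hadoop.fs.FileSystem", "openFile",
        "org.apache.hadoop.fs.Path");
  }

  /** Options a reader might pass to the openFile() builder; length is skipped if unknown (< 0). */
  static Map<String, String> randomReadOptions(long knownLength) {
    Map<String, String> opts = new LinkedHashMap<>();
    opts.put(READ_POLICY, "random");
    if (knownLength >= 0) {
      opts.put(FILE_LENGTH, Long.toString(knownLength));
    }
    return opts;
  }

  public static void main(String[] args) {
    if (openFileAvailable()) {
      // With Hadoop 3.3+ on the classpath, the call would look roughly like:
      //   fs.openFile(path).withFileStatus(status).opt(READ_POLICY, "random").build().get()
      System.out.println("openFile() available; options: " + randomReadOptions(1024));
    } else {
      System.out.println("openFile() not available; fall back to FileSystem.open(path)");
    }
  }
}
```

Passing the options with `opt()` rather than `must()` keeps them advisory, so a filesystem that does not recognise a key can ignore it instead of failing the open.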

Reporter: Steve Loughran / @steveloughran
Assignee: Steve Loughran / @steveloughran

Related issues:

Note: This issue was originally created as PARQUET-2486. Please see the migration documentation for further details.