apache / parquet-java

Apache Parquet Java
https://parquet.apache.org/
Apache License 2.0

Could we provide a potentially larger InternalParquetRecordWriter.getDataSize #3032

Open dragongu opened 1 month ago

dragongu commented 1 month ago

getDataSize currently returns only an estimate, as the code below shows. Iceberg's rolling file write operation relies on this method to decide when to start a new file, which may result in files that are much smaller than expected.

/**
 * @return the total size of data written to the file and buffered in memory
 */
public long getDataSize() {
  return lastRowGroupEndPos + columnStore.getBufferedSize();
}

Could we provide a potentially larger getDataSize? I can't think of any downsides at the moment.
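
To make the effect concrete, here is a minimal, self-contained sketch of a rolling writer driven by a getDataSize-style estimate. Everything in it is hypothetical: SketchWriter is a toy stand-in and not the real InternalParquetRecordWriter, rollFiles is not Iceberg's code, and the fixed 0.25 compression ratio is an assumption chosen only for illustration. It assumes buffered bytes are counted at their in-memory size while flushed bytes are counted at their smaller on-disk size.

```java
import java.util.ArrayList;
import java.util.List;

public class RollingWriteSketch {

    /** Hypothetical stand-in for a Parquet record writer; not the real class. */
    static class SketchWriter {
        private final double compressionRatio; // assumed on-disk bytes per buffered byte
        private final long rowGroupSize;       // buffered bytes that trigger a row group flush
        private long lastRowGroupEndPos = 0;   // bytes already written to the file
        private long bufferedSize = 0;         // bytes still buffered in memory

        SketchWriter(double compressionRatio, long rowGroupSize) {
            this.compressionRatio = compressionRatio;
            this.rowGroupSize = rowGroupSize;
        }

        void write(long recordBytes) {
            bufferedSize += recordBytes;
            if (bufferedSize >= rowGroupSize) {
                flush();
            }
        }

        /** Moves buffered bytes to the file, shrinking them by the assumed ratio. */
        void flush() {
            lastRowGroupEndPos += Math.round(bufferedSize * compressionRatio);
            bufferedSize = 0;
        }

        /** Same shape as the method quoted above: written bytes plus buffered bytes. */
        long getDataSize() {
            return lastRowGroupEndPos + bufferedSize;
        }

        /** Flushes what is left and returns the final file size. */
        long close() {
            flush();
            return lastRowGroupEndPos;
        }
    }

    /** Rolls to a new file whenever the estimate reaches targetFileSize. */
    static List<Long> rollFiles(int records, long recordBytes, long targetFileSize,
                                double compressionRatio, long rowGroupSize) {
        List<Long> fileSizes = new ArrayList<>();
        SketchWriter writer = new SketchWriter(compressionRatio, rowGroupSize);
        for (int i = 0; i < records; i++) {
            writer.write(recordBytes);
            if (writer.getDataSize() >= targetFileSize) {
                fileSizes.add(writer.close());
                writer = new SketchWriter(compressionRatio, rowGroupSize);
            }
        }
        // Only files closed by a rollover are reported; a trailing partial file is omitted.
        return fileSizes;
    }

    public static void main(String[] args) {
        // Target 1,000,000 bytes, row groups of 800,000 buffered bytes, 1,000-byte records.
        System.out.println(rollFiles(4400, 1_000L, 1_000_000L, 0.25, 800_000L));
    }
}
```

Under these assumed numbers, each file is closed once the estimate reaches 1,000,000 bytes but ends up at 550,000 bytes on disk, so the program prints [550000, 550000]. The gap comes from the buffered portion being counted before it shrinks on flush.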

wgtmac commented 1 month ago

Do you have any concrete suggestion on what value to provide? My concern is that changing the behavior may affect a lot of downstream applications in the wild without notice.