apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0

GH-452: Clarify use of RowGroup.ordinal field #453

Closed ggershinsky closed 1 month ago

ggershinsky commented 1 month ago

Encrypted files use three types of ordinals: row group, column, page. All three are simple local counters in both writers and readers. In addition, the row group ordinal is stored in the parquet footer (RowGroup.ordinal field). Parquet implementors can benefit from a clarification on the reason for and intended use of this field.

ggershinsky commented 1 month ago

cc @mapleFU @pitrou

ggershinsky commented 1 month ago

Just curious:

  1. If multiple files are being merged, would this be merged with the same id, or should it be rewritten?

Each encrypted parquet file has a unique file id, which is used to sign every module of the file (to ensure that modules are not swapped, etc.). Also, each file typically has a unique encryption key. Therefore, a merged file needs a new id, new row group ordinals, and a new key, and each module must be re-encrypted with the new key / AAD.
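As a rough illustration of the renumbering part only, here is a minimal Python sketch. The names `RowGroupPlan` and `plan_merge` are hypothetical, the 8-byte random file id is an assumption, and the actual decryption and re-encryption of modules is left out entirely.

```python
import os
from dataclasses import dataclass


@dataclass
class RowGroupPlan:
    source_file: str     # input file the row group comes from
    source_ordinal: int  # its ordinal inside that input file
    new_ordinal: int     # its ordinal inside the merged file


def plan_merge(row_group_counts):
    """Assign fresh ordinals to all row groups of a merged file.

    row_group_counts maps an input file name to its number of row groups.
    Returns a new file id and the list of per-row-group assignments.
    """
    # Assumption: the merged file gets a fresh random id (8 bytes here).
    new_file_id = os.urandom(8)
    plan = []
    next_ordinal = 0
    for name, count in row_group_counts.items():
        for source_ordinal in range(count):
            # Ordinals restart from 0 and run across all inputs, so an
            # ordinal from a source file is generally not reused as-is.
            plan.append(RowGroupPlan(name, source_ordinal, next_ordinal))
            next_ordinal += 1
    return new_file_id, plan


if __name__ == "__main__":
    file_id, plan = plan_merge({"a.parquet": 2, "b.parquet": 3})
    for entry in plan:
        print(entry)
```

Every module would then be decrypted with its source key and AAD, and re-encrypted with the new key and an AAD built from the new file id and the new ordinals.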

  2. Is this only required when an AAD suffix is used?

The row group ordinal is part of the AAD suffix in most modules.
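To make this concrete, here is a minimal Python sketch of how a module AAD suffix could be assembled. It follows my reading of the Parquet modular encryption specification (file id, then a 1-byte module type, then 2-byte little-endian ordinals); the constant and function names are hypothetical, and only a few module types are listed, so the specification should be checked for the authoritative layout.

```python
import struct

# Module type codes, as I understand them from the encryption spec
# (assumption: only a subset is shown here).
FOOTER = 0
COLUMN_META_DATA = 1
DATA_PAGE = 2
DICTIONARY_PAGE = 3
DATA_PAGE_HEADER = 4


def module_aad_suffix(file_id, module_type,
                      row_group_ordinal=0, column_ordinal=0, page_ordinal=0):
    """Build the AAD suffix for one module (hypothetical helper)."""
    suffix = file_id + bytes([module_type])
    if module_type == FOOTER:
        # The footer is not tied to a row group, so it carries no ordinals.
        return suffix
    # Row group and column ordinals: 2-byte shorts, little endian.
    suffix += struct.pack("<h", row_group_ordinal)
    suffix += struct.pack("<h", column_ordinal)
    if module_type in (DATA_PAGE, DATA_PAGE_HEADER):
        # Only data pages and their headers also carry a page ordinal.
        suffix += struct.pack("<h", page_ordinal)
    return suffix


if __name__ == "__main__":
    fid = bytes(8)  # placeholder file id for the demo
    print(module_aad_suffix(fid, DATA_PAGE, 1, 2, 3).hex())
```

Because the row group ordinal is baked into the suffix, a module moved to a different row group position fails authentication when decrypted with the AAD for that new position.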