delta-io / delta-rs

A native Rust library for Delta Lake, with bindings into Python
https://delta-io.github.io/delta-rs/
Apache License 2.0
1.98k stars 365 forks source link

fix: change map root type to comply with https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps to use key_value instead of entries #2538

Open sclmn opened 1 month ago

sclmn commented 1 month ago

Description

Change root field name for map to key_value

Related Issue(s)

https://github.com/delta-io/delta-rs/pull/2182

Documentation

https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps

github-actions[bot] commented 1 month ago

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.

roeap commented 1 month ago

@sclmn - thanks for opening this PR.

Would you mind elaborating a bit on the use case / problem you are encountering? Main reason I am asking is that we have been struggling a bit to come up with a system that would be less opinionated about "internal" field names.

Here we are not really configuring names in parquet, but rather arrow, which may be written to parquet. In the arrow world, there is a difference between implementations which name is used, and it actually sometimes does not even matter.

In principle I would be fine merging this, but I would not want to promise, that this name stays like this indefinitely, as it conflicts with arrow, where we have a larger surface area.

sclmn commented 1 month ago

@sclmn - thanks for opening this PR.

Would you mind elaborating a bit on the use case / problem you are encountering? Main reason I am asking is that we have been struggling a bit to come up with a system that would be less opinionated about "internal" field names.

Here we are not really configuring names in parquet, but rather arrow, which may be written to parquet. In the arrow world, there is a difference between implementations which name is used, and it actually sometimes does not even matter.

In principle I would be fine merging this, but I would not want to promise, that this name stays like this indefinitely, as it conflicts with arrow, where we have a larger surface area.

The delta lake protocol defines the type of various fields as map (eg partitionValues) and since the format of the checkpoint is parquet we expect key_value. If you look at the link above it states:

_The outer-most level must be a group annotated with MAP that contains a single field named keyvalue. The repetition of this level must be either optional or required and determines whether the list is nullable.

Various code validates this to some degree (eg arrow-rs ).

roeap commented 1 month ago

certainly, but I believe we may be mixing things up here a bit, so just making sure we are talking about the same things here.

SO the thing we want to change in this PR, only controls how data is represented in memory. The name entries is in accordance with how arrow-rs models map arrays - albeit I think arrow-rs does not require that specific name. IIRC, different arrow implementations made different choices.

SO ultimately I believe it is the responsibility of the parquet writer, to make sure that the "Inner" map field gets the right name according to the spec.

That said, we have seen several instances now, where the mismatch in conventions how these inner fields are named (also applied to the varies lists variants in parquet etc.) became quite bothersome.

As such I'd be keen to understand a little bit better where you are hitting an error based on this name to finally find a sustainable solution how we can handle the various schema conversions.

sclmn commented 1 month ago

I'm trying to fix the on disk result (working on supporting this at BigQuery). The current code generates delta lake checkpoints that have an improper parquet schema. For instance, the tags field looks like: optional group field_id=-1 tags (Map) { repeated group field_id=-1 entries { required binary field_id=-1 key (String); optional binary field_id=-1 value (String); } }

emkornfield commented 1 month ago

In principle I would be fine merging this, but I would not want to promise, that this name stays like this indefinitely, as it conflicts with arrow, where we have a larger surface area.

In regards to this, the Arrow specification is a lot more lenient on the the layout of maps than parquet.

@alamb @tustvold are there any flags in Parquet-rs to make it transparently write map types with compliant naming?

alamb commented 1 month ago

@alamb @tustvold are there any flags in Parquet-rs to make it transparently write map types with compliant naming?

Not that I know of. I think it will typically write whatever is in the arrow schema