apache / parquet-format

Apache Parquet Format
https://parquet.apache.org/
Apache License 2.0
1.81k stars 432 forks source link

GH-465: Clarify backward-compatibility rules on LIST type #466

Open wgtmac opened 2 weeks ago

wgtmac commented 2 weeks ago

Rationale for this change

The C++ reader of parquet-cpp is having a hard time to read Parquet file written by parquet-java with parquet.avro.write-old-list-structure=true and schema below:

optional group a (LIST) {
  repeated group array (LIST) {
    repeated int32 array;
  }
}

See https://github.com/apache/arrow/issues/43994 and https://github.com/apache/arrow/pull/43995

What changes are included in this PR?

Clarify the rules and add an example.

Do these changes have PoC implementations?

Not required

Closes #465

mapleFU commented 2 weeks ago

BTW, can you also point out the Java code in this pr?

mapleFU commented 2 weeks ago

Other LGTM but I think it worths issue a disscussion...

wgtmac commented 2 weeks ago

I have sent a discussion thread to the dev ML. It would be good if you can take a look. Thanks! @emkornfield @pitrou @gszadovszky @rdblue @etseidl @clairemcginty

wgtmac commented 2 weeks ago

There is another issue at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md?plain=1#L696-L712.

MAP is used to annotate types that should be interpreted as a map from keys to values. MAP must annotate a 3-level structure:

<map-repetition> group <name> (MAP) {
  repeated group key_value {
    required <key-type> key;
    <value-repetition> <value-type> value;
  }
}

The outer-most level must be a group annotated with MAP that contains a single field named key_value. The repetition of this level must be either optional or required and determines whether the list is nullable.

Does it mean that MAP cannot be nested in two-level encoded LIST? So the 1st one below is valid and the 2nd one is invalid. What should reader implementation do if reading from a Parquet file with the latter schema? @emkornfield @gszadovszky

 List<Map<String, String>> in three-level list encoding:
 optional group my_list (LIST) {
   repeated group list {
     required group element (MAP) {
       repeated group key_value {
         required binary key (STRING);
         optional binary value (STRING);
       }
     }
   }
 }

 List<Map<String, String>> in two-level list encoding:
 optional group my_list (LIST) {
   repeated group array (MAP) {
     repeated group key_value {
       required binary key (STRING);
       optional binary value (STRING);
     }
   }
 }

Similarly, the schema below should also be invalid according to the spec:

// List<List<String>>: outer list is two-level encoding, inner list is three-level encoding

optional group my_list (LIST) {
  repeated group array (LIST) {
    repeated group list {
      required binary element (STRING);
    }
  }
}
wgtmac commented 6 days ago

I have addressed the comments. Could you help review it again? @rdblue @gszadovszky

etseidl commented 5 days ago

Does it mean a file schema shall not contain both types of lists or that we shall not use annotated lists inside an unannotated one or vice-versa? @rdblue, WDYT?

I think this is addressed here. The intent was to not allow mixing list types within a schema, so I think all of the above.

gszadovszky commented 5 days ago

Does it mean a file schema shall not contain both types of lists or that we shall not use annotated lists inside an unannotated one or vice-versa? @rdblue, WDYT?

I think this is addressed here. The intent was to not allow mixing list types within a schema, so I think all of the above.

Thanks a lot, @etseidl. @wgtmac, maybe, we should articulate this to be 100% clear that at this part we are talking about the entire schema.

etseidl commented 3 days ago

Apologies for muddying the waters, but I'm still trying to get this all clear in my head. I'm wondering if rather than adding a new rule, can we simply modify Rule 3 to say

If the repeated field is a group with one field, is named either "array" or uses the LIST-annotated 
group's name with "_tuple" appended, and would otherwise be a valid 3-level structure as outlined 
above, then the repeated type is the element type and elements are required.

The form that triggered all this discussion

optional group a (LIST) {
  repeated group array (LIST) {
    repeated int32 array;
  }
}

is interpreted by some readers using Rule 3 because the repeated group is named array. If Rule 3 is modified as above, this example would not trigger Rule 3 due to a) the repeated repetition on the inner-most field, and b) the LIST annotation on the repeated group. Using Rule 4, the element type is the repeated field's type, which in this case is a Rule 1 primitive list, yielding a nested 2-level list (List<List<Integer>>) with non-nullable elements at both levels.

Without the LIST annotation on the repeated group, the above becomes

optional group a (LIST) {
  repeated group array {
    repeated int32 array;
  }
}

which is not valid at all. It cannot be a 3-level nor Rule 3 list due to the repeated inner element field. Rule 4 could be construed to say this is a List<OneTuple<List<Integer>>, but that would require mixing LIST annotated and unannotated lists, which has already been forbidden earlier in the specification.

@rdblue objected to an earlier draft similar to what I'm proposing, but the new Rule 3 only works because of the LIST annotation on the repeated group, not because the innermost field is repeated.

wgtmac commented 1 day ago

@etseidl I think we are on the same page that current rule 3 has an issue that it does not forbid the inner-most layer to be repeated. However, I feel that it is not a good idea to add would otherwise be a valid 3-level structure as outlined above to rule 3. The reason is that a normal 3-level structure can simply ignore its synthetic middle layer but the middle layer produces a OneTuple in the rule 3, which may also confuse readers. In addition, it is not that straight-forward. So I agree with @rdblue that it would be good to insert a new rule for the new case. IMO the new rule should be placed above the current rule 3 so we are pretty sure that the inner layer cannot be repeated in the remaining cases (including the rule 3). The only caveat is that the new rule 5 seems to be unreachable.

etseidl commented 13 hours ago

Ok, so the new Rule 3 is in effect saying that if the child of the repeated middle group is repeated, then because 1-level and LIST/MAP annotated schemas cannot mix, the middle group cannot be the type so the child is the type. The side consequence of this is that the middle group has to have a LIST or MAP annotation to allow the 3rd level to be repeated.

So is it true then that the repeated keyword may only be used for the child of a LIST or MAP annotated field when those annotations are in use? And if that's true, then a 2-level structure cannot be repeated as stated above?

In contrast to 3-level structure, the repetition of 2-level structure can be `optional`, `required`, or `repeated`.
wgtmac commented 6 hours ago

Ok, so the new Rule 3 is in effect saying that if the child of the repeated middle group is repeated, then because 1-level and LIST/MAP annotated schemas cannot mix, the middle group cannot be the type so the child is the type. The side consequence of this is that the middle group has to have a LIST or MAP annotation to allow the 3rd level to be repeated

No, the middle level can never be a LIST-annotated or MAP-annotated 3-level structure, because they cannot be repeated. As you have said, the middle level cannot be an unannotated group with a single repeated field due to mixing LIST-annotation and repeated field without annotation. So the only valid case for rule 3 is that the middle level is a LIST-annotated 2-level structure.

So is it true then that the repeated keyword may only be used for the child of a LIST or MAP annotated field when those annotations are in use? And if that's true, then a 2-level structure cannot be repeated as stated above?

You're right. A LIST-annotated 2-level structure can only be repeated when it is a direct child of another LIST-annotated 2-level structure. I had missed the statement that disallows annotated and unannotated types to be mixed and modified these rules back and forth. Now this is pretty clear. Let me reflect these discussions in the PR.