Open wgtmac opened 2 weeks ago
BTW, can you also point out the Java code in this pr?
Other LGTM but I think it worths issue a disscussion...
I have sent a discussion thread to the dev ML. It would be good if you can take a look. Thanks! @emkornfield @pitrou @gszadovszky @rdblue @etseidl @clairemcginty
There is another issue at https://github.com/apache/parquet-format/blob/master/LogicalTypes.md?plain=1#L696-L712.
MAP
is used to annotate types that should be interpreted as a map from keys to values.MAP
must annotate a 3-level structure:<map-repetition> group <name> (MAP) { repeated group key_value { required <key-type> key; <value-repetition> <value-type> value; } }
The outer-most level must be a group annotated with
MAP
that contains a single field namedkey_value
. The repetition of this level must be eitheroptional
orrequired
and determines whether the list is nullable.
Does it mean that MAP cannot be nested in two-level encoded LIST? So the 1st one below is valid and the 2nd one is invalid. What should reader implementation do if reading from a Parquet file with the latter schema? @emkornfield @gszadovszky
List<Map<String, String>> in three-level list encoding:
optional group my_list (LIST) {
repeated group list {
required group element (MAP) {
repeated group key_value {
required binary key (STRING);
optional binary value (STRING);
}
}
}
}
List<Map<String, String>> in two-level list encoding:
optional group my_list (LIST) {
repeated group array (MAP) {
repeated group key_value {
required binary key (STRING);
optional binary value (STRING);
}
}
}
Similarly, the schema below should also be invalid according to the spec:
// List<List<String>>: outer list is two-level encoding, inner list is three-level encoding
optional group my_list (LIST) {
repeated group array (LIST) {
repeated group list {
required binary element (STRING);
}
}
}
I have addressed the comments. Could you help review it again? @rdblue @gszadovszky
Does it mean a file schema shall not contain both types of lists or that we shall not use annotated lists inside an unannotated one or vice-versa? @rdblue, WDYT?
I think this is addressed here. The intent was to not allow mixing list types within a schema, so I think all of the above.
Does it mean a file schema shall not contain both types of lists or that we shall not use annotated lists inside an unannotated one or vice-versa? @rdblue, WDYT?
I think this is addressed here. The intent was to not allow mixing list types within a schema, so I think all of the above.
Thanks a lot, @etseidl. @wgtmac, maybe, we should articulate this to be 100% clear that at this part we are talking about the entire schema.
Apologies for muddying the waters, but I'm still trying to get this all clear in my head. I'm wondering if rather than adding a new rule, can we simply modify Rule 3 to say
If the repeated field is a group with one field, is named either "array" or uses the LIST-annotated
group's name with "_tuple" appended, and would otherwise be a valid 3-level structure as outlined
above, then the repeated type is the element type and elements are required.
The form that triggered all this discussion
optional group a (LIST) {
repeated group array (LIST) {
repeated int32 array;
}
}
is interpreted by some readers using Rule 3 because the repeated group is named array
. If Rule 3 is modified as above, this example would not trigger Rule 3 due to a) the repeated
repetition on the inner-most field, and b) the LIST
annotation on the repeated group. Using Rule 4, the element type is the repeated field's type, which in this case is a Rule 1 primitive list, yielding a nested 2-level list (List<List<Integer>>
) with non-nullable elements at both levels.
Without the LIST
annotation on the repeated group, the above becomes
optional group a (LIST) {
repeated group array {
repeated int32 array;
}
}
which is not valid at all. It cannot be a 3-level nor Rule 3 list due to the repeated
inner element field. Rule 4 could be construed to say this is a List<OneTuple<List<Integer>>
, but that would require mixing LIST
annotated and unannotated lists, which has already been forbidden earlier in the specification.
@rdblue objected to an earlier draft similar to what I'm proposing, but the new Rule 3 only works because of the LIST
annotation on the repeated group, not because the innermost field is repeated
.
@etseidl I think we are on the same page that current rule 3 has an issue that it does not forbid the inner-most layer to be repeated
. However, I feel that it is not a good idea to add would otherwise be a valid 3-level structure as outlined above
to rule 3. The reason is that a normal 3-level structure can simply ignore its synthetic middle layer but the middle layer produces a OneTuple
in the rule 3, which may also confuse readers. In addition, it is not that straight-forward. So I agree with @rdblue that it would be good to insert a new rule for the new case. IMO the new rule should be placed above the current rule 3 so we are pretty sure that the inner layer cannot be repeated
in the remaining cases (including the rule 3). The only caveat is that the new rule 5 seems to be unreachable.
Ok, so the new Rule 3 is in effect saying that if the child of the repeated middle group is repeated, then because 1-level and LIST/MAP annotated schemas cannot mix, the middle group cannot be the type so the child is the type. The side consequence of this is that the middle group has to have a LIST
or MAP
annotation to allow the 3rd level to be repeated
.
So is it true then that the repeated
keyword may only be used for the child of a LIST
or MAP
annotated field when those annotations are in use? And if that's true, then a 2-level structure cannot be repeated
as stated above?
In contrast to 3-level structure, the repetition of 2-level structure can be `optional`, `required`, or `repeated`.
Ok, so the new Rule 3 is in effect saying that if the child of the repeated middle group is repeated, then because 1-level and LIST/MAP annotated schemas cannot mix, the middle group cannot be the type so the child is the type. The side consequence of this is that the middle group has to have a LIST or MAP annotation to allow the 3rd level to be repeated
No, the middle level can never be a LIST-annotated or MAP-annotated 3-level structure, because they cannot be repeated
. As you have said, the middle level cannot be an unannotated group with a single repeated field due to mixing LIST-annotation and repeated field without annotation. So the only valid case for rule 3 is that the middle level is a LIST-annotated 2-level structure.
So is it true then that the repeated keyword may only be used for the child of a LIST or MAP annotated field when those annotations are in use? And if that's true, then a 2-level structure cannot be repeated as stated above?
You're right. A LIST-annotated 2-level structure can only be repeated when it is a direct child of another LIST-annotated 2-level structure. I had missed the statement that disallows annotated and unannotated types to be mixed and modified these rules back and forth. Now this is pretty clear. Let me reflect these discussions in the PR.
Rationale for this change
The C++ reader of parquet-cpp is having a hard time to read Parquet file written by parquet-java with
parquet.avro.write-old-list-structure=true
and schema below:See https://github.com/apache/arrow/issues/43994 and https://github.com/apache/arrow/pull/43995
What changes are included in this PR?
Clarify the rules and add an example.
Do these changes have PoC implementations?
Not required
Closes #465