Explained more in the issue, but in short: In Substrait consumer we check schemas of the input dataset and the Substrait input relation using logically_equivalent_names_and_types(..). This then calls datatype_is_logically_equal(..) on all fields, which can fail if the technical inner fields of a list or map have differing names. That happens to be the case when reading lists from parquet, as the parquet reader uses "element" as the name vs DF (incl. the substrait consumer) mostly using "item".
I think this makes sense, since arguably the names here shouldn't matter, and since Arrow doesn't mandate any specific names for these fields, we should ignore them.
What changes are included in this PR?
Ignore technical inner fields' names when comparing data types for logical equivalence.
Arguably we should ignore these names in all equivalence testing. That's a bigger change and might be hard to even enumerate all the places to check, so I only did the minimal thing I need here, but if it'd be preferred, I can try to expand to other cases as well - at least datatype_is_semantically_equal.
Which issue does this PR close?
Closes #13437
Rationale for this change
Explained more in the issue, but in short: In Substrait consumer we check schemas of the input dataset and the Substrait input relation using
logically_equivalent_names_and_types(..)
. This then callsdatatype_is_logically_equal(..)
on all fields, which can fail if the technical inner fields of a list or map have differing names. That happens to be the case when reading lists from parquet, as the parquet reader uses "element" as the name vs DF (incl. the substrait consumer) mostly using "item".I think this makes sense, since arguably the names here shouldn't matter, and since Arrow doesn't mandate any specific names for these fields, we should ignore them.
What changes are included in this PR?
Ignore technical inner fields' names when comparing data types for logical equivalence.
Arguably we should ignore these names in all equivalence testing. That's a bigger change and might be hard to even enumerate all the places to check, so I only did the minimal thing I need here, but if it'd be preferred, I can try to expand to other cases as well - at least
datatype_is_semantically_equal
.Are these changes tested?
Added unit test
Are there any user-facing changes?
No