apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.33k stars 1.2k forks source link

fix: Ignore names of technical inner fields (of List and Map types) when comparing datatypes for logical equivalence #13522

Open Blizzara opened 8 hours ago

Blizzara commented 8 hours ago

Which issue does this PR close?

Closes #13437

Rationale for this change

Explained more in the issue, but in short: In Substrait consumer we check schemas of the input dataset and the Substrait input relation using logically_equivalent_names_and_types(..). This then calls datatype_is_logically_equal(..) on all fields, which can fail if the technical inner fields of a list or map have differing names. That happens to be the case when reading lists from parquet, as the parquet reader uses "element" as the name vs DF (incl. the substrait consumer) mostly using "item".

I think this makes sense, since arguably the names here shouldn't matter, and since Arrow doesn't mandate any specific names for these fields, we should ignore them.

What changes are included in this PR?

Ignore technical inner fields' names when comparing data types for logical equivalence.

Arguably we should ignore these names in all equivalence testing. That's a bigger change and might be hard to even enumerate all the places to check, so I only did the minimal thing I need here, but if it'd be preferred, I can try to expand to other cases as well - at least datatype_is_semantically_equal.

Are these changes tested?

Added unit test

Are there any user-facing changes?

No