apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.13k stars 1.16k forks source link

Restore nullability support for consuming Substrait fields. #12727

Open westonpace opened 2 weeks ago

westonpace commented 2 weeks ago

Is your feature request related to a problem or challenge?

In https://github.com/apache/datafusion/pull/11130 the Substrait consumer was changed to always produce nullable fields.

However, I'm not entirely sure of the rationale. From that PR it states:

Arrow requires schema to match exactly to literals. Substrait contains nullability separate in literals and types, and the nullability of a literal may vary row-by-row.

What does "Arrow requires schema to match exactly to literals" mean? Arrow doesn't really have a concept of literals that I'm aware of (neither does arrow-rs). So Arrow shouldn't really care if literals are nullable or not.

It appears that Datafusion maps literals to ScalarValue and there is no such thing in Datafusion as a non-nullable literal. So it makes sense to me that nullability is ignored when consuming literals.

However, arrow-rs and Substrait both have a concept of nullable fields and, as far as I can tell, those concepts are compatible. Can we map Substrait nullability to Arrow field nullability?

Describe the solution you'd like

The method from_substrait_type should either return (DataType, bool) or Field (with an empty name). The method from_substrait_named_struct should return a schema that has the nullability set appropriately in fields.

Non-nullable fields can round-trip from DF->Substrait->DF without losing their non-nullability.

Describe alternatives you've considered

No response

Additional context

No response

westonpace commented 2 weeks ago

CC @Blizzara @waynexia

westonpace commented 2 weeks ago

This is not urgent for me and, if everyone agrees, I will try and get this done myself. I'm hoping to make a few changes to Substrait and just wanted to make sure I understood the reasoning before doing so.

Blizzara commented 2 weeks ago

I'm not 100% sure if I remember this correctly, but to the best of my memory:

What does "Arrow requires schema to match exactly to literals" mean? Arrow doesn't really have a concept of literals that I'm aware of (neither does arrow-rs). So Arrow shouldn't really care if literals are nullable or not.

When we create a relation from a VirtualTable using DF's Values relation, we give it a schema and a vec<vec>, where the expr's are literals. Something in Arrow validates that the exprs match the schema exactly, including nullability, IIRC.

It appears that Datafusion maps literals to ScalarValue and there is no such thing in Datafusion as a non-nullable literal. So it makes sense to me that nullability is ignored when consuming literals.

Some ScalarValues, e.g. List, Struct and Map, contain inner Fields which define nullability. This nullability must match the nullability defined for the field in question in the schema, see above.

I think it should be possible to define the nullability properly, but it requires one of two things, either:

  1. the nullability defined in Substrait for literals must match the nullability defined for the schema, for VirtualRelations at least. This does not currently hold true for e.g. substrait-spark, iirc because Spark doesn't have the concept of nullability as part of the literal (eg. for lists/maps/structs). OR
  2. computing the schema first and then passing it to from_substrait_literal (because again, the nullability has to match for VirtualTable/Values). I think I had some code to do that at some point but ended up not committing it as it was then diverging between the "literal for a virtual table (with schema)" and "literal for other reasons (w/o schema)" cases.

Main thing being that it was difficult to get the nullability to correctly match, and the benefits from specifying things as non-nullable are not that large, so it was easiest to just default to nullable for everything.