Kimahriman opened this issue 2 years ago
Hi @Kimahriman, thanks for bringing this to our attention. We will look into this.
@brkyvz do you remember why we picked up this behavior?
I believe this is the behavior that most databases follow. The idea was that if foo.bar is not null, then foo should always be not null. If you want the case that foo can be null, but whenever foo is not null, then bar should also be not null, then you should be using a check constraint such as:
foo is null or foo.bar is not null
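For concreteness, a minimal sketch of how that could be expressed on a Delta table (the table name is hypothetical; Delta supports CHECK constraints via ALTER TABLE ... ADD CONSTRAINT):

# Hypothetical table name; this only checks bar when foo itself is present.
spark.sql("""
    ALTER TABLE my_table
    ADD CONSTRAINT bar_set_when_foo_set
    CHECK (foo IS NULL OR foo.bar IS NOT NULL)
""")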
Spark seems to allow for that behavior of foo being nullable but foo.bar not
@Kimahriman This is done by Delta. I think the current behavior makes sense in the SQL world. Even if foo.bar is not nullable, it can still be null if we allow foo to be nullable. For example, if foo is null, SELECT foo.bar will return null even if foo.bar is not nullable. The current behavior will ensure you never see null returned by foo.bar.
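To see that in action, a quick sketch (assuming an active SparkSession named spark):

spark.sql("SELECT CAST(NULL AS STRUCT<bar: INT>) AS foo").selectExpr("foo.bar").show()
# the single row comes back with bar = NULL, even though bar is declared non-nullable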
Yeah I'm just saying this is valid Spark behavior, so you can hit an exception for valid Spark code. Simplest reproduction:
import pyspark.sql.functions as F
from delta.tables import DeltaTable

test = spark.range(1).withColumn('nested', F.when(F.col('id') < 5, F.struct('id')))
test.printSchema()
root
 |-- id: long (nullable = false)
 |-- nested: struct (nullable = true)
 |    |-- id: long (nullable = false)

DeltaTable.createOrReplace().location('file:///tmp/test').addColumns(test.schema).execute()
test.write.format('delta').mode('append').save('file:///tmp/test')
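# the append above is where the exception is hit: Delta's NOT NULL invariant
# on the nested id field rejects the row where nested itself is null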
And Spark correctly updates the nullability when it needs to:
test.select('nested.*').printSchema()
root
 |-- id: long (nullable = true)
@Kimahriman Sorry, I missed your latest update. Thanks for the example. Yep, Spark will update the nullability automatically. However, for Delta, if users set a not-null constraint, they will likely expect to never see a null value when writing code (such as writing a UDF to handle a nested column value). Hence, we would like to keep the existing behavior.
I'm going to close this as this is the intended behavior. Feel free to reopen this if you have more feedback 😄
if users set a not-null constraint, they will likely expect to never see a null value when writing code
That's the main issue: not-null constraints are automatically applied with no way to disable them. I have been slowly playing around with ways to update the not-null constraint checking to operate in line with how Spark works, but I haven't made much progress on it yet. At the very least it should be an opt-in thing to get around this. I still consider it a bug 😅
Hah, triggered a pipeline failure by commenting on a closed issue I guess
We prefer to avoid creating non-standard behaviors. Could you clarify what use cases may require this non-standard behavior?
I would argue that the current behavior is non-standard and this ticket is to make it (Spark) standard 😛
Basically, what actually triggers this for us is a Scala UDF that takes a nullable string and returns a struct. One of the fields in the struct is non-nullable, because it always has a value iff the string input is not null. The end result in Spark is a nullable struct with a non-nullable struct field.
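Roughly analogous in Python terms (just a sketch of the shape, not our actual UDF; all names are made up):

from pyspark.sql.functions import udf
from pyspark.sql.types import StructType, StructField, StringType

# Made-up example: the struct's 'token' field is declared non-nullable
# because it is always populated whenever the input string is non-null.
result_type = StructType([StructField('token', StringType(), nullable=False)])

@udf(returnType=result_type)
def parse(s):
    return None if s is None else (s.upper(),)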
it always has a value iff the string input is not null
Can you set the nested field nullable and add a check constraint such as nested is null or nested.id is not null to the table instead?
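A rough sketch of what that workaround could look like for the repro above (untested; the path and constraint name are made up):

from pyspark.sql.types import StructType

# Rebuild the schema with the nested field marked nullable (deep copy via
# JSON round-trip), then guard it with a CHECK constraint instead of a
# NOT NULL invariant.
relaxed_schema = StructType.fromJson(test.schema.jsonValue())
relaxed_schema['nested'].dataType['id'].nullable = True
relaxed = spark.createDataFrame(test.rdd, relaxed_schema)
relaxed.write.format('delta').mode('append').save('file:///tmp/test2')

spark.sql("""
    ALTER TABLE delta.`/tmp/test2`
    ADD CONSTRAINT nested_id_set
    CHECK (nested IS NULL OR nested.id IS NOT NULL)
""")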
Yeah, there are workarounds, just less than ideal. You lose the 0.01% speedup you would get by avoiding null-checks on the field hah. I'm less concerned about enforcing the right not-null checks and more about being able to opt in to skip the check altogether, because I know the nullability is correct.
Adding a new behavior, even if it's opt-in, would also add extra work when we extend the not-null constraint behaviors to other engines, and I feel it's not worth it right now.
I will keep this open to hear more voices.
I was bitten by this today. I'm using Scala and have this model:
case class MyRow(id: Int, value: String)
case class Changes(before: Option[MyRow], after: MyRow)
modelling changes between MyRows of the same id. I expected this to work, but I get org.apache.spark.sql.delta.schema.DeltaInvariantViolationException: NOT NULL constraint violated for column: before.id.
I can work around it, but it feels like incorrect behaviour.
Do we know how other engines like Presto/Trino handle nullability of structs? Because Delta essentially doesn't support struct nullability. One option would be, when creating a table or a new field in the schema, to force everything under a nullable struct to be nullable as well, as in the sketch below. This seems like the behavior Delta expects.
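Something like this hypothetical helper is what I mean (structs only, for brevity):

from pyspark.sql.types import StructType, StructField

# Hypothetical: once any ancestor struct is nullable, force every field
# underneath it to be nullable too. Arrays/maps omitted for brevity.
def relax_under_nullable(dt, ancestor_nullable=False):
    if isinstance(dt, StructType):
        fields = []
        for f in dt.fields:
            nullable = f.nullable or ancestor_nullable
            fields.append(StructField(f.name, relax_under_nullable(f.dataType, nullable), nullable))
        return StructType(fields)
    return dt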
Yep. It would be great if anyone familiar with Presto/Trino could provide feedback.
An existing test actually tests for what I think is incorrect behavior: https://github.com/delta-io/delta/blob/master/core/src/test/scala/org/apache/spark/sql/delta/schema/InvariantEnforcementSuite.scala#L186
Here the top struct is nullable but the key field is not, and the current invariant checks only care about the fact that key is non-nullable, so selecting that value (through a series of GetStructFields) must always be non-null. However, it is valid for top to be null, and I think it's more accurate to say that key is never null when top is not null.

So in this test, the first case is a valid test case: top is not null but key is. However, in the second case, top is null, which should be valid behavior and not throw an exception, I believe.

After looking through the code I can see a few ways to make it basically skip checking key in this case, but it might be more ideal, though more complicated, to have it check top first, and only if that's not null, then check key and fail only in that case.
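To spell out the difference, a small sketch of the two check semantics (illustrative only, not Delta's actual invariant code):

import pyspark.sql.functions as F

# id = 5 produces a row where top itself is null.
df = spark.range(6).withColumn('top', F.when(F.col('id') < 5, F.struct(F.col('id').alias('key'))))

# Current semantics: top.key must be non-null on every row, so the
# top-is-null row counts as a violation.
current_violations = df.filter(F.expr('top.key IS NULL'))

# Proposed semantics: only check key when top itself is non-null.
proposed_violations = df.filter(F.expr('top IS NOT NULL AND top.key IS NULL'))

print(current_violations.count(), proposed_violations.count())  # 1 vs 0 here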