delta-io / delta

An open-source storage framework that enables building a Lakehouse architecture with compute engines including Spark, PrestoDB, Flink, Trino, and Hive and APIs
https://delta.io
Apache License 2.0
7.52k stars 1.69k forks source link

[BUG] Non-Null Columns Implicitly Interpreted As Invariant #2006

Open emrowirshi opened 1 year ago

emrowirshi commented 1 year ago

Bug

Which Delta project/connector is this regarding?

Describe the problem

We have a table whose schema includes a non-nullable column (example below), and does not explicitly specify any invariants. This table is readable via Spark up until we set the reader version to 3 and the writer version to 7. At that point, reads fail because invariants are not listed in the protocol's writerFeatures, despite the fact that the table doesn't explicitly specify invariants (error code/stack below). Looking at the code, it appears that non-null constraints in the schema are being conflated with invariants, and are then failing a downstream check that if invariants are used, they must also be specified in writerFeatures.

Steps to reproduce

  1. Publish log with non-nullable column, and without invariants:
    
    {"protocol":{"minReaderVersion":1,"minWriterVersion":1}}
    {"metaData":{"id":"92F69D3B-ECE2-494A-BD3F-9683AEE759D9","name":"LunchMenu","format":{"provider":"parquet","options":{}},"schemaString":"{\"type\":\"struct\",\"fields\":[{\"name\":\"Dish\",\"type\":\"string\",\"nullable\":false,\"metadata\":{}}]}","partitionColumns":[],"configuration":{}}}

2. Update the reader and writer versions.
`{"protocol":{"minReaderVersion":3,"minWriterVersion":7,"readerFeatures":["deletionVectors"],"writerFeatures":["deletionVectors"]}}`

3. Attempt to read the table.

#### Observed results

`DeltaTableFeatureException: Unable to operate on this table because the following table features are enabled in metadata but not listed in protocol: invariants.`

#### Expected results

The table should still be readable, as invariants are not used.

#### Further details

Loose code tracing below, just based on some initial digging in the implementation. 

This is the call which produces [InvariantsTableFeature](https://github.com/delta-io/delta/blob/9726f7f2e3922ecdbe4c3c9a0a1c11174193d14d/spark/src/main/scala/org/apache/spark/sql/delta/TableFeature.scala#L420).

And in the definition of [getFromSchema](https://github.com/delta-io/delta/blob/9726f7f2e3922ecdbe4c3c9a0a1c11174193d14d/spark/src/main/scala/org/apache/spark/sql/delta/constraints/Invariants.scala#L73), not null constraints are returned as invariants.

That likely flows up through the [allSupportedFeaturesMap](https://github.com/delta-io/delta/blob/9726f7f2e3922ecdbe4c3c9a0a1c11174193d14d/spark/src/main/scala/org/apache/spark/sql/delta/TableFeature.scala#L323) which is referenced in [extractAutomaticallyEnabledFeatures](https://github.com/delta-io/delta/blob/9726f7f2e3922ecdbe4c3c9a0a1c11174193d14d/spark/src/main/scala/org/apache/spark/sql/delta/actions/actions.scala#L269C7-L269C42).

### Environment information

* Delta Lake version: 2.4
* Spark version: 3.4
* Scala version: 2.12

### Willingness to contribute

The Delta Lake Community encourages bug fix contributions. Would you or another member of your organization be willing to contribute a fix for this bug to the Delta Lake code base?

- [ ] Yes. I can contribute a fix for this bug independently.
- [ ] Yes. I would be willing to contribute a fix for this bug with guidance from the Delta Lake community.
- [x] No. I cannot contribute a bug fix at this time.
allisonport-db commented 1 year ago

@xupefei Can you take a look at this?

xupefei commented 1 year ago

Publish log with non-nullable column, and without invariants

How you did this? Protocol(1, 1) does not support non-null columns. You must have at least Protocol(1, 2).

emrowirshi commented 1 year ago

Understood - that's likely a bug in my implementation, though it seems like it could also be orthogonal to the problem here. I'll test on my end with a writer version of 2, but since the issue is on other engines reading our logs (and not writing to the table), it seems possible there's still a bug here with non-nulls being treated as column invariants. Is there any chance you can follow up on the code pointers above?

xupefei commented 1 year ago

Non-null a is a column invariant. Compared to other invariant rules, not-null is not stored in delta.invariants metadata but as a separate key "nullable":false.

In Delta we set the default protocol to (1, 2), which has been supported by almost all readers. I would suggest you to also use (1, 2) in your implementation.

emrowirshi commented 1 year ago

Got it - thanks for the clarification. Would it be worth it in this case to update the spec, since it's not entirely clear that"nullable": false requires column invariants to be listed in the protocol's features in order to be used?

Another concern is that readers are now enforcing writerFeatures - isn't that too harsh an assert to throw if Spark is acting solely as a reader and not as a writer?