`add_files` procedure allows importing NULL on NOT NULL columns

apache / iceberg

Apache Iceberg

https://iceberg.apache.org/

Apache License 2.0

`add_files` procedure allows importing NULL on NOT NULL columns #10742

Open ebyhr opened 2 months ago

ebyhr commented 2 months ago

Apache Iceberg version

None

Query engine

None

Please describe the bug 🐞

Steps to reproduce:

CREATE TABLE test_hive_null STORED AS PARQUET AS SELECT CAST(NULL AS integer) x;
CREATE TABLE test_iceberg_not_null (x int NOT NULL) USING iceberg;

CALL spark_catalog.system.add_files(table => 'default.test_iceberg_not_null', source_table => 'default.test_hive_null');

SELECT * FROM default.test_iceberg_not_null;
0

As far as I can tell, there is no relevant test in TestAddFilesProcedure, so I assume this is unexpected behavior.

Willingness to contribute

[ ] I can contribute a fix for this bug independently
[ ] I would be willing to contribute a fix for this bug with guidance from the Iceberg community
[X] I cannot contribute a fix for this bug at this time

nk1506 commented 2 months ago

This bug is interesting: the query returns the value as zero, while the column metrics record it as null.

{"x":{"column_size":31,"value_count":1,"null_value_count":1,"nan_value_count":null,"lower_bound":null,"upper_bound":null}}

And, as expected, a regular insert with a null value is not allowed.

@RussellSpitzer, please share your thoughts here. If this is a bug, I would be happy to help resolve it.

RussellSpitzer commented 2 months ago

We don't do any validation on any of the columns during add_files, so this is a place where we could add some safety code. So it's not so much a bug as an area we haven't really looked at. For example, if you add files that don't match the columns of the table, we also just let that happen.
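
To make the column-mismatch case concrete, here is a minimal sketch in Python of the kind of check that could run before files are added. It is not Iceberg code: `find_schema_mismatches` and its plain lists of column names are hypothetical, chosen only to illustrate the idea.

```python
def find_schema_mismatches(table_columns, file_columns):
    """Compare the column names of a table with those found in a data file.

    Hypothetical helper for illustration; both arguments are plain lists
    of column names, not Iceberg schema objects.
    """
    table_set = set(table_columns)
    file_set = set(file_columns)
    return {
        # Columns the table declares but the file does not contain.
        "missing_in_file": sorted(table_set - file_set),
        # Columns the file contains but the table does not declare.
        "extra_in_file": sorted(file_set - table_set),
    }


# Example: the table has only column x, the file also carries a column y.
result = find_schema_mismatches(["x"], ["x", "y"])
print(result)
```

Whether a mismatch should fail the call or only produce a warning is a design choice this sketch leaves open.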

nk1506 commented 2 months ago

IMO, we should at least add a null check. Otherwise, add_files can violate the table definition's constraints. Extra columns in the Parquet files are ignored in the column metrics, and they are removed after rewrite_data_files. Null values in added Parquet files, however, are always kept, and because they are read back as zero, they can give wrong results, for example when counting zero values. WDYT?
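
A rough sketch of such a null check, assuming it could rely on per-file column metrics shaped like the JSON quoted earlier in this thread. The function name `find_not_null_violations` and the list of required column names are hypothetical; this is plain Python for illustration, not the Iceberg API.

```python
import json


def find_not_null_violations(required_columns, metrics_json):
    """Return (column, null_count) pairs for required columns that contain nulls.

    Hypothetical helper: `metrics_json` is assumed to map column names to
    metrics that include a `null_value_count` field.
    """
    metrics = json.loads(metrics_json)
    violations = []
    for name in required_columns:
        column_metrics = metrics.get(name)
        if column_metrics is None:
            # No metrics for this column; this sketch cannot decide either way.
            continue
        null_count = column_metrics.get("null_value_count")
        if null_count is not None and null_count > 0:
            violations.append((name, null_count))
    return violations


# Metrics as reported above for the file behind test_hive_null.
metrics_json = (
    '{"x":{"column_size":31,"value_count":1,"null_value_count":1,'
    '"nan_value_count":null,"lower_bound":null,"upper_bound":null}}'
)
violations = find_not_null_violations(["x"], metrics_json)
print(violations)
```

With the metrics from this issue, column x is reported with one null value, so a check along these lines could reject the file before it is added to a table where x is NOT NULL.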