Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

`add_files` procedure allows importing NULL on NOT NULL columns #10742

Open ebyhr opened 2 months ago

ebyhr commented 2 months ago

Apache Iceberg version

None

Query engine

None

Please describe the bug 🐞

Steps to reproduce:

CREATE TABLE test_hive_null STORED AS PARQUET AS SELECT CAST(NULL AS integer) x;
CREATE TABLE test_iceberg_not_null (x int NOT NULL) USING iceberg;

CALL spark_catalog.system.add_files(table => 'default.test_iceberg_not_null',  source_table => 'default.test_hive_null');

SELECT * FROM default.test_iceberg_not_null;
0

As far as I can tell, TestAddFilesProcedure has no test covering this case, so I assume this is unintended behavior.

nk1506 commented 2 months ago

This bug is interesting: the query returns the value as 0, while the column metrics still record it as null.

{"x":{"column_size":31,"value_count":1,"null_value_count":1,"nan_value_count":null,"lower_bound":null,"upper_bound":null}}

As expected, a regular insert of a null value into this column is not allowed.

@RussellSpitzer, please share your thoughts here. If this is a bug, I would be happy to help resolve it.

RussellSpitzer commented 2 months ago

We don't do any validation on any of the columns during add_files, so this is a place where we could add some safety code. It's not so much a bug as an area we haven't really looked at. For example, if you add files that don't match the columns of the table, we also just let that happen.
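
(Editor's illustration, not Iceberg code.) A rough sketch of the kind of column check described above, assuming columns are compared by name only. The function name `compare_columns` and its inputs are hypothetical. Iceberg itself resolves columns through field IDs or a name mapping, so a real check would need to account for that.

```python
def compare_columns(table_columns, file_columns):
    """Report columns that differ between a table schema and a data file.

    Hypothetical helper: both arguments are plain lists of column names.
    """
    table_set, file_set = set(table_columns), set(file_columns)
    return {
        # Columns the table declares but the file does not contain.
        "missing_in_file": sorted(table_set - file_set),
        # Columns the file contains but the table does not declare.
        "extra_in_file": sorted(file_set - table_set),
    }


if __name__ == "__main__":
    # A file with an extra column "y" compared against a table with only "x".
    print(compare_columns(["x"], ["x", "y"]))
```

A validation step could refuse the import, or only warn, when either list is non-empty. Which of the two is appropriate is a design decision this thread does not settle.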

nk1506 commented 2 months ago

IMO, we should at least add a null-check validation. Otherwise add_files can violate the constraints in the table definition. Extra columns in the Parquet files are ignored in the column metrics, and they are removed after rewrite_data_files. Parquet files with null values, however, are always accepted, which can give wrong results, e.g. for a count of zeros. WDYT?
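
(Editor's illustration, not Iceberg code.) A minimal sketch of the null check proposed above, assuming the per-column metrics are available as JSON in the shape shown earlier in this thread. The function name `find_null_violations` and the list of required columns are hypothetical. Missing metrics are treated as "no nulls known", which a real implementation might want to handle more strictly.

```python
import json

# Metrics for the imported file, copied from the comment earlier in this thread.
METRICS_JSON = (
    '{"x":{"column_size":31,"value_count":1,"null_value_count":1,'
    '"nan_value_count":null,"lower_bound":null,"upper_bound":null}}'
)


def find_null_violations(required_columns, metrics_json):
    """Return {column: null_count} for NOT NULL columns whose metrics report nulls."""
    metrics = json.loads(metrics_json)
    violations = {}
    for column in required_columns:
        # Absent columns or absent counts fall back to 0 (an assumption of this sketch).
        null_count = (metrics.get(column) or {}).get("null_value_count") or 0
        if null_count > 0:
            violations[column] = null_count
    return violations


if __name__ == "__main__":
    # "x" is declared NOT NULL in test_iceberg_not_null.
    print(find_null_violations(["x"], METRICS_JSON))
```

For the file from the reproduction steps, the check flags column x with one null value, so an add_files validation built on this idea could reject the import before the file is committed.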