
Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Cannot write nullable values to non-null column in the Iceberg Table #9488

Open abharath9 opened 9 months ago

abharath9 commented 9 months ago

The following error is thrown when trying to insert data into an Iceberg table with NOT NULL column constraints: Cannot write nullable values to non-null column 'id'

Versions: Spark 3.2.0 and Iceberg 0.13.0

Here is a sample code to reproduce this issue.

from pyspark.conf import SparkConf
from pyspark.sql import SparkSession

conf = ( 
    SparkConf()
    .setAppName("analyst-hero-spark-write-iceberg")
    .set("spark.sql.adaptive.coalescePartitions.enabled","true") 
    .set("spark.sql.parquet.filterPushdown","true")
    .set("spark.hadoop.fs.s3a.impl.disable.cache","false")
    .set("spark.sql.catalog.spark_catalog", 'org.apache.iceberg.spark.SparkSessionCatalog')
    .set("spark.sql.extensions", 'org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions')
    .set("spark.hadoop.iceberg.hive.lock-timeout-ms", '180')
    .set("spark.hadoop.iceberg.delete-files-on-table-drop", 'true')
    .set("spark.hadoop.iceberg.mr.commit.table.thread.pool.size", '1')
    .set("spark.hadoop.iceberg.mr.commit.file.thread.pool.size", '1')
    .set("spark.hadoop.iceberg.mr.commit.retry.num-retries", '300')
    .set("spark.sql.iceberg.check-nullability", 'false')
)

spark = SparkSession.builder.enableHiveSupport().config(conf=conf).getOrCreate()

spark.sql("CREATE TABLE IF NOT EXISTS sandbox.iceberg_table_mandatory_column_test (id string not null,name string,nationality string) USING iceberg")

spark.sql("show create table sandbox.iceberg_table_mandatory_column_test").select("createtab_stmt").show(truncate=False)

from pyspark.sql import Row
flatData = spark.createDataFrame([Row(id='100', name='bob', nationality='welsh'),Row(id='200', name='john', nationality='british')])

flatData.show(truncate=False)

I tried inserting the data both with a SQL INSERT INTO and with the DataFrame API's insertInto; both raise the same Cannot write nullable values to non-null column 'id' exception.

flatData.createOrReplaceTempView("test_data")
spark.sql("insert into sandbox.iceberg_table_mandatory_column_test select * from test_data")

or

flatData.write.insertInto("sandbox.iceberg_table_mandatory_column_test")

Complete stack trace in the attachment.

error-iceberg-test.txt

nastra commented 9 months ago

@abharath9 it's likely that select * from test_data contains null values, which you are then trying to insert into id which is defined as non-null.

abharath9 commented 9 months ago

@nastra Yes, I am aware of that. How do I write data from optional (nullable) fields into mandatory (non-null) columns? The PR below suggests this is possible by setting "spark.sql.iceberg.check-nullability" to 'false' in the Spark config, but that is not working for me. I wonder if I am missing something or need another config. https://github.com/apache/iceberg/pull/514

Just FYI, the above code works fine if I upgrade to Spark 3.5 and Iceberg 1.4.0, but I need it to work with Spark 3.2.0 and Iceberg 0.13.0.

Thanks, I appreciate any help.

nastra commented 9 months ago

Sorry, it wasn't clear from the description what the goal was, and I overlooked the use of spark.sql.iceberg.check-nullability.

It's difficult to tell why it doesn't work with Iceberg 0.13.0 at this point, because numerous features and fixes have been released since then. We would typically recommend using a newer, maintained Spark/Iceberg version.

github-actions[bot] commented 1 week ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.