Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

[Bug] Iceberg tables break when they're named any of the metadata table names (e.g. `files`, `history`, `manifests`) #10550

Open joeytman opened 5 months ago

joeytman commented 5 months ago

Apache Iceberg version

1.5.2 (latest release)

Query engine

Spark

Please describe the bug 🐞

We have a table in our relational database named `files`. When we ingest the table into our data lake, we would like to keep the name `files`. However, it seems that Iceberg tables cannot be named `files`, `history`, or any other metadata table name.

We use the Hive metastore and SparkSessionCatalog.

This issue is reproducible both via Spark SQL and via a Spark Scala job.

For the Spark Scala job, we're using this JAR and running on EMR. We have some code like:

import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog

val catalog = new HiveCatalog()
...
val tableId = TableIdentifier.parse(config.table)
// newPartitionSpec is our own helper that builds a PartitionSpec from the schema
val table = catalog.createTable(tableId, schema, newPartitionSpec(schema),
  config.tableLocation, config.tableProps)
...
df.withColumns(extraCols.toMap)
  .writeTo(config.table)
  .options(config.icebergProps)
  .overwritePartitions()

This code works as long as the table name is not `files` or another metadata table name: it correctly produces a V2 Iceberg table and writes the data. However, when the table is named `files`, the table is created in HMS successfully and a metadata file is written, but the write then fails with:

User class threw exception: org.apache.spark.sql.AnalysisException: Cannot write into v1 table: `spark_catalog`.`iceberg`.`files`
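
For context, Iceberg lets you query a table's metadata tables by appending a reserved suffix (files, history, manifests, snapshots, ...) to the table identifier, which is presumably what a base table named `files` collides with. A minimal sketch of the two possible readings of such an identifier (the `users` table here is just the example from further below):

// Iceberg metadata tables are addressed by suffixing a reserved name to a
// table identifier, e.g. the files metadata table of iceberg.users:
spark.sql("SELECT * FROM spark_catalog.iceberg.users.files").show()

// An identifier ending in `files` is therefore ambiguous: it can be read as
// the base table iceberg.files, or as the files metadata table of a table
// named `iceberg`. The failures below suggest resolution does not treat it
// as a plain base table.
spark.sql("SELECT * FROM spark_catalog.iceberg.files").show()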

I then went into the spark-sql CLI on EMR (using the EMR-provided JAR for convenience), as follows:

spark-sql \
--conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
--conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
--conf spark.sql.catalog.spark_catalog.type=hive \
--conf spark.sql.catalog.spark_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
--conf spark.sql.catalog.spark_catalog.warehouse=s3://redacted/redacted \
--conf spark.jars=/usr/share/aws/iceberg/lib/iceberg-spark3-runtime.jar

From there, I compared the table created by my job against other Iceberg tables written by the same job.

For other tables, I could successfully run SHOW CREATE TABLE and see the Iceberg v2 table:

spark-sql (default)> show create table iceberg.users;
CREATE TABLE spark_catalog.iceberg.users (
...
 <the rest of the statement looks normal for a v2 iceberg table>
...

However, when I tried to query the files table, I saw:

spark-sql (default)> show create table iceberg.files;
Failed to execute SHOW CREATE TABLE against table files, which is created by Hive and uses the following unsupported serde configuration
 SERDE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe INPUTFORMAT: org.apache.hadoop.mapred.FileInputFormat OUTPUTFORMAT: org.apache.hadoop.mapred.FileOutputFormat
Please use `SHOW CREATE TABLE files AS SERDE` to show Hive DDL instead.

It was as if the table were not really an Iceberg table.

I then had the idea to submit the same job with the exact same args, modifying only the table name argument to call the destination table `files_temp` instead of `files`. To my surprise, the job succeeded.

From the spark-sql CLI, I noticed that I was able to interact with my `files_temp` table as expected:

spark-sql (default)> show create table iceberg.files_temp;
CREATE TABLE spark_catalog.iceberg.files_temp (
...
 <the rest of the statement looks normal for a v2 iceberg table>
...

It looked exactly as expected. My hope was that the bug was in my Spark Scala bootstrap job, and that I could simply rename `files_temp` to `files` and then it would work. However, renaming the table to `files` immediately breaks it and reproduces the issue:

spark-sql (default)> alter table iceberg.files_temp rename to iceberg.files;

spark-sql (default)> show create table iceberg.files;
Failed to execute SHOW CREATE TABLE against table files, which is created by Hive and uses the following unsupported serde configuration
 SERDE: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe INPUTFORMAT: org.apache.hadoop.mapred.FileInputFormat OUTPUTFORMAT: org.apache.hadoop.mapred.FileOutputFormat
Please use `SHOW CREATE TABLE files AS SERDE` to show Hive DDL instead.

Renaming the table back to something other than `files` lets it be interpreted correctly again:

spark-sql (default)> alter table iceberg.files rename to iceberg.files_temp;

spark-sql (default)> show create table iceberg.files_temp;
CREATE TABLE spark_catalog.iceberg.files_temp (
...
 <the rest of the statement looks normal for a v2 iceberg table>
...
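
One way to isolate where this breaks (a sketch, assuming the same HiveCatalog setup as our bootstrap job; I have not run this) is to load the table through the Iceberg Java API directly, bypassing Spark's identifier resolution. If this succeeds while Spark's SHOW CREATE TABLE fails, the HMS entry and metadata files are intact and only name resolution is at fault:

import org.apache.iceberg.catalog.TableIdentifier
import org.apache.iceberg.hive.HiveCatalog

// Load iceberg.files straight through the Iceberg API, with no Spark
// table resolution in the middle.
val catalog = new HiveCatalog()
catalog.setConf(spark.sparkContext.hadoopConfiguration)
catalog.initialize("hive", java.util.Collections.emptyMap[String, String]())

val table = catalog.loadTable(TableIdentifier.of("iceberg", "files"))
println(table.currentSnapshot())  // intact Iceberg metadata would print a snapshot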

After this, I even tried using SparkCatalog instead of SparkSessionCatalog. When I renamed the table to iceberg.files, SparkCatalog was unable to even find the table, as if it were a non-Iceberg table:

spark-sql (default)> show create table iceberg.files_temp;
CREATE TABLE spark_catalog.iceberg.files_temp (
...
 <the rest of the statement looks normal for a v2 iceberg table>
...

spark-sql (default)> alter table iceberg.files_temp rename to iceberg.files;
Time taken: 0.141 seconds

spark-sql (default)> show create table iceberg.files;
[TABLE_OR_VIEW_NOT_FOUND] The table or view `iceberg`.`files` cannot be found. Verify the spelling and correctness of the schema and catalog.
If you did not qualify the name with a schema, verify the current_schema() output, or qualify the name with the correct schema and catalog.
To tolerate the error on drop use DROP VIEW IF EXISTS or DROP TABLE IF EXISTS.; line 1 pos 18;
'ShowCreateTable false, [createtab_stmt#101]
+- 'UnresolvedTableOrView [iceberg_dev, files], SHOW CREATE TABLE, false
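
If the root cause is that metadata-table suffix handling, then a rename through the Iceberg catalog API, rather than through Spark SQL, might sidestep the broken resolution. An untested sketch, reusing the HiveCatalog instance from the earlier snippet:

// Untested workaround sketch: rename away from the reserved name via the
// Iceberg API, so Spark's identifier resolution never sees `files`.
catalog.renameTable(
  TableIdentifier.of("iceberg", "files"),
  TableIdentifier.of("iceberg", "files_renamed"))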
blakelivingston commented 12 hours ago

Just chiming in that I am also experiencing this same bug using the JDBC catalog and MinIO, with org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1. It happens with either 'files' or 'Files'; case does not seem to matter.

The table seems to create without issue when using df.writeTo('namespace.files').createOrReplace(), but it is not queryable from Spark. The `files` table DOES appear in the JDBC catalog's iceberg_tables data.

As a notable data point, reading the table from Trino with its Iceberg connector works fine (though Trino has its own unique annoyance of not accepting tables with uppercase names).
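
For anyone trying to reproduce the JDBC-catalog variant, a minimal sketch of what I'm describing (`my_catalog` and `ns` are placeholder names for a session configured with the Iceberg JDBC catalog):

// Minimal repro sketch under a JDBC-catalog Spark session.
val df = spark.range(3).toDF("id")
df.writeTo("my_catalog.ns.files").createOrReplace()  // create appears to succeed
spark.table("my_catalog.ns.files").show()            // but the read fails to resolve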