joeytman opened 5 months ago
Just chiming in that I am also experiencing this same bug using the JDBC catalog and MinIO, on org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1. It happens with either 'files' or 'Files'; case does not seem to matter.
The table seems to create without issue when using df.writeTo('namespace.files').createOrReplace(), but it is not queryable from Spark. The table files DOES appear in the JDBC iceberg_tables data.
As a notable data point, reading it from Trino with their Iceberg connector works fine (though they have their own unique annoyance of not accepting tables with uppercase names).
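For reference, the JDBC-catalog-plus-MinIO setup described in this comment would typically be configured along these lines. This is a sketch, not copied from the reporter's environment: the catalog name `my_catalog`, JDBC URI, warehouse path, and MinIO endpoint are all assumptions.

```shell
# Hypothetical Spark launch for an Iceberg JDBC catalog backed by MinIO.
# Catalog name, JDBC URI, warehouse path, and endpoint are made up.
spark-shell \
  --packages org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.6.1 \
  --conf spark.sql.catalog.my_catalog=org.apache.iceberg.spark.SparkCatalog \
  --conf spark.sql.catalog.my_catalog.catalog-impl=org.apache.iceberg.jdbc.JdbcCatalog \
  --conf spark.sql.catalog.my_catalog.uri=jdbc:postgresql://db:5432/iceberg \
  --conf spark.sql.catalog.my_catalog.warehouse=s3://warehouse/ \
  --conf spark.sql.catalog.my_catalog.io-impl=org.apache.iceberg.aws.s3.S3FileIO \
  --conf spark.sql.catalog.my_catalog.s3.endpoint=http://minio:9000
```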
Apache Iceberg version
1.5.2 (latest release)
Query engine
Spark
Please describe the bug 🐞
We have a table in our relational database named files. When we ingest the table into our data lake, we would like to be able to keep the name files for the table. However, it seems that tables cannot be named files, history, etc. We use Hive metastore and SparkSessionCatalog.
This issue is reproducible via both SparkSQL and via Spark Scala job.
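Since the issue reproduces from SparkSQL alone, a minimal repro might look like the following. The `db` namespace and the column schema are assumptions for illustration, not taken from the original job:

```sql
-- Hypothetical minimal repro: creating the table succeeds...
CREATE TABLE db.files (id BIGINT, payload STRING) USING iceberg;

-- ...but reading it back fails, since `files` is also the name of an
-- Iceberg metadata table.
SELECT * FROM db.files;
```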
For the Spark Scala job, we're using this JAR and running on EMR. We have some code like:
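The job code was not captured above; a minimal sketch of the kind of job described, assuming a `DataFrame` already loaded from the relational source and a destination table name passed in as an argument (all names here are illustrative):

```scala
// Hypothetical sketch of the ingest job described in this issue.
// Assumes a SparkSession already configured with SparkSessionCatalog,
// and that `dest` (e.g. "db.files") arrives as a job argument.
import org.apache.spark.sql.{DataFrame, SparkSession}

object IngestJob {
  def run(spark: SparkSession, df: DataFrame, dest: String): Unit = {
    // Works for most destination names; per this report, it breaks when
    // the trailing table name collides with an Iceberg metadata table
    // name such as "files" or "history".
    df.writeTo(dest)
      .using("iceberg")
      .createOrReplace()
  }
}
```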
This code works as long as the name of the table is not equal to files or another metadata table name. It correctly produces a v2 Iceberg table and writes the data. However, when the table is named files, the table is created in HMS successfully and a metadata file is written, but then the write fails with:

I then went into the spark-sql CLI on EMR (using the EMR-provided JAR for convenience), as follows:
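For context, launching spark-sql against a Hive metastore with SparkSessionCatalog typically looks like the following. These conf values are assumptions about the setup described, not copied from the report:

```shell
# Hypothetical spark-sql launch with Iceberg's SparkSessionCatalog over Hive.
spark-sql \
  --conf spark.sql.extensions=org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions \
  --conf spark.sql.catalog.spark_catalog=org.apache.iceberg.spark.SparkSessionCatalog \
  --conf spark.sql.catalog.spark_catalog.type=hive
```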
From there, I tried to look at the table created by my job vs my other Iceberg tables written by the same job.
For other tables, I could successfully run SHOW CREATE TABLE and see the Iceberg v2 table:

However, when I tried to query the files table, I saw:

As if the table was not really an Iceberg table.
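A likely source of the ambiguity, sketched here with hypothetical names: Spark resolves a trailing `.files` identifier as an Iceberg metadata table of the preceding table, so a real table literally named `files` shares its identifier with the metadata-table syntax:

```sql
-- Iceberg metadata table query: lists the data files of db.events
-- (the `db.events` table is illustrative).
SELECT file_path, record_count FROM db.events.files;

-- A real table named `files` uses that same trailing identifier,
-- which is where the resolution conflict described here arises.
SELECT * FROM db.files;
```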
I then had the idea to submit the same job with the exact same args, modifying only the table name argument to call the destination table files_temp instead of files. To my surprise, the job succeeded.

From the spark-sql CLI, I noticed that I was able to interact with my files_temp table as expected:

It looked exactly as expected. My hope was that the bug was in my Scala Spark bootstrap job, and that I could simply rename files_temp to files and then it would work. However, renaming the table to files immediately breaks it and reproduces the issue:

By renaming the table back to something other than files
, it is able to be interpreted correctly.

After this, I even tried using SparkCatalog instead of SparkSessionCatalog, and when I renamed the table to iceberg.files, SparkCatalog was unable to even find the table, as if it was a non-Iceberg table:
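The rename round-trip described above can be reproduced with plain SQL along these lines (the `db` namespace and `files_temp` name are assumptions matching the narrative):

```sql
-- Renaming a working table onto the reserved name triggers the bug...
ALTER TABLE db.files_temp RENAME TO db.files;

-- ...and renaming it back to any non-reserved name restores it.
ALTER TABLE db.files RENAME TO db.files_temp;
```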