Iceberg table not able to read data from S3 after few hours using Athena .

ajsalunkhe commented 7 months ago

Apache Iceberg version

1.2.0

Query engine

Athena

Please describe the bug 🐞

I am able to write data to iceberg and post that if I run a SELECT query the data is displayed correctly. However when I run the same query on the next day, there are no records to display(even after doing a AS OF timestamp query).

I checked on the s3 location and the data snapshot still exist there but not fetched via Athena query.

The table is there in GlueCatalog with Type as Iceberg. Also tried to drop and recreate table still same problem exist. Table was created using Athena boto3 client.

Any help will be appreciated.

Thanks !! !

nastra commented 7 months ago

@ajsalunkhe it's not clear from the description whether this is an Iceberg or an Athena-related issue. Can you elaborate and add a few more details which queries you ran exactly that resulted in no records?

You could try and run https://iceberg.apache.org/docs/latest/spark-queries/#snapshots (SELECT * FROM catalog.ns.table.snapshots) to see whether the snapshots changed on that table.

ajsalunkhe commented 7 months ago

We are creating iceberg table using below sql query using Athena in Glue Catalog:

CREATE TABLE db_name.table_name(col1 string, col2 string, col3 string) PARTITIONED BY (col2,col3) LOCATION 's3://location' TBLPROPERTIES ('table_type'='ICEBERG','format'='parquet', 'write_compression'='snappy')

Please let me know if anything is missing in creation of table. Our finding is table is visible in glue catalog also the data is sitting on s3 but no data is returned by Athena query or even via EMR.

Checked for snapshots when data was loaded and visible immediately - we were able to see snapshots.

+--------------------+-------------------+-------------------+---------+--------------------+--------------------+ | committed_at| snapshot_id| parent_id|operation| manifest_list| summary| +--------------------+-------------------+-------------------+---------+--------------------+--------------------+ |2024-02-08 06:37:...|6359389328078104972| null|overwrite|s3:/tru.. .|{spark.app.id -> ...| |2024-02-08 07:05:...| 251915350753279685|6359389328078104972|overwrite |s3://tru...|{spark.app.id -> ...| +--------------------+-------------------+-------------------+---------+--------------------+--------------------+

When data is not visible after few hours the result shows no snapshots history.

nastra commented 7 months ago

@ajsalunkhe at this point it's difficult to say what's going wrong here. Do you have any particular bucket policies enabled or anything else that would clean up files?

ajsalunkhe commented 7 months ago

The files don't get cleaned up, I can still see them at S3 location, while the table doesn't loads it on querying.

nastra commented 7 months ago

It seems suspicious that there are no snapshots anymore. Iceberg writes a new snapshot on any operations that would modify data. Can you please share your full catalog configuration? I don't know exactly how to help at this point, since this doesn't seem related to Iceberg itself as Iceberg doesn't just lose data, so something else must be happening.

ajsalunkhe commented 7 months ago

Just to share more details, using below pyspark code to write data to Iceberg table on EMR cluster.

dataframe.writeTo("..").overwritePartitions()

sham-hq commented 2 months ago

Hi, We investigated this and found that, there is one AWS sync job running by Admin team which syncs the Glue Catalog tables with central Catalog. And during that sync, that job removes the Iceberg table type from Glue Catalog.

So, all data and metadata is available at S3, but table forgets its identity as Iceberg table. And because of this, queries against table become unresponsive.

Need to check, is there any command in Iceberg, running that will re-establish tables identity and we can enable query again? Thanks

Fokko commented 2 months ago

@jackye1995 do you have a moment to shed some light on this?

sham-hq commented 2 months ago

Yes sure. We have created Iceberg type table in Glue Catalog. Data is residing at S3. Our jobs are writing data to this table at every one hour and we can also query the data.

Then at every 24 hour a sync job runs by Foundation team, which syncs tables from all AWS account to central Glue Catalog(I am following up with Foundation team as what they do in their job, so that I can look out for solution). After this job runs, our sql query to Iceberg table returns no data. Table exists but no data get returned. When we check S3 path of table, old and new all data is there.

My understanding is that job run by Foundation team, removes the identity of our Iceberg table. So, when we query the table, it is not able to understand the type of table, and because of same returning no data. And nothings changes in Glue Catalog for table metadata.

When again we run our job, data get created in table and we are able to query data until next time Foundation team's job runs. But the problem is, when we write data again, we can only query the newly written data not the older data. And I am looking out for a command which can help Iceberg table to understand its previous metadata and respond to the query with all data, till the time Foundation team does their fix.

Please let me know if you need some specific details.

ericlgoodman commented 2 months ago

My understanding is that job run by Foundation team, removes the identity of our Iceberg table.

And nothings changes in Glue Catalog for table metadata.

These two statements appear to be in conflict with each other. Athena determines whether to read your table through the Iceberg connector based on the presence of table_type = ICEBERG in your Glue table metadata.

Can you be more specific about what you mean when you say that the "identity" of the table is being removed? It would be even better to provide clear steps to reproduce the problem you're facing.

abhiips07 commented 1 month ago

Hi, Is this issue resolved? I m also using AWS Athena for creating the Iceberg Table. My table gets created but when i run Select on table I get no result.

CREATE TABLE elastic_db.configdata_iceberg_table ( adaptername string, auditid bigint ) LOCATION 's3://table-loc' TBLPROPERTIES ( 'table_type'='iceberg', 'format'='parquet', 'write_compression'='snappy' );

.

nastra commented 1 month ago

@jackye1995 could you take a look at this please?

yashgangrades commented 3 days ago

I am also facing the same issue. I am using Glue to create an Iceberg table from raw data in S3, but when I run a SELECT * query, it returns 0 results.

apache / iceberg