apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Iceberg data file Not Found but have an entry in table.files catalog #8338

Open chennurchaitanya opened 1 year ago

chennurchaitanya commented 1 year ago

Apache Iceberg version

1.1.0

Query engine

Spark

Please describe the bug 🐞

My job had been running fine for a long time, and today we got this exception.

We get "Caused by: java.io.FileNotFoundException: No such file or directory" while accessing data from an Iceberg table using the code snippet below:

```scala
val df = spark.read.format("iceberg")
  .option("start-snapshot-id", start_snapshot)
  .option("end-snapshot-id", end_snapshot)
  .load("mytablename")
```

We are using MinIO as our storage backend. There is a file_path entry in mytable.files, but the physical file is not present. As far as I understand, Iceberg has strong write consistency: a snapshot is not committed unless all of its data files have been written to the storage backend.
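To confirm which metadata entries point at files that no longer exist, one option is to cross-check the table's `files` metadata table against the object store. This is only a sketch, assuming a Spark session with the Iceberg catalog already configured; `mytablename` is the table from the snippet above.

```scala
import org.apache.hadoop.fs.Path

// Read the paths Iceberg currently tracks from the files metadata table.
val tracked = spark.read.format("iceberg")
  .load("mytablename.files")
  .select("file_path")

// Probe each tracked path with the Hadoop FileSystem API and report any
// entry whose physical file is gone.
val hadoopConf = spark.sparkContext.hadoopConfiguration
tracked.collect().foreach { row =>
  val p = new Path(row.getString(0))
  val fs = p.getFileSystem(hadoopConf)
  if (!fs.exists(p)) println(s"MISSING: $p")
}
```

Collecting the path list to the driver is fine for small tables; for very large tables the existence check would need to be distributed instead.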

We tried running orphan file removal, but it doesn't help.
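For reference, orphan file removal would typically be invoked via Iceberg's Spark procedure, roughly as below (the catalog name `my_catalog` is a placeholder). Note that `remove_orphan_files` deletes files that exist in storage but are not referenced by table metadata; it cannot repair the opposite case here, where a referenced file is missing from storage, which is consistent with it not helping.

```scala
// Sketch: dry-run orphan file removal via the Spark stored procedure.
// dry_run => true only lists candidates instead of deleting them.
spark.sql(
  """CALL my_catalog.system.remove_orphan_files(
    |  table => 'mytablename',
    |  dry_run => true)""".stripMargin
).show(truncate = false)
```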

Could an Iceberg SME let us know why this would happen and how to resolve it?

nastra commented 1 year ago

Could you please share the full stack trace from when the error happens and the action that you're running when it occurs? An overview of the xyz.files metadata table would also be helpful.

nastra commented 1 year ago

Is there a particular reason that you're using Hadoop's S3AFileSystem? You could switch to S3FileIO when using MinIO. Is it possible that s3a://XXXXXXXXXX/data/event_ts_day=1970-01-01/00007-167930-e397f970-eb12-41ad-9e10-e855c8fd6e53-00001.parquet got deleted in the meantime by another job? Also, can you share the output of xyz.files and your catalog configuration, just in case?
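Switching from S3AFileSystem to Iceberg's native S3FileIO against MinIO is a catalog-level configuration change. A minimal sketch, assuming a Spark catalog named `my_catalog` and placeholder endpoint/credentials (the `iceberg-aws-bundle` jar must be on the classpath):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: configure an Iceberg Spark catalog to use S3FileIO with MinIO.
// Endpoint and catalog name are assumptions; path-style access is usually
// required for MinIO since it does not use virtual-hosted-style buckets.
val spark = SparkSession.builder()
  .appName("iceberg-minio")
  .config("spark.sql.catalog.my_catalog", "org.apache.iceberg.spark.SparkCatalog")
  .config("spark.sql.catalog.my_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
  .config("spark.sql.catalog.my_catalog.s3.endpoint", "http://minio:9000")
  .config("spark.sql.catalog.my_catalog.s3.path-style-access", "true")
  .getOrCreate()
```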

nastra commented 1 year ago

Sorry, it's a bit difficult to tell exactly what's going on without having full access to the logs and knowing what actions were happening at that time.

github-actions[bot] commented 2 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

dorsegal commented 2 months ago

> Is there a particular reason that you're using Hadoop's S3AFileSystem? You could switch to using S3FileIO when using Minio. Is it possible that s3a://XXXXXXXXXX/data/event_ts_day=1970-01-01/00007-167930-e397f970-eb12-41ad-9e10-e855c8fd6e53-00001.parquet got deleted in the meantime by another job? Also can you share the output of xyz.files and your catalog configuration just in case?

I am facing something similar. One of the files got deleted (I'm still trying to figure out how). Is there a way to fix this? I tried rewriting manifest files and data files, but nothing seems to work.
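One way to narrow down how the file disappeared is to find which snapshot originally added it and which operation committed it. A sketch using Iceberg's `entries` and `snapshots` metadata tables, assuming a configured Spark session; the path is a placeholder for the missing file from the stack trace:

```scala
// Sketch: trace the snapshot that added a given data file.
// In the entries metadata table, status = 1 means ADDED.
val missingPath = "<path-of-missing-file>"

val entries   = spark.read.format("iceberg").load("mytablename.entries")
val snapshots = spark.read.format("iceberg").load("mytablename.snapshots")

entries
  .filter(s"data_file.file_path = '$missingPath' AND status = 1")
  .join(snapshots, "snapshot_id")
  .select("snapshot_id", "committed_at", "operation")
  .show(truncate = false)
```

The `committed_at` timestamp and `operation` (append, overwrite, delete, replace) can then be matched against job logs or object-store access logs to identify what removed the file.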