Open BsoBird opened 9 months ago
@RussellSpitzer @nastra can you help me?
I also found another situation: I have a table that has never had a CALL command executed on it. All I did was run a MERGE INTO once a day, but after an OOM interrupted a write, the table hit the same problem. It's been so long that I didn't keep a log of that scenario.
@chennurchaitanya Please check whether this issue is similar to yours; we can discuss it together here.
When I see this sort of thing, it's usually one of two issues:

1. The user has accidentally run some command which deletes files without going through Iceberg, so the snapshot refers to a file which no longer exists. Since this was done accidentally, the file reference must be removed via a metadata delete to get the table back to a healthy state. Common reasons for this:

2. The query is reading a snapshot which has been expired but not yet removed from cache, or the query was started and then the snapshot was expired. In this case you just refresh the cache and the query will work.
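A "metadata delete" here means a DELETE whose predicate lines up with whole partitions or files, so Iceberg can drop the dangling file references purely from metadata without ever opening the missing data file. A hedged Spark SQL sketch, with made-up catalog, table, and partition values:

```sql
-- Hypothetical names: hadoop_cat.db.events, partitioned by event_date.
-- The predicate must cover entire files/partitions so the delete can be
-- applied as a metadata-only operation that drops the lost file's entry.
DELETE FROM hadoop_cat.db.events
WHERE event_date = DATE '2024-01-15';
```

If the predicate does not align with whole files, Iceberg has to read the affected data in order to rewrite it, which would fail again on the missing file.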
@RussellSpitzer
> User has multiple tables homed in the same directory; remove_orphan_files for one deletes files for the other. This is generally unrecoverable, and you'll notice that many files which should exist according to Iceberg metadata do not.

No. Our catalogs and tables are one-to-one.

> User error, manually deleting a file.

No. No one but me can manipulate Iceberg's data files, and I just run the CALL command every day.

> Third-party TTL system.

No, we don't have one.

> The query is running a snapshot which has been expired but not yet removed from cache, or the query was started and then the snapshot was expired. In this case you just refresh the cache and the query will work.

How do I flush the cache? I've restarted the Spark job. If it's an in-memory cache, restarting should have cleared it, and I do still have the snapshot file.
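For reference, Spark exposes a table-level cache refresh in both SQL and the catalog API; a sketch using the table from this report, with a hypothetical catalog name:

```sql
-- Invalidate Spark's cached metadata and plans for this table,
-- forcing the next query to load the current snapshot.
REFRESH TABLE hadoop_cat.dwd.b_std_category;
```

The same can be done programmatically with spark.catalog.refreshTable("hadoop_cat.dwd.b_std_category").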
I will submit a PR for fix.
Was the MERGE operation committed successfully? If it was, and the committed files were then removed by a remove_orphan_files job, the latest snapshot will refer to lost files.
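One common mitigation for that race is to run remove_orphan_files with an older_than timestamp safely in the past, so files written by recent or in-flight commits are never treated as orphans. A sketch using Iceberg's Spark procedure, with placeholder catalog and table names:

```sql
-- Only files last modified before the given timestamp are candidates
-- for deletion; Iceberg's default is 3 days ago for exactly this reason.
CALL hadoop_cat.system.remove_orphan_files(
  table => 'db.my_table',
  older_than => TIMESTAMP '2024-01-12 00:00:00'
);
```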
@Zhangg7723 Our data table does not have concurrent operations. I am very certain of this.
Step 3: At the time of the OOM, the dwd.b_std_category table was executing this command.
Does that cause any problems?
Apache Iceberg version
1.4.2 (latest release)
Query engine
Spark
Please describe the bug 🐞
Spark 3.4.1.
We found that, in some cases, reading an Iceberg table may throw a FileNotFoundException (Hadoop catalog table). The situation is as follows:
We start executing the MERGE INTO operations on 8 tables at 1:00 a.m. every day. After those operations complete, the following three operations are performed.
Today, while executing the MERGE operation, an OOM occurred, causing the container to be killed.
At the time of the OOM, the dwd.b_std_category table was executing this command.
When we resumed the Spark task, we found that the dwd.b_std_category table could no longer be read.