apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0
6.15k stars 2.14k forks source link

flink restore failed with filenotfound #6066

Open chenwyi2 opened 1 year ago

chenwyi2 commented 1 year ago

Apache Iceberg version

0.14.1 (latest release)

Query engine

Flink

Please describe the bug 🐞

We have a flink job that write upsert stream into a partitioned icebergV2 table . When that job get failed, we restart it from the latest checkPoint. But we got that exception: Files does not exists.FileNotFoundException: File does not exist: /rbf/warehouse/cupid_bi.db/ads_qixiao_olap_1min/metadata/c918a379b3cc15d7a8193cf27eb8b473-00000-1-38851-10287.avro and i saw task.log 2022-10-21 16:57:19,188 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committing append with 2 data files and 0 delete files to table icebergCatalog.xxx 2022-10-21 16:57:19,573 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Successfully committed to table icebergCatalog.xx 2022-10-21 16:57:19,573 INFO org.apache.iceberg.SnapshotProducer [] - Committed snapshot 7536746147835307981 (MergeAppend) 2022-10-21 16:57:19,594 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Refreshing table metadata from new version: qbfs://online01/warehouse/xx/metadata/99171-45858db6-5917-4397-afcc-76c10ea80305.metadata.json 2022-10-21 16:57:19,750 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committed in 562 ms without flushing snapshot state to state backend the reason is metadata checkpoint information is diffrent from snapshotstate? since the jobfailed without flush snapshot state to state backend

stevenzwu commented 1 year ago

.avro might be a manifest file. do you have the complete stack trace? Which Flink version?

I couldn't find this log line in 1.13 (or 1.14 and 1.15).

2022-10-21 16:57:19,750 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committed in 562 ms
without flushing snapshot state to state backend

1.13 has log line without the part after Committed in 562 ms. https://github.com/apache/iceberg/blob/master/flink/v1.13/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java

chenwyi2 commented 1 year ago

.avro might be a manifest file. do you have the complete stack trace? Which Flink version?

I couldn't find this log line in 1.13 (or 1.14 and 1.15).

2022-10-21 16:57:19,750 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committed in 562 ms
without flushing snapshot state to state backend

1.13 has log line without the part after Committed in 562 ms. https://github.com/apache/iceberg/blob/master/flink/v1.13/flink/src/main/java/org/apache/iceberg/flink/sink/IcebergFilesCommitter.java

it is my mistake, the right log should be "2022-10-21 16:57:19,188 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committing append with 2 data files and 0 delete files to table icebergCatalog.xxx 2022-10-21 16:57:19,573 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Successfully committed to table icebergCatalog.xx 2022-10-21 16:57:19,573 INFO org.apache.iceberg.SnapshotProducer [] - Committed snapshot 7536746147835307981 (MergeAppend) 2022-10-21 16:57:19,594 INFO org.apache.iceberg.BaseMetastoreTableOperations [] - Refreshing table metadata from new version: qbfs://online01/warehouse/xx/metadata/99171-45858db6-5917-4397-afcc-76c10ea80305.metadata.json 2022-10-21 16:57:19,750 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Committed in 562 ms 2022-10-21 16:57:20,090 INFO org.apache.iceberg.flink.sink.IcebergFilesCommitter [] - Start to flush snapshot state to state backend, table: icebergCatalog.cupid_bi.ads_qixiao_tracking_hawkeye_no_filter_1min, checkpointId: 8227" but in my situation, the task existed without showing flushing snapshot state to state backend, because of nodemanager restart, then i restart fllink job, the job failed with FileNotFoundException in the hdfs audit log, i never saw that avro file has been created.

stevenzwu commented 1 year ago

Start to flush snapshot state to state backend

This happens in IcebergFilesCommitter#snapshotState. if checkpoint N didn't complete successfully, the written manifest file for the incomplete checkpoint won't be used because last completed checkpoint is N-1.

congd123 commented 1 year ago

@stevenzwu if this scenario happens

if checkpoint N didn't complete successfully, the written manifest file for the incomplete checkpoint won't be used because last completed checkpoint is N-1.

What is the best approach to recover the job?

I have a similar behavior using Flink 1.14.1 and Iceberg 1.0.0 (V2)

jad-grepr commented 1 month ago

Hi! We're running into this issue with Iceberg 1.5.2 and Flink 1.18.1. Seems like it's not yet fixed. We're happy to dedicate resources to fix it if we could get a pointer on resolving it. @stevenzwu would you please point us in the right direction? Thanks!