Open innocent123 opened 11 months ago
@innocent123: I do not really understand your question, but I think your problem might be similar to #5846.
When I use the Spark API rewriteDataFiles, it reports "org.apache.iceberg.exceptions.RuntimeIOException: Failed to get block locations for path: hdfs://xxxx/iceberg-hive/hadoop_prod/iceberg_x5l_pre/xxx/data/sample_date=20231107/00009-0-54295f94-ef26-44be-b6b8-ca3e472d9482-00010.parquet"
How do I fix this error?
This data file is lost, but its entry in the manifest still exists.
Spark version: 3.0.2
Iceberg version: 1.0.0
I guess the missing data file was deleted by deleteOrphanFiles with olderThan(System.currentTimeMillis()), which mistakenly treated uncommitted data files as orphan files and deleted them from the file system. You can try something like olderThan(System.currentTimeMillis() - 12 hours) instead.
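To make the grace-period suggestion concrete, here is a minimal sketch. The pure-Java part below runs as-is; the Iceberg action itself is only shown in comments because it needs a live SparkSession and table (`spark` and `table` are placeholders, and the exact SparkActions wiring should be verified against your Iceberg version):

```java
import java.util.concurrent.TimeUnit;

public class OrphanGracePeriod {
    /** "Now" minus a grace period, so recently written (possibly uncommitted) files survive. */
    static long safeOlderThan(long nowMillis, long graceHours) {
        return nowMillis - TimeUnit.HOURS.toMillis(graceHours);
    }

    public static void main(String[] args) {
        long threshold = safeOlderThan(System.currentTimeMillis(), 12);
        System.out.println("olderThan threshold: " + threshold);

        // Hedged sketch of the actual action (requires iceberg-spark-runtime on the
        // classpath; not executed here):
        //
        // SparkActions.get(spark)
        //     .deleteOrphanFiles(table)
        //     .olderThan(threshold)   // only files older than now - 12h are candidates
        //     .execute();
    }
}
```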
I executed deleteOrphanFiles with olderThan(System.currentTimeMillis() - 300000) and then executed rewriteDataFiles. My table still reports "org.apache.iceberg.exceptions.RuntimeIOException: Failed to get block locations for path". How can I operate on the table without hitting this error? How do I repair this table?
The problem is that a path recorded in the manifest file no longer exists in HDFS, and I cannot modify the manifest file directly. Is there an API that can modify the manifest file to delete a certain record?
If you want to repair the table by ignoring the deleted files, org.apache.iceberg.Table#newDelete can remove files from Iceberg manifests. However, this approach can lead to data loss.
Another approach is to roll back to a historical snapshot, but that may not be feasible if you have already performed snapshot expiration.
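A minimal sketch of that repair path: the runnable part below only decides which manifest entries point at vanished files (the existence check is stubbed with a predicate), while the actual newDelete commit is shown in comments because it needs a live table (`table` is a placeholder):

```java
import java.util.List;
import java.util.function.Predicate;
import java.util.stream.Collectors;

public class DanglingEntryRepair {
    /** Referenced paths whose files are gone; candidates for Table#newDelete().deleteFile(...). */
    static List<String> entriesToDrop(List<String> referenced, Predicate<String> existsInFs) {
        return referenced.stream()
                .filter(p -> !existsInFs.test(p))
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> referenced = List.of("data/00001.parquet", "data/00009.parquet");
        // Stand-in for a real FileSystem.exists check: pretend only 00001 survived.
        List<String> toDrop = entriesToDrop(referenced, p -> p.contains("00001"));
        System.out.println(toDrop);

        // Hedged sketch of the commit against a live org.apache.iceberg.Table:
        //
        // DeleteFiles delete = table.newDelete();
        // toDrop.forEach(delete::deleteFile);  // drops the metadata entry, not the data
        // delete.commit();                     // one new snapshot; those rows are lost
    }
}
```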
Thank you very much! My historical snapshots are lost, so I can only recover by deleting the file references. One more question: is there a way to quickly tell which files referenced in the manifests no longer exist? Or do I have to write code to pull the manifest files and compare them with the HDFS directory?
I'm not sure if by "no longer exists" you mean the file doesn't exist in the file system or in the Iceberg metadata. However, these APIs can provide some information:
https://iceberg.apache.org/docs/latest/spark-queries/#manifests
https://iceberg.apache.org/docs/latest/spark-queries/#all-manifests
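For the "write the code myself" route, the check itself is small. Below is a local sketch using plain Java NIO as a stand-in for the Hadoop FileSystem API; in a real run the paths would come from the metadata tables above (or the table's files metadata table) and existence would be checked with `FileSystem.exists` against HDFS:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class MissingFileCheck {
    /** Return the subset of referenced paths that do not exist on disk. */
    static List<Path> findMissing(List<Path> referenced) {
        List<Path> missing = new ArrayList<>();
        for (Path p : referenced) {
            if (!Files.exists(p)) {
                missing.add(p);
            }
        }
        return missing;
    }

    public static void main(String[] args) throws IOException {
        // Simulate a warehouse directory with one surviving data file.
        Path warehouse = Files.createTempDirectory("warehouse");
        Path present = Files.createFile(warehouse.resolve("00001-data.parquet"));
        Path deleted = warehouse.resolve("00009-data.parquet"); // never created

        System.out.println(findMissing(List.of(present, deleted)));
    }
}
```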
The file doesn't exist in the file system, but it still exists in the Iceberg metadata.
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
Query engine
When I write to tables with Flink, I periodically run metadata maintenance with Spark: rewriteDataFiles, rewriteManifests, expireSnapshots, and deleteOrphanFiles are executed in sequence, with expireOlderThan(System.currentTimeMillis()) set on expireSnapshots and olderThan(System.currentTimeMillis()) set on deleteOrphanFiles. The first Spark maintenance run succeeds, but the second run reports "File does not exist": a data file referenced by a manifest_file is missing.
My question is:
2. Now both table queries and merge operations fail with this error; how do I restore the table?
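One way to reduce the race between Flink's in-flight commits and the maintenance job described above is to give both expiry calls a retention margin instead of "now". The runnable part below just computes the two cutoffs; the Iceberg calls are sketched in comments only (`table` and `spark` are placeholders, and the exact wiring is an assumption to verify against your version):

```java
import java.util.concurrent.TimeUnit;

public class MaintenanceThresholds {
    /** Snapshots older than one day are eligible for expiration. */
    static long snapshotCutoff(long nowMillis) {
        return nowMillis - TimeUnit.DAYS.toMillis(1);
    }

    /** Files younger than 12 hours are never treated as orphans. */
    static long orphanCutoff(long nowMillis) {
        return nowMillis - TimeUnit.HOURS.toMillis(12);
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        System.out.println(snapshotCutoff(now) < orphanCutoff(now));

        // Hedged sketch of the maintenance sequence with margins:
        //
        // table.expireSnapshots()
        //      .expireOlderThan(snapshotCutoff(now))
        //      .retainLast(5)                    // keep a few snapshots to roll back to
        //      .commit();
        //
        // SparkActions.get(spark)
        //     .deleteOrphanFiles(table)
        //     .olderThan(orphanCutoff(now))     // leaves recent (uncommitted) Flink files alone
        //     .execute();
    }
}
```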