apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

RemoveOrphanFiles - Does not work / Error when try to remove orphan files from s3 with glue catalog run from EMR #3054

Open raghavendraD opened 2 years ago

raghavendraD commented 2 years ago

Hi,

RemoveOrphanFiles works only with Hadoop FS/IO, when run locally with a Hadoop catalog. When I run it from EMR against S3 files using the Glue catalog, it throws the error below. I have tried Iceberg 0.11 and 0.12 with Spark 3.0.1 and 3.1.1 (all combinations), and I have tried both the Actions API and the Spark Actions API calls. The result does not change.

Actions.forTable(table).removeOrphanFiles().olderThan(removeOrphanFilesOlderThan).execute();

or

SparkActions.get().deleteOrphanFiles(table).olderThan(removeOrphanFilesOlderThan).execute();

The error is:

21/08/31 05:40:36 ERROR RemoveOrphanFilesMaintenanceJob: Error in RemoveOrphanFilesMaintenanceJob - removeOrphanFilesOlderThanTimestamp, Illegal Arguments in table properties - Can't parse null value from table properties, tenant: tenantId1, table: lakehouse_database.mobiletest1, removeOrphanFilesOlderThan: 1630388136606, Status: Failed, Reason: {}.
java.lang.IllegalArgumentException: Cannot find the metadata table for glue_catalog.lakehouse_database.mobiletest1 of type ALL_MANIFESTS
    at org.apache.iceberg.spark.SparkTableUtil.loadMetadataTable(SparkTableUtil.java:634)
    at org.apache.iceberg.spark.actions.BaseSparkAction.loadMetadataTable(BaseSparkAction.java:153)
    at org.apache.iceberg.spark.actions.BaseSparkAction.buildValidDataFileDF(BaseSparkAction.java:119)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.doExecute(BaseDeleteOrphanFilesSparkAction.java:154)
    at org.apache.iceberg.spark.actions.BaseSparkAction.withJobGroupInfo(BaseSparkAction.java:99)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:141)
    at org.apache.iceberg.spark.actions.BaseDeleteOrphanFilesSparkAction.execute(BaseDeleteOrphanFilesSparkAction.java:76)
    at org.apache.iceberg.actions.RemoveOrphanFilesAction.execute(RemoveOrphanFilesAction.java:87)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFilesOlderThanTimestamp(RemoveOrphanFilesMaintenanceJob.java:273)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.removeOrphanFiles(RemoveOrphanFilesMaintenanceJob.java:133)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.job.RemoveOrphanFilesMaintenanceJob.maintain(RemoveOrphanFilesMaintenanceJob.java:58)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.run(LakeHouseTableMaintenanceJob.java:136)
    at com.salesforce.cdp.spark.core.job.SparkJob.submitAndRun(SparkJob.java:76)
    at com.salesforce.cdp.lakehouse.spark.tablemaintenance.LakeHouseTableMaintenanceJob.main(LakeHouseTableMaintenanceJob.java:236)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.base/java.lang.reflect.Method.invoke(Method.java:566)
    at org.apache.spark.deploy.yarn.ApplicationMaster$$anon$2.run(ApplicationMaster.scala:735)
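As a side check, olderThan(...) takes its cutoff as an epoch-millisecond timestamp, so the removeOrphanFilesOlderThan value in the log above can be verified independently of Iceberg. The sketch below uses only java.time. The class OrphanCutoff and its methods are hypothetical helpers for illustration, not part of Iceberg or of the job in the stack trace, and the three-day retention in main is an arbitrary assumption.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical helper for illustration only; not an Iceberg API.
final class OrphanCutoff {
    // Render an epoch-millisecond cutoff as an ISO-8601 instant in UTC.
    static String decode(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }

    // Compute a cutoff lying `age` before `now`, as epoch milliseconds,
    // which is the unit that olderThan(...) expects.
    static long cutoffBefore(Instant now, Duration age) {
        return now.minus(age).toEpochMilli();
    }

    public static void main(String[] args) {
        // Value copied from the log line above.
        System.out.println(decode(1630388136606L));

        // Assumed retention of three days, chosen only for the example.
        long cutoff = cutoffBefore(Instant.now(), Duration.ofDays(3));
        System.out.println(cutoff + " = " + decode(cutoff));
    }
}
```

The logged value decodes to 2021-08-31T05:35:36.606Z, a well-formed timestamp, which suggests the failure comes from loading the ALL_MANIFESTS metadata table rather than from the cutoff argument.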

Is this something to do with my implementation, or is it a bug in Iceberg? Or am I missing something here? Please help!

Thanks, Raghu

nastra commented 10 months ago

This might be fixed by https://github.com/apache/iceberg/pull/7914