Closed sushi30 closed 11 months ago
I don't think there's any guarantee of keeping the API consistent between the Iceberg SparkAction and SparkProcedure. The procedure can be exposed to and used by clients who are more familiar with the Spark SQL interface, while the SparkAction provides more versatile capabilities that allow native integration in Java or Scala.
If you want to run a multithreaded delete with Spark 3.1 actions, here is how it can be done in Scala/Java:
```scala
import org.apache.iceberg.Table
import org.apache.iceberg.actions.DeleteOrphanFiles
import org.apache.iceberg.spark.actions.SparkActions
import org.apache.spark.sql.SparkSession

import java.util.concurrent.Executors

class RemoveOrphansAPI {

  def removeOrphansWithSparkAction(
      sparkSession: SparkSession,
      table: Table,
      threadsCount: Int,
      olderThanTS: Long
  ): DeleteOrphanFiles.Result = {
    // Pool the action uses to issue file deletes concurrently
    val executor = Executors.newFixedThreadPool(threadsCount)
    val result: DeleteOrphanFiles.Result = SparkActions
      .get(sparkSession)
      .deleteOrphanFiles(table)
      .olderThan(olderThanTS)
      .executeDeleteWith(executor)
      .execute()
    // execute() blocks until all deletes finish, so a plain shutdown suffices
    executor.shutdown()
    result
  }
}
```
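For reference, a minimal sketch of how this helper might be invoked, assuming an existing Spark session; the table name `db.sample` and the 3-day cutoff are placeholders, and `Spark3Util.loadIcebergTable` is one way to obtain the `Table` handle:

```scala
import java.util.concurrent.TimeUnit

import org.apache.iceberg.Table
import org.apache.iceberg.spark.Spark3Util
import org.apache.spark.sql.SparkSession

object RemoveOrphansExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("remove-orphans").getOrCreate()
    // Hypothetical table name; resolve via the configured catalog
    val table: Table = Spark3Util.loadIcebergTable(spark, "db.sample")
    // Only consider files older than 3 days to avoid racing in-flight writes
    val cutoff = System.currentTimeMillis() - TimeUnit.DAYS.toMillis(3)
    val result = new RemoveOrphansAPI()
      .removeOrphansWithSparkAction(spark, table, threadsCount = 4, olderThanTS = cutoff)
    result.orphanFileLocations().forEach(loc => println(s"Deleted: $loc"))
  }
}
```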
This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in the next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.
This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.
Apache Iceberg version
1.2.1 (latest release)
Query engine
Spark
Please describe the bug 🐞
According to the Iceberg docs, there are two methods for running Spark maintenance procedures: SQL procedures and Spark actions.
For remove_orphan_files, the SQL procedure has a
max_concurrent_deletes
parameter that has no counterpart in the Spark action DeleteOrphanFiles. While it is theoretically possible to imitate the SQL behavior using
executeDeleteWith
, the method used under the hood is a private method, removeService
. Therefore this is not possible, and it creates a discrepancy between the two APIs.
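For comparison, this is roughly how the SQL side is invoked; the call follows the remove_orphan_files procedure documented by Iceberg, but the catalog name `my_catalog`, table `db.sample`, and thread count 4 are placeholders:

```scala
import org.apache.spark.sql.SparkSession

object RemoveOrphansViaProcedure {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("remove-orphans-sql").getOrCreate()
    // The procedure spins up its own delete pool sized by max_concurrent_deletes;
    // there is no equivalent knob on the DeleteOrphanFiles action itself.
    spark.sql(
      """CALL my_catalog.system.remove_orphan_files(
        |  table => 'db.sample',
        |  max_concurrent_deletes => 4
        |)""".stripMargin
    ).show()
  }
}
```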