apache / iceberg

Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

Inconsistent API for remove_orphan_files and DeleteOrphanFiles #7480

Closed sushi30 closed 11 months ago

sushi30 commented 1 year ago

Apache Iceberg version

1.2.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

According to the Iceberg docs, there are two ways to run Spark procedures:

  1. SQL
  2. Spark Actions

For remove_orphan_files, the SQL procedure has a max_concurrent_deletes parameter that does not exist in the Spark action DeleteOrphanFiles.

In theory, the SQL behavior can be imitated with executeDeleteWith, but the method the procedure uses under the hood, removeService, is private. It therefore cannot be reused from the action, which creates a discrepancy between the two APIs.
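
For reference, the SQL side of the comparison might look like the sketch below. It only builds the CALL statement as a string. The catalog name `my_catalog` and table name `db.sample` in the usage note are hypothetical placeholders, and the helper `removeOrphanFilesCall` is made up for illustration; only the procedure name and the `table` and `max_concurrent_deletes` arguments come from Iceberg.

```scala
object RemoveOrphanFilesSql {

  // Hypothetical helper: builds the CALL statement for the remove_orphan_files procedure.
  // The catalog and table names are supplied by the caller and are not validated here.
  def removeOrphanFilesCall(catalog: String, table: String, maxConcurrentDeletes: Int): String =
    s"CALL ${catalog}.system.remove_orphan_files(" +
      s"table => '${table}', max_concurrent_deletes => ${maxConcurrentDeletes})"
}
```

With a SparkSession that has the Iceberg SQL extensions enabled, the statement could then be run as `spark.sql(RemoveOrphanFilesSql.removeOrphanFilesCall("my_catalog", "db.sample", 4))`.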

dramaticlly commented 1 year ago

I don't think there is any guarantee that the API stays consistent between Iceberg's Spark actions and Spark procedures. Procedures are exposed for clients who are more familiar with the Spark SQL interface, while actions provide more versatile capabilities and allow native integration in Java or Scala.

If you want to run multithreaded deletes with Spark 3.1 actions, here is how it can be done from Scala (the same API is available from Java):

import org.apache.iceberg.Table
import org.apache.iceberg.actions.DeleteOrphanFiles
import org.apache.iceberg.spark.actions.SparkActions
import org.apache.spark.sql.SparkSession

import java.util.concurrent.Executors

class RemoveOrphansAPI {

  // Deletes orphan files older than olderThanTS (epoch millis) using a pool of threadsCount threads.
  def removeOrphansWithSparkAction(
      sparkSession: SparkSession,
      table: Table,
      threadsCount: Int,
      olderThanTS: Long
  ): DeleteOrphanFiles.Result = {

    val executor = Executors.newFixedThreadPool(threadsCount)
    try {
      SparkActions
        .get(sparkSession)
        .deleteOrphanFiles(table)
        .olderThan(olderThanTS)
        .executeDeleteWith(executor) // run file deletes on the custom pool
        .execute()
    } finally {
      // Always release the pool, even if the action throws.
      executor.shutdown()
    }
  }
}
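
The executor handling above can be factored out so the pool is always released. Below is a minimal sketch using only the JDK; `withFixedThreadPool` and its `awaitSeconds` parameter are hypothetical names for illustration, not part of Iceberg.

```scala
import java.util.concurrent.{ExecutorService, Executors, TimeUnit}

object ExecutorLifecycle {

  // Hypothetical helper: runs `body` with a fixed-size pool and always shuts the pool down.
  def withFixedThreadPool[T](threads: Int, awaitSeconds: Long = 60L)(body: ExecutorService => T): T = {
    val executor = Executors.newFixedThreadPool(threads)
    try body(executor)
    finally {
      executor.shutdown() // stop accepting new tasks; already submitted tasks still run
      // Wait for remaining tasks; cancel them if they do not finish in time.
      if (!executor.awaitTermination(awaitSeconds, TimeUnit.SECONDS)) {
        executor.shutdownNow()
      }
    }
  }
}
```

With Iceberg on the classpath, the action could then be wrapped as `withFixedThreadPool(threadsCount) { pool => SparkActions.get(sparkSession).deleteOrphanFiles(table).olderThan(olderThanTS).executeDeleteWith(pool).execute() }`.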

github-actions[bot] commented 11 months ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 11 months ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'