Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

SnapshotTableProcedure to migrate iceberg tables from one namespace to another #10262

Open Gowthami03B opened 5 months ago

Gowthami03B commented 5 months ago

Feature Request / Improvement

Hello

The current snapshot procedure (https://iceberg.apache.org/docs/nightly/spark-procedures/?h=spark_catalog#snapshot) only seems to support migrating external (Hive) tables to Iceberg tables.

However, we have a use case where we want to migrate some of our tables from one namespace to another and later run alter-schema operations (which are metadata-only). The snapshot procedure would have worked perfectly for this, since it reuses the underlying data files while writing the new table's metadata to a new location. The rest of the tables in the old namespace will have to be backfilled because they need major changes, but we could avoid a lot of effort and storage space (we are talking TBs here) if we could use the snapshot procedure for the others.

spark_jdbc_config = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.my_catalog": "org.apache.iceberg.spark.SparkCatalog",
    "spark.sql.catalog.my_catalog.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.my_catalog.uri": "jdbc:comdb2://",
    "spark.sql.catalog.my_catalog.warehouse": "s3a://abc",
}
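Roughly, the session is built like this (simplified for the example; the app name and the builder loop are illustrative, not our exact setup):

from pyspark.sql import SparkSession

# Sketch: apply the spark_jdbc_config dict above to a SparkSession builder.
# The app name is illustrative; the JDBC URI and warehouse bucket are
# placeholders, as in the dict above.
builder = SparkSession.builder.appName("iceberg-namespace-migration")
for key, value in spark_jdbc_config.items():
    builder = builder.config(key, value)
spark = builder.getOrCreate()

With that session, the snapshot call fails: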
spark.sql(
    """
        CALL my_catalog.system.snapshot(
            source_table => 'ns1.src_dataset',
            table => 'ns2.src_dataset',
            location => 's3a://abc'
        )
    """
)
# Fails with: SparkConnectGrpcException: (org.apache.iceberg.exceptions.NoSuchTableException)
# Cannot not find source table 'datasets.equitynamr'

Here, my_catalog is the JDBC catalog that holds both namespaces (ns1 and ns2) and all of our tables.

When I try to provide source_table as a fully qualified name (my_catalog.ns1.src_dataset), I get this: IllegalArgumentException: Cannot snapshot a table that isn't in the session catalog (i.e. spark_catalog). Found source catalog: test.
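For reference, the fully qualified attempt is the same call with only source_table changed:

spark.sql(
    """
        CALL my_catalog.system.snapshot(
            source_table => 'my_catalog.ns1.src_dataset',
            table => 'ns2.src_dataset',
            location => 's3a://abc'
        )
    """
)
# IllegalArgumentException: Cannot snapshot a table that isn't in the
# session catalog (i.e. spark_catalog). Found source catalog: test.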

I also tried explicitly creating the table with a catalog entry for 'spark_catalog', and that resulted in: IllegalArgumentException: Cannot use non-v1 table 'ns1.src_datasets' as a source.
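That attempt looked roughly like the following (a sketch of one way to register a catalog entry under the spark_catalog name; the exact settings we used may have differed):

# Sketch: route the Spark session catalog through Iceberg's SparkSessionCatalog
# so that spark_catalog-qualified identifiers are handled by Iceberg.
spark_session_catalog_config = {
    "spark.sql.extensions": "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    "spark.sql.catalog.spark_catalog": "org.apache.iceberg.spark.SparkSessionCatalog",
    "spark.sql.catalog.spark_catalog.catalog-impl": "org.apache.iceberg.jdbc.JdbcCatalog",
    "spark.sql.catalog.spark_catalog.uri": "jdbc:comdb2://",
    "spark.sql.catalog.spark_catalog.warehouse": "s3a://abc",
}
# Snapshotting through spark_catalog then fails because the Iceberg source is
# not a Spark v1 table:
# IllegalArgumentException: Cannot use non-v1 table 'ns1.src_datasets' as a source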

Is there any workaround to achieve my use case? Does this seem like a valid request that can be accommodated?

I also tried exploring the add_files procedure, but it currently only takes an S3 path prefix pointing at the source table's data files location, not a list of file paths from the current snapshot's data files. It would be more helpful to be able to add only the files that are part of the current snapshot.
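For reference, the add_files shape we explored looks roughly like this (the prefix path is illustrative); it imports every file found under the prefix rather than only the files referenced by the source table's current snapshot:

# Sketch: add_files with a path-based source table. The S3 prefix below is
# illustrative; add_files walks the prefix and does not accept an explicit
# list of data files from a given snapshot.
spark.sql(
    """
        CALL my_catalog.system.add_files(
            table => 'ns2.src_dataset',
            source_table => '`parquet`.`s3a://abc/ns1/src_dataset/data`'
        )
    """
)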

Query engine

Spark

gauthamnair commented 4 months ago

We are experiencing the same problem: https://apache-iceberg.slack.com/archives/C025PH0G1D4/p1718395195306979