Apache Iceberg
https://iceberg.apache.org/
Apache License 2.0

spark-procedures migrating tables can pose fatal problems #8425

Closed vinnielhj closed 2 days ago

vinnielhj commented 1 year ago

Apache Iceberg version

1.3.1 (latest release)

Query engine

Spark

Please describe the bug 🐞

Environment: Spark 3.2.1, Iceberg 1.3.1, org.apache.iceberg.spark.SparkSessionCatalog

Description: I have a Hive table test.sample and I run CALL spark_catalog.system.migrate(table => 'test.sample') to migrate it to an Iceberg table. After migration there are two directories on the file system, test.db/sample/ and test.db/samplebackup, and SHOW TABLES also lists two tables, test.sample and test.samplebackup. Once I have verified that the migrated data is correct, I may want to delete the backup table test.samplebackup. However, when I execute DROP TABLE test.samplebackup, the table is deleted and the test.db/samplebackup directory on the file system is deleted along with it. When I then query the Iceberg table, test.sample throws a file-not-found error, because its metadata still references files that no longer exist. I don't think this is reasonable; in my opinion, dropping the backup should remove only the catalog metadata, not the data files.

Steps:

  1. CREATE TABLE test.sample (id bigint, data string) stored as parquet;
  2. insert into test.sample values (1,'22');
  3. CALL spark_catalog.system.migrate(table => 'test.sample');
  4. drop table test.samplebackup;
  5. select * from test.sample;

manuzhang commented 1 year ago

I see the same issue with CALL catalog_name.system.migrate(table => 'db.sample', drop_backup => true), which is a shortcut for the steps above.

@aokolnychyi @RussellSpitzer Is this by design for this action?

RussellSpitzer commented 1 year ago

If I remember correctly, the issue here is that we make a backup of the catalog entry; for managed tables this also covers the files, but for external tables it does not. Since it is not a backup of the table itself, just a backup of the reference to the table, dropping it can have deleterious effects in some cases, if you drop with purge, I'm guessing.

But I am a bit confused: which files are being dropped that are required? Shouldn't all the Iceberg references point to the original location?
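
One way to answer that question empirically (a hedged suggestion, not something from this thread): before dropping the backup table, list where the migrated Iceberg table's data files actually live using Iceberg's `files` metadata table.

```sql
-- List the data file paths the migrated Iceberg table references.
-- Iceberg exposes these through the 'files' metadata table.
SELECT file_path FROM test.sample.files;

-- If the paths point under the backup table's directory, dropping the
-- (managed) backup table will delete these files and break the table.
```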

manuzhang commented 1 year ago

I'm testing with a managed table. The directories after migration are as follows:

/user/hive/warehouse/db/sample/metadata
/user/hive/warehouse/db/sample_backup_

There's no copy of the original data, and if sample_backup_ is dropped, db.sample can't be queried anymore.

vinnielhj commented 10 months ago

Sorry, I haven't followed up on this question for a long time due to personal reasons.

I think the backup table could be designed as an external table. Then, when the backup table is deleted, only its metadata would be removed and the data files would be kept, so the Iceberg table would not be affected.

In certain circumstances it is necessary to delete the backup table. Currently, performing that deletion also deletes the data files, which makes the Iceberg table unavailable.
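
The suggestion above can also be applied manually today as a workaround (a hedged sketch, not part of the migrate procedure; it assumes a Hive-backed catalog that honors the 'EXTERNAL' table property):

```sql
-- Flip the backup table to EXTERNAL before dropping it, so that
-- DROP TABLE removes only the catalog entry and keeps the data files.
ALTER TABLE test.samplebackup SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');

DROP TABLE test.samplebackup;  -- metadata only; files on disk remain

SELECT * FROM test.sample;     -- the migrated Iceberg table still works
```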

Looking forward to your reply @aokolnychyi @RussellSpitzer @manuzhang

manuzhang commented 10 months ago

Our users would like to have Iceberg's table data separated from the backup data after migration. Hence, I modified the migrate procedure so that when a location for the Iceberg table is provided, the data is copied over.
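
For illustration, such a call might look like the following. Note that the `location` argument is hypothetical here: it reflects the modification described above, not the upstream migrate procedure's signature.

```sql
-- Hypothetical sketch: 'location' is not an upstream migrate() parameter.
-- Under the modification described above, providing a separate location
-- for the Iceberg table causes the data files to be copied there,
-- so dropping the backup table afterwards cannot break the Iceberg table.
CALL catalog_name.system.migrate(
  table    => 'db.sample',
  location => 'hdfs://warehouse/db/sample_iceberg'
);
```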

github-actions[bot] commented 3 weeks ago

This issue has been automatically marked as stale because it has been open for 180 days with no activity. It will be closed in next 14 days if no further activity occurs. To permanently prevent this issue from being considered stale, add the label 'not-stale', but commenting on the issue is preferred when possible.

github-actions[bot] commented 2 days ago

This issue has been closed because it has not received any activity in the last 14 days since being marked as 'stale'.