duckdb / duckdb_iceberg

MIT License

Table schema evolution support #33

Closed: harel-e closed this issue 3 months ago

harel-e commented 7 months ago

First of all, thank you for this useful extension. I can use the extension to read Iceberg tables just fine, but as soon as the schema changes, the extension throws an error.

Use case:

Using Trino to manage Iceberg on AWS/Glue/S3, I issued the following:

create table test(a int);
insert into test values(1);
alter table test add column b int;
insert into test values(2,5);

select * from test;

 a | b
---+------
 1 | NULL
 2 | 5

Using the latest metadata file, I issued the following in DuckDB (after loading the aws and iceberg extensions):

select * from ICEBERG_SCAN('s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/metadata/00005-5a37e1af-dbf3-48c7-b7c7-11309ecc6279.metadata.json');

Error: IO Error: Failed to read file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085221_00020_8b7ws-30eb76d3-2f3e-4a01-a030-976f7640d26a.parquet": schema mismatch in glob: column "b" was read from the original file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085314_00023_8b7ws-1e49b4ae-9d23-469b-8e00-6dd48da1d0a0.parquet", but could not be found in file "s3:///test-0fc6feb5c66b4915b39dd2d0511105b7/data/20231204_085221_00020_8b7ws-30eb76d3-2f3e-4a01-a030-976f7640d26a.parquet".
Candidate names: a
If you are trying to read files with different schemas, try setting union_by_name=True

As I mentioned above, reading the data before the schema change worked just fine.
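
To make the expected behavior concrete, here is a minimal Python sketch of how rows from data files written under different schemas could be resolved against the table's current column list, with missing columns filled as NULL. It is purely illustrative and not how the extension implements it: the function `read_by_name` and the in-memory "files" are hypothetical, and matching columns by name is a simplification, since Iceberg itself tracks columns by field ID.

```python
# Illustrative only: project rows from files written under different
# schemas onto the table's current columns, filling gaps with None (NULL).

def read_by_name(files, current_columns):
    """Return all rows projected onto current_columns.

    files is a list of (column_names, rows) pairs, one per data file.
    Columns a file does not contain come back as None.
    """
    rows = []
    for file_columns, file_rows in files:
        for row in file_rows:
            record = dict(zip(file_columns, row))
            rows.append(tuple(record.get(col) for col in current_columns))
    return rows

# Two hypothetical data files mirroring the example above: one written
# before "alter table test add column b int", one written after it.
old_file = (["a"], [(1,)])
new_file = (["a", "b"], [(2, 5)])

result = read_by_name([old_file, new_file], ["a", "b"])
print(result)  # [(1, None), (2, 5)]
```

This matches the result Trino returns for the table: the row inserted before the schema change gets NULL for b.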

Thank you, Harel

samansmink commented 7 months ago

Hey @harel-e, this was actually just added in https://github.com/duckdb/duckdb_iceberg/pull/30. It will be available in the next release of DuckDB, which is scheduled for the end of January.