duckdb / duckdb_iceberg

MIT License

Support to skip schema inference #45

Closed devendrasr closed 3 months ago

devendrasr commented 3 months ago

The current version does not support parsing complex data types when inferring the schema from the snapshot. Until that support arrives, I am introducing a flag that skips schema inference and offloads schema parsing to the underlying parquet extension. Here is how to use it:

scan data:

SELECT * FROM iceberg_scan('s3://my-bucket/icebergwh/someschema/t01', skip_schema_inference = true) LIMIT 10;

scan metadata:

SELECT * FROM iceberg_metadata('s3://my-bucket/icebergwh/someschema/t01', skip_schema_inference = true) LIMIT 10;

scan snapshots:

SELECT * FROM iceberg_snapshots('s3://my-bucket/icebergwh/someschema/t01', skip_schema_inference = true) LIMIT 10;

Note: I am closing an earlier PR that requested these changes but was harder to follow: https://github.com/duckdb/duckdb_iceberg/pull/43

samansmink commented 3 months ago

looks good, thanks!

harel-e commented 3 months ago

@samansmink - Hi, I downloaded the DuckDB nightly and didn't find this feature (skip_schema_inference). Will it be part of the upcoming 0.10.2? Thanks.

samansmink commented 3 months ago

@harel-e are you sure? It works for me:

force install iceberg from 'http://nightly-extensions.duckdb.org';
load iceberg;
FROM iceberg_metadata('my_iceberg_table', skip_schema_inference = true);

harel-e commented 3 months ago

@samansmink - I wasn't aware of force install, but it still failed.

Using the nightly build binary:

./duckdb
v0.10.2-dev265 2687e2d6d9
Enter ".help" for usage hints.
Connected to a transient in-memory database.
Use ".open FILENAME" to reopen on a persistent database.

D force install iceberg from 'http://nightly-extensions.duckdb.org';
HTTP Error: Failed to download extension "iceberg" at URL "http://nightly-extensions.duckdb.org/2687e2d6d9/osx_arm64/iceberg.duckdb_extension.gz"
Extension "iceberg" is an existing extension.

Are you using a development build? In this case, extensions might not (yet) be uploaded.

samansmink commented 3 months ago

@harel-e yea, we don't have good update semantics (yet) for extensions. Force installing overrides your current installation with whatever you provide; otherwise DuckDB will not update, because it thinks iceberg is already installed.

> Using the nightly build binary

That's a bit quirky at the moment: we distribute nightly extension binaries that target the latest stable release of DuckDB, and we distribute nightly DuckDB binaries with stable versions of extensions. But we do not automatically distribute nightly extensions for nightly DuckDB binaries, so these can sometimes lag behind.

I will bump the iceberg extension in duckdb main, which should resolve this.
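
When a plain install does not pick up a newer build, it can help to check which DuckDB build is running and whether an iceberg binary is already on disk. The following is a minimal sketch, assuming an interactive DuckDB session; the choice of columns is illustrative and the output depends on the local installation:

```sql
-- Show the running DuckDB version and the source commit it was built from.
PRAGMA version;

-- Show whether iceberg is already installed or loaded, and where the
-- installed binary lives. An already-installed extension is not replaced
-- by a plain INSTALL, which is why FORCE INSTALL is needed to override it.
SELECT extension_name, installed, loaded, install_path
FROM duckdb_extensions()
WHERE extension_name = 'iceberg';
```

If `installed` is already true, a regular `install iceberg` leaves the existing binary in place, which matches the behavior described above.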

harel-e commented 2 months ago

@samansmink - Thank you for making this change available in the extensions. Will this PR be available in the upcoming 0.10.2 release as part of the stable extension version (i.e. by just using 'install iceberg')?