JanKaul / iceberg-rust

Rust implementation of Apache Iceberg with integration for Datafusion
Apache License 2.0

Question: Datafile writer #12

Closed. ForeverAngry closed this issue 1 month ago.

ForeverAngry commented 5 months ago

It looks like this project has the capability to insert and update records in an existing Iceberg table. Am I correct about this? Looking forward to hearing from you!

JanKaul commented 5 months ago

Hey, the crate supports inserts but not updates, because it can't handle deletes yet. The best way to insert is with datafusion.

Check out this test: https://github.com/JanKaul/iceberg-rust/blob/7b65b34504e710b62bd33d7d46c17be97929c08e/datafusion_iceberg/src/table.rs#L650
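
The insert path in that test boils down to registering the Iceberg table with datafusion and issuing a SQL INSERT. Roughly like this (a sketch only: the table and column names are placeholders, and the exact registration API should be checked against the linked test):

    use std::sync::Arc;

    use datafusion::prelude::SessionContext;
    use datafusion_iceberg::DataFusionTable;

    // Expose the iceberg table to datafusion as a TableProvider
    let ctx = SessionContext::new();
    ctx.register_table("orders", Arc::new(DataFusionTable::from(table)))?;

    // A plain SQL INSERT writes the data files and commits them to
    // the table; collect() drives the plan to completion
    ctx.sql("INSERT INTO orders (id, amount) VALUES (1, 100), (2, 250)")
        .await?
        .collect()
        .await?;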

Do you want to use it with datafusion or as a Rust library?

ForeverAngry commented 5 months ago

Hi! Well, I'll have to do some reading on datafusion; I'm not familiar with it. But I was hoping to use it in Rust, with a Polars code base I have.

JanKaul commented 5 months ago

For now, you can have a look at the following method:

https://github.com/JanKaul/iceberg-rust/blob/7b65b34504e710b62bd33d7d46c17be97929c08e/datafusion_iceberg/src/table.rs#L521

I will try to simplify the writer design for non-datafusion use cases.

ForeverAngry commented 5 months ago

That would be awesome! Also, I'd love to contribute. I've used the Java Iceberg writer a bit, but if you had a diagram or pseudocode of the steps to complete a successful transaction (update, merge, etc.) I'd be happy to help!

Also, do you have an example of how I could test the writer with AWS Glue?

JanKaul commented 5 months ago

I haven't implemented the AWS Glue catalog, so you might need to implement it yourself.
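
Implementing a catalog mostly means mapping the catalog operations onto Glue's table API. Iceberg's Glue convention stores the location of the current metadata file in the table's `metadata_location` parameter, so the lookup side would look roughly like this with the aws-sdk-glue crate (a sketch only; the exact SDK surface depends on the version):

    use aws_sdk_glue::Client;

    // Build a Glue client from the ambient AWS configuration
    let config = aws_config::load_from_env().await;
    let client = Client::new(&config);

    // The current metadata file location lives in the table's
    // "metadata_location" parameter
    let output = client
        .get_table()
        .database_name("my_schema")
        .name("my_table")
        .send()
        .await?;
    let metadata_location = output
        .table()
        .and_then(|t| t.parameters())
        .and_then(|p| p.get("metadata_location"))
        .cloned();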

If you have the catalog, writing looks something like this:

    use iceberg_rust::arrow::write_parquet_partitioned;
    // (further import paths may differ slightly between versions)
    use iceberg_rust::catalog::identifier::Identifier;
    use iceberg_rust::catalog::tabular::Tabular;
    use iceberg_rust::error::Error;

    // Get table from catalog (catalog operations are async)
    let tabular = catalog
        .load_table(Identifier::parse("my_catalog.my_schema.my_table")?)
        .await?;

    // Make sure it's a table and not a view
    let table = if let Tabular::Table(table) = &tabular {
        Ok(table)
    } else {
        Err(Error::InvalidFormat(
            "database entity is not a table".to_string(),
        ))
    }?;

    // Write the arrow batches as parquet files to the object_store,
    // partitioned according to the table's partition spec
    let metadata_files = write_parquet_partitioned(table, arrow_batches, None).await?;

    // Append the new files to the table in a single atomic commit
    table
        .new_transaction(None)
        .append(metadata_files)
        .commit()
        .await?;

JanKaul commented 5 months ago

If you want to read iceberg tables with polars, this crate is not the best option for you. When used with polars it can't do partition pruning, so every read is a full table scan. The Apache iceberg-rust repo is working on an expression system that will make this possible.

However, if you use this crate with datafusion, it does perform partition pruning.
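
For example, with the table registered as in the earlier insert sketch, a filter on a partition column lets datafusion skip whole partitions instead of scanning every data file (table and column names are placeholders):

    // Only the matching partitions are read from object storage
    let batches = ctx
        .sql("SELECT id, amount FROM orders WHERE order_date = '2024-01-01'")
        .await?
        .collect()
        .await?;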

ForeverAngry commented 5 months ago

I'm open to using datafusion to read the data; the real need I have is just to be able to write partitioned Iceberg files using a Glue catalog.

JanKaul commented 5 months ago

If I have time, I'll look into the Glue catalog, but it could be a while.

JanKaul commented 1 month ago

As the REST catalog is becoming the standard catalog implementation, I'm not planning to add HMS support.