apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/
Apache License 2.0

idea: Introduce `memory` catalog #412

Open Xuanwo opened 1 week ago

Xuanwo commented 1 week ago

Hi, I came up with this idea while trying to create quick demos showcasing the capabilities and cool features of iceberg-rust. However, I found that setting up the catalog consumed most of the time, which isn't ideal for attracting new users or contributors.

I propose introducing a short-lived, in-memory catalog as an ideal starting point for either testing iceberg-rust or using it statelessly.

The design details are still open, and I would like to seek comments and feedback on this idea. What do you think? Would you find such a catalog useful?

Xuanwo commented 1 week ago

User Story A:

I'm a downstream user of Iceberg. I'm attempting to integrate Iceberg into my project and need unit tests to ensure the correctness of my Iceberg-related code. However, I've discovered that I must first connect to a catalog. Although setting up a REST catalog is quick, it doesn't suit my needs well.

Xuanwo commented 1 week ago

User Story B:

I'm an external consumer of Iceberg tables. My clients store TiBs of Iceberg data in S3 using their own catalogs. Please note, I don't have access to their catalog systems. The only thing available to me is paths to different tables. Which catalog should I set up to read/fetch data from these Iceberg tables?


I know we have StaticTable, but using it requires:

```rust
use iceberg::io::FileIO;
use iceberg::table::StaticTable;
use iceberg::TableIdent;

async fn example() {
    // The table's current metadata file must be known up front.
    let metadata_file_location = "s3://bucket_name/path/to/metadata.json";
    let file_io = FileIO::from_path(metadata_file_location)
        .unwrap()
        .build()
        .unwrap();
    let static_identifier = TableIdent::from_strs(["static_ns", "static_table"]).unwrap();
    let static_table =
        StaticTable::from_metadata_file(metadata_file_location, static_identifier, file_io)
            .await
            .unwrap();
    println!("{:?}", static_table.metadata());
}
```

I want:

```rust
let table2 = catalog
    .load_table(&TableIdent::from_strs(["default", "t2"]).unwrap())
    .await
    .unwrap();
println!("{:?}", table2.metadata());
```
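Under the hood, such a catalog could start as little more than a map from table identifiers to metadata locations. The sketch below is a hypothetical, dependency-free illustration of the idea; the names (`MemoryCatalog`, `register`, `load`) are made up, and a real implementation would implement iceberg-rust's `Catalog` trait and hold full table metadata rather than just locations:

```rust
use std::collections::HashMap;

/// Hypothetical sketch of an in-memory catalog: a map from
/// "namespace.table" identifiers to current metadata file locations.
#[derive(Default)]
struct MemoryCatalog {
    tables: HashMap<String, String>,
}

impl MemoryCatalog {
    /// Record (or replace) the metadata location for a table.
    fn register(&mut self, ident: &str, metadata_location: &str) {
        self.tables
            .insert(ident.to_string(), metadata_location.to_string());
    }

    /// Look up a table's metadata location, if registered.
    fn load(&self, ident: &str) -> Option<&String> {
        self.tables.get(ident)
    }
}

fn main() {
    let mut catalog = MemoryCatalog::default();
    catalog.register("default.t2", "s3://bucket/metadata/v1.json");
    // Loading a registered table returns its metadata location.
    println!("{:?}", catalog.load("default.t2"));
    // Unknown tables simply return None.
    assert!(catalog.load("default.missing").is_none());
}
```

Everything lives for the lifetime of the process, which is exactly the short-lived, stateless behavior the proposal asks for.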
JanKaul commented 1 week ago

Great idea, I think this could be really useful. We should be able to get this kind of behavior with the SQL catalog and an in-memory SQLite database.

liurenjie1024 commented 1 week ago

+1 for this idea.

liurenjie1024 commented 1 week ago

This would be useful not only in unit tests, but also in our examples for demonstration.

Fokko commented 1 week ago

Great idea @liurenjie1024 I'm all for it!

> I'm an external consumer of Iceberg tables. My clients store TiBs of Iceberg data in S3 using their own catalogs. Please note, I don't have access to their catalog systems. The only thing available to me is paths to different tables. Which catalog should I set up to read/fetch data from these Iceberg tables?

I'm not sure this is the best example. Ideally, when you have a fully functioning catalog, you should be able to expose it with the right privileges (it can sit behind VPNs, etc.). It is bad practice to register a table in multiple catalogs, since updates to the table won't be tracked across catalogs.

> I know we have StaticTable

StaticTable serves a different purpose and is just meant for accessing read-only tables.

In PyIceberg we had a MemoryCatalog in tests for a long while, and at some point there was a discussion about moving it outside of the test directory. In the end we did not do this, and we used the SqlCatalog with a SQLite backend instead. This can work both fully in-memory and persisted locally (for example in /tmp/). I think the ability to persist will benefit both testing and demonstration, since not all data is gone after the process exits. Also, when we implement writing, we can leverage the locking mechanism of the DBMS.
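For reference, the PyIceberg setup described above is just catalog configuration. The fragment below follows PyIceberg's `.pyiceberg.yaml` convention with an SQL catalog; the catalog name and paths are illustrative:

```yaml
catalog:
  demo:
    type: sql
    # Persisted locally: the catalog survives process exit.
    uri: sqlite:////tmp/pyiceberg_catalog.db
    # Or fully in-memory (state is gone when the process exits):
    # uri: sqlite:///:memory:
```

Switching between ephemeral and persisted modes is a one-line change to the connection URI, which is what makes the SQLite backend attractive for both testing and demos.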

Xuanwo commented 1 week ago

> This can work both fully in-memory, and also persisted locally (for example in /tmp/).

Seems like a great idea!

The situation differs slightly on the Rust side, as we might not want to depend on SQLite, which significantly increases our build time. Perhaps we could support both: memory and SQLite.
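One way to offer both without everyone paying the SQLite build-time cost could be Cargo feature flags. The feature and dependency names below are hypothetical, not the crate's actual layout:

```toml
# Hypothetical feature layout (sketch): the in-memory catalog needs no
# extra dependencies, while the SQLite-backed catalog is opt-in.
[features]
default = ["catalog-memory"]
catalog-memory = []
catalog-sql = ["dep:sqlx"]
```

Users who only want quick demos or unit tests would build with the default features; those who need local persistence would enable the SQL-backed catalog explicitly.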

Fokko commented 6 days ago

As long as both of them are getting maintained :)