apache / iceberg-rust

Apache Iceberg
https://rust.iceberg.apache.org/

Integrate with datafusion #242

Open ZENOTME opened 3 months ago

ZENOTME commented 3 months ago

After supporting basic scan and catalog, we can consider integrating with DataFusion to speed up data-driven tests.

ZENOTME commented 3 months ago

I'm working on a draft so that we can have a clearer discussion around it. Blocked by #277 for now.

marvinlanhenke commented 3 months ago

@ZENOTME I'm interested in your approach; perhaps you can outline what you are going to do (high-level). I'm just curious and want to understand / research where those integration points / interfaces might be. Thanks in advance and best regards.

ZENOTME commented 3 months ago

Thanks for raising this discussion @marvinlanhenke! The basic idea for the integration is to provide wrapper structs around the types in iceberg-rust so that users can use them to connect with DataFusion directly.

Implementation outline

1. Implement the traits for managing the Iceberg table.

DataFusion provides the following traits to manage tables:

  • CatalogProviderList
  • CatalogProvider
  • SchemaProvider
  • TableProvider

We can map each of them to a corresponding type in iceberg-rs and implement them by wrapping that type internally, like:

struct IcebergCatalogProvider {
    // Catalog is a trait in iceberg-rs, so the wrapper holds it behind a trait object.
    inner: Arc<dyn iceberg_rs::Catalog>,
}

impl CatalogProvider for IcebergCatalogProvider {
    ...
}
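To make that wrapping a bit more concrete: DataFusion's CatalogProvider methods are synchronous, while the Catalog trait in iceberg-rust is async, so the adapter will likely need to pre-fetch or cache whatever it exposes synchronously. Below is a minimal sketch under that assumption (the published crate is imported as iceberg, used here in place of the iceberg_rs shorthand above); the cached schema_names field and the IcebergSchemaProvider mentioned in the comments are illustrative, not existing APIs.

use std::any::Any;
use std::sync::Arc;

// Import paths differ slightly across DataFusion versions.
use datafusion::catalog::{CatalogProvider, SchemaProvider};
// Assuming the Catalog trait re-exported at the iceberg crate root.
use iceberg::Catalog;

struct IcebergCatalogProvider {
    inner: Arc<dyn Catalog>,
    // Namespace names fetched once up front (e.g. in an async constructor),
    // because CatalogProvider's methods are sync while Catalog::list_namespaces is async.
    schema_names: Vec<String>,
}

impl CatalogProvider for IcebergCatalogProvider {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema_names(&self) -> Vec<String> {
        self.schema_names.clone()
    }

    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>> {
        // Here we would hand back an IcebergSchemaProvider (a sibling adapter over
        // the same catalog) for the requested namespace; left out of this sketch.
        let _ = name;
        todo!()
    }
}

Caching at construction time is only one way to bridge the sync/async gap; blocking on a runtime handle inside the sync methods is another, and which one the draft uses is an open design question.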

2. Implement the trait for scanning the table.

We also need to implement an ExecutionPlan for the scan in TableProvider. For this part we can rely on TableScan in iceberg-rs.
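For orientation, here is a rough sketch of the table side, assuming the TableProvider signatures current at the time of writing (they shift between DataFusion versions); IcebergTableProvider and the IcebergTableScanExec mentioned in the comment are illustrative names rather than existing types, and the actual execution plan is left as a stub.

use std::any::Any;
use std::sync::Arc;

use async_trait::async_trait;
use datafusion::arrow::datatypes::SchemaRef;
use datafusion::datasource::TableProvider;
use datafusion::error::Result;
use datafusion::execution::context::SessionState;
use datafusion::logical_expr::{Expr, TableType};
use datafusion::physical_plan::ExecutionPlan;
use iceberg::table::Table;

// Illustrative adapter: wraps an iceberg-rust Table and exposes it to DataFusion.
struct IcebergTableProvider {
    inner: Table,
    // Arrow schema converted from the Iceberg schema when the provider is built.
    schema: SchemaRef,
}

#[async_trait]
impl TableProvider for IcebergTableProvider {
    fn as_any(&self) -> &dyn Any {
        self
    }

    fn schema(&self) -> SchemaRef {
        self.schema.clone()
    }

    fn table_type(&self) -> TableType {
        TableType::Base
    }

    async fn scan(
        &self,
        _state: &SessionState,
        _projection: Option<&Vec<usize>>,
        _filters: &[Expr],
        _limit: Option<usize>,
    ) -> Result<Arc<dyn ExecutionPlan>> {
        // Here we would build an iceberg TableScan from self.inner and wrap it in a
        // custom ExecutionPlan (say, an IcebergTableScanExec) whose execute() streams
        // the scan's Arrow record batches. Left as a stub in this sketch.
        todo!()
    }
}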

Feel free to suggest anything that can be improved, and please let me know if something is confusing.

marvinlanhenke commented 3 months ago

DataFusion provides the following traits to manage tables:

  • CatalogProviderList
  • CatalogProvider
  • SchemaProvider
  • TableProvider

Thank you so much for taking the time and making the effort to outline the approach.

I was just looking for those traits you mentioned. The rest is basically (over-simplified) just providing an adapter, which is reasonable and easy to understand.

Perhaps one more question, though, to clarify or to solidify my understanding...

...we would have to add datafusion as a dependency and implement those traits on our side, in order to provide the specific implementation of a CatalogProvider e.g. for the HiveMetastore. Then, a user can add our 'catalog provider' crate to their project alongside datafusion and use our provider. Is that correct?
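If so, end-user code might then look roughly like this (a hedged sketch: the iceberg_datafusion crate name and the try_new constructor are assumptions for illustration, not existing APIs, and try_new is assumed to return a DataFusion Result here):

use std::sync::Arc;

use datafusion::error::Result;
use datafusion::prelude::SessionContext;
use iceberg::Catalog;

// Hypothetical adapter crate and type from the sketches above.
use iceberg_datafusion::IcebergCatalogProvider;

// `catalog` can be any iceberg-rust Catalog implementation, e.g. one backed by
// the Hive Metastore mentioned above.
async fn query_iceberg(catalog: Arc<dyn Catalog>) -> Result<()> {
    let ctx = SessionContext::new();

    // Wrap the iceberg catalog and register it under a catalog name of our choosing.
    let provider = IcebergCatalogProvider::try_new(catalog).await?;
    ctx.register_catalog("iceberg", Arc::new(provider));

    // From here, plain DataFusion SQL can address Iceberg tables.
    let df = ctx.sql("SELECT * FROM iceberg.my_namespace.my_table LIMIT 10").await?;
    df.show().await?;

    Ok(())
}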

Thanks again for explaining the approach.