Eventual-Inc / Daft

Distributed data engine for Python/SQL designed for the cloud, powered by Rust
https://getdaft.io
Apache License 2.0

Add a data catalog API #2602

Open · kevinzwang opened this issue 3 months ago

kevinzwang commented 3 months ago

Currently, to work with catalog tables in Daft, users must connect to the catalog and fetch the table information themselves before calling daft.read_iceberg, daft.read_deltalake, or other table I/O functions. We could simplify this interface by letting users connect a catalog to Daft so that all of the catalog's tables are registered and usable with the planned daft.read_table, in SQL, etc.
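For concreteness, here is a rough before/after sketch in Python. The pyiceberg calls reflect today's workflow; daft.attach_catalog is a hypothetical name for the proposed registration step (daft.read_table is the planned reader mentioned above), and the REST catalog config is made up.

import daft
from pyiceberg.catalog import load_catalog

# Today: connect to the catalog and load table metadata yourself, then hand
# the resulting pyiceberg table object to Daft.
catalog = load_catalog("my_catalog", type="rest", uri="http://localhost:8181")
tbl = catalog.load_table("ns.events")
df = daft.read_iceberg(tbl)

# Proposed: register the catalog with Daft once, then address tables by name
# from Python or SQL. attach_catalog is a hypothetical name; read_table is
# the planned reader this issue refers to.
daft.attach_catalog(catalog, alias="my_catalog")
df = daft.read_table("my_catalog.ns.events")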

Catalogs we would like to support

Ways to connect to a catalog

universalmind303 commented 2 months ago

For SQL, I think it'd be really nice to support ATTACH/DETACH semantics, such as:

> ATTACH '/path/to/my_iceberg' FORMAT iceberg AS iceberg; -- format could potentially be inferred
> SELECT * FROM iceberg.tbl_1;

> ATTACH '/path/to/my_lancedb' FORMAT lance AS lancedb;
> SELECT * FROM lancedb.my_lance_table;

> DETACH lancedb;
> SELECT * FROM lancedb.my_lance_table; -- errors

jaychia commented 2 months ago

@universalmind303 do you think we should extend those semantics to work in Python as well?

I think one problem I see here is that some formats (e.g. Iceberg) aren't completely self-describing from a URL, and may require configuration such as catalog URL, catalog type, etc. This is easier to deal with in Python (we can use the Python SDK for Iceberg), but for SQL it might be tricky.

daft.attach_tables(PyIcebergCatalog, format="iceberg")
daft.attach_tables("s3://...", format="lance")
universalmind303 commented 2 months ago

> @universalmind303 do you think we should extend those semantics to work in Python as well?
>
> I think one problem I see here is that some formats (e.g. Iceberg) aren't completely self-describing from a URL, and may require configuration such as catalog URL, catalog type, etc. This is easier to deal with in Python (we can use the Python SDK for Iceberg), but for SQL it might be tricky.
>
> daft.attach_tables(PyIcebergCatalog, format="iceberg")
> daft.attach_tables("s3://...", format="lance")

Yeah, I definitely don't think we should have any SQL-exclusive features, so being able to perform the same operations in Python would be a necessity.
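For illustration, a minimal sketch of what that Python parity could look like, assuming a plain alias registry; attach, detach, and _attached are made-up names, not an existing or planned Daft API.

# Illustrative only: a module-level registry mirroring the SQL ATTACH/DETACH
# example above. None of these names exist in Daft today.
_attached: dict[str, tuple[str, object]] = {}

def attach(source: object, format: str, alias: str) -> None:
    # Self-describing formats (e.g. Lance) can attach from a bare path, while
    # Iceberg would pass a configured catalog object instead of a path.
    _attached[alias] = (format, source)

def detach(alias: str) -> None:
    if alias not in _attached:
        raise ValueError(f"no attached source named {alias!r}")
    del _attached[alias]

attach("/path/to/my_lancedb", format="lance", alias="lancedb")
detach("lancedb")
detach("lancedb")  # raises, matching the "errors" case in the SQL example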

RCHowell commented 2 months ago

For things that are not self-describing from a URL/URI, it may be worth looking at Hive's CREATE EXTERNAL TABLE syntax. It accepts several additional properties: the STORED BY and TBLPROPERTIES clauses allow for extra, arbitrary metadata.

For example,

CREATE EXTERNAL TABLE ice_fm_hive (i int)
STORED BY ICEBERG
TBLPROPERTIES ('metadata_location'='<object store or file system path>');

You could translate this to Python with something like:

# definition
def create_table(table_name, schema, **kwargs): ...

# example
daft.create_table("ice_fm_hive", <schema>, stored_by="iceberg", metadata_location="...")

This is a pretty literal translation, but I believe the picture is clear. Now, I am not familiar with ATTACH/DETACH semantics, which might likewise be extended with additional properties (as Hive did with CREATE TABLE), but there is a nice consistency in extending existing SQL standard DDL commands.

It looks like there's existing support for CREATE EXTERNAL TABLE in sqlparser.
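Putting the two proposals together, here is a hedged sketch of ATTACH-style registration in Python that also carries Hive-style arbitrary properties for formats that are not self-describing; every name below is hypothetical.

# Hypothetical: ATTACH-style registration that, like TBLPROPERTIES, carries
# arbitrary key/value metadata alongside the location. Not a real Daft API.
def attach(alias: str, location: str, format: str, **properties: str) -> dict:
    # Extra properties give Iceberg-style config (catalog type, metadata
    # location, ...) a natural home without changing the signature.
    return {"alias": alias, "location": location,
            "format": format, "properties": properties}

# Iceberg needs extra properties; a Lance path stands alone.
attach("ice_fm_hive", "s3://bucket/tbl", format="iceberg",
       metadata_location="s3://bucket/tbl/metadata/v1.json")
attach("lancedb", "/path/to/my_lancedb", format="lance")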