kevinzwang opened this issue 3 months ago
For SQL, I think it'd be really nice to support ATTACH/DETACH semantics. For example, given tables at:

/path/to/my_iceberg/tbl_1
/path/to/my_lancedb/...
> ATTACH '/path/to/my_iceberg' FORMAT iceberg AS iceberg; # potentially infer the format
> SELECT * FROM iceberg.tbl_1;
> ATTACH '/path/to/my_lancedb' FORMAT lance AS lancedb;
> SELECT * FROM lancedb.my_lance_table;
> DETACH lancedb;
> SELECT * FROM lancedb.my_lance_table; # errors
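In Python, the same semantics could look like a small session-level alias registry. A minimal sketch, assuming a hypothetical Session class (attach, detach, and read_table below are illustrative names, not an existing Daft API; daft.read_lance is the existing reader):

# Hypothetical sketch: ATTACH/DETACH as a session-level alias registry.
import daft

class Session:
    def __init__(self):
        self._attached = {}  # alias -> (path, format)

    def attach(self, path, format, alias):
        # ATTACH: register a root path under an alias;
        # tables then resolve as `alias.table`.
        self._attached[alias] = (path, format)

    def detach(self, alias):
        # DETACH: subsequent `alias.table` references error out.
        del self._attached[alias]

    def read_table(self, qualified_name):
        alias, table = qualified_name.split(".", 1)
        if alias not in self._attached:
            raise ValueError(f"no attached source named {alias!r}")
        path, format = self._attached[alias]
        if format == "lance":
            return daft.read_lance(f"{path}/{table}")
        raise NotImplementedError(f"format {format!r} not wired up in this sketch")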
@universalmind303 do you think we should extend those semantics to work in Python as well?
I think one problem I see here is that some formats (e.g. Iceberg) aren't completely self-describing from a URL, and may require configuration such as a catalog URL, catalog type, etc. This is easier to deal with in Python (we can use the Python SDK for Iceberg), but for SQL it might be tricky.
daft.attach_tables(PyIcebergCatalog, format="iceberg")
daft.attach_tables("s3://...", format="lance")
Yeah, I definitely don't think we should have any SQL-exclusive features, so I think being able to perform the same operations in Python would be a necessity.
For things which are not self-describing from a URL/URI, it may be worth looking at Hive's CREATE EXTERNAL TABLE syntax, which accepts several additional properties. In particular, the STORED BY and TBLPROPERTIES clauses allow for additional, arbitrary metadata.
For example,
CREATE EXTERNAL TABLE ice_fm_hive (i int)
STORED BY ICEBERG
TBLPROPERTIES ('metadata_location'='<object store or file system path>')
You could translate this to Python with something like:
# definition
def create_table(table_name, schema, **kwargs): ...
# example
daft.create_table("ice_fm_hive", <schema>, stored_by="iceberg", metadata_location="...")
This is a pretty literal translation, but I believe the picture is clear. I am not familiar with ATTACH/DETACH semantics, which might likewise be extended with additional properties (as Hive did with CREATE TABLE), but there is a nice consistency in extending existing SQL-standard DDL commands.
It looks like there's existing support for CREATE EXTERNAL TABLE in sqlparser.
Currently, to work with catalog tables in Daft, users must connect to the catalog and fetch the catalog table information themselves before calling daft.read_iceberg, daft.read_deltalake, or other table I/O functions. We could simplify this interface by allowing users to connect a catalog to Daft and have all of the catalog's tables be registered and usable with the planned daft.read_table, in SQL, etc.

Catalogs we would like to support
Ways to connect to a catalog
daft.register_unity_tables(..) and similar
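For a concrete before/after, here is a sketch of today's workflow against the proposed one. daft.read_iceberg and PyIceberg's load_catalog exist today; daft.attach_tables and daft.read_table are the proposed/planned names from this thread, and the catalog properties are placeholders:

import daft
from pyiceberg.catalog import load_catalog

# Today: the user connects to the catalog, loads the table themselves,
# and hands the table object to Daft (placeholder catalog properties).
catalog = load_catalog("my_catalog", **{"type": "rest", "uri": "https://my-catalog-endpoint/"})
table = catalog.load_table("my_namespace.my_table")
df = daft.read_iceberg(table)

# Proposed: register the catalog with Daft once, then refer to tables by name.
# daft.attach_tables(catalog, format="iceberg")
# df = daft.read_table("my_namespace.my_table")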