datafusion-contrib / datafusion-dft

An opinionated and batteries included DataFusion implementation.
Apache License 2.0
115 stars 8 forks source link

Persistent file based catalog #122

Open alamb opened 2 months ago

alamb commented 2 months ago

This is my own personal aspirations / goals for a "file based catalog"

Usecase

Usecase 1: pre-configured EXTERNAL TABLES

I would like to be able to setup some table definitions in dft and then reuse them from session to session

For example

CREATE EXTERNAL TABLE ... STORED AS DELTA TABLE WITH CREDENTIALS ....

And then have this configuration available to any dft session

I believe this usecase is already partly handled by the config file feature. However, there are some other things I would like:

  1. Issue SQL commands to create these tables in a session and then have those tables available in the session without having to copy/paste the DDL into a new file
  2. Store data that is not DDL (such as parquet metadata fetched from a remote object store to save having to access it again)

Usecase 1: ephemeral data

Today when you run queries like this in dft

CREATE TABLE foo(x int);
INSERT INTO foo VALUES (1);
SELECT * FROM foo;

If you start another session of dft foo is gone:

│select * from foo;   
│Error during planning: table 'datafusion.public.foo' not found

The issue is that the default catalog in datafusion is an ephemeral file based one so there is no place to store data such as shown above.

Desired Behavior

What I would like is for dft to operate similarly to sqlite or duckdb.

By default, an ephemeral in memory catalog is used and nothing is saved after the session quits

However a database file can be "opened" and if so then all changes made to the catalog are stored in that file. If the file is reopened on a subsequent invocation of the program all the DDL / catalog information is still present

Something like

.open mycatalog.dat
-- foo is now saves into mycatalog.dat
CREATE TABLE foo(x int)
matthewmturner commented 2 months ago

Regarding the point on storing data - i wonder if the CacheManager could be extended / used to serialize the files / file_stats caches.

alamb commented 2 months ago

Regarding the point on storing data - i wonder if the CacheManager could be extended / used to serialize the files / file_stats caches.

That is an excellent idea -- it would be really sweet to have an excuse to work on that API (I bet it is not used anywhere near as much as it could be)

alamb commented 2 months ago

I have subsequently learned about the dft config file system that has certain overlap with what is described above

However, one difference is that the config file basically applies to all sessions, where this catalog would be very explicitly selected by a user when running

Thus it seem like the config file would be a great place to put credentials, for example, that you wanted to apply to all catalogs / sessions

matthewmturner commented 2 months ago

There are two different setups that I think are relevant to this feature:

For example, this is mine:

CREATE SCHEMA staging;
CREATE SCHEMA production;

CREATE EXTERNAL TABLE staging.min_aggs STORED AS PARQUET LOCATION 'ny2://sip/aggregates/minute_by_ticker_monthly_v2/year_month=2015-08/data.01.parquet';
CREATE EXTERNAL TABLE staging.trades_v2 STORED AS PARQUET LOCATION 'ny2://atlas/sip/trades_v2/date=2024-08-30/2024-08-30.01.parquet';

CREATE EXTERNAL TABLE "production".min_aggs STORED AS PARQUET LOCATION 'ny5://sip/aggregates/minute_by_ticker_monthly/year_month=2015-08/data.01.parquet';

CREATE TABLE wide AS VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);

I think the setup for this could be improved (its still based on the old setup from a couple years ago). I think moving the file to ~/.config/dft (so all config in same location) and renaming to something like ddl.sql in that directory would be better.

One thing to note, I have been having trouble getting the custom catalog / schemas to work (created an issue for it. Maybe just a user error, but i havent had time to look too deeply into it.

alamb commented 2 months ago

I agree that there is an importatnt distinction between "configuration" (with e.g. credentials) that potentially apply to all sessions and the DDL/catalog setup part

I think some systems permit creating credentials that are stored as part of the catalog (CREATE CREDENTIALS ....). This kind of makes sense if you think about the catalog / table definitions as application configuration. However this seems like a substantial security risk to me as well. 🤔

I'll have to mess with it to see what is happening