Open alamb opened 2 months ago
Regarding the point on storing data - i wonder if the CacheManager could be extended / used to serialize the files / file_stats caches.
Regarding the point on storing data - i wonder if the CacheManager could be extended / used to serialize the files / file_stats caches.
That is an excellent idea -- it would be really sweet to have an excuse to work on that API (I bet it is not used anywhere near as much as it could be)
I have subsequently learned about the dft
config file system that has certain overlap with what is described above
However, one difference is that the config file basically applies to all sessions, where this catalog would be very explicitly selected by a user when running
Thus it seem like the config file would be a great place to put credentials, for example, that you wanted to apply to all catalogs / sessions
There are two different setups that I think are relevant to this feature:
datafusionrc
file which is simply a SQL file at ~/.datafusion/.datafusionrc
that is meant to have DDL that users would want available at start up.For example, this is mine:
CREATE SCHEMA staging;
CREATE SCHEMA production;
CREATE EXTERNAL TABLE staging.min_aggs STORED AS PARQUET LOCATION 'ny2://sip/aggregates/minute_by_ticker_monthly_v2/year_month=2015-08/data.01.parquet';
CREATE EXTERNAL TABLE staging.trades_v2 STORED AS PARQUET LOCATION 'ny2://atlas/sip/trades_v2/date=2024-08-30/2024-08-30.01.parquet';
CREATE EXTERNAL TABLE "production".min_aggs STORED AS PARQUET LOCATION 'ny5://sip/aggregates/minute_by_ticker_monthly/year_month=2015-08/data.01.parquet';
CREATE TABLE wide AS VALUES (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20);
I think the setup for this could be improved (its still based on the old setup from a couple years ago). I think moving the file to ~/.config/dft
(so all config in same location) and renaming to something like ddl.sql
in that directory would be better.
One thing to note, I have been having trouble getting the custom catalog / schemas to work (created an issue for it. Maybe just a user error, but i havent had time to look too deeply into it.
I agree that there is an importatnt distinction between "configuration" (with e.g. credentials) that potentially apply to all sessions and the DDL/catalog setup part
I think some systems permit creating credentials that are stored as part of the catalog (CREATE CREDENTIALS ....
). This kind of makes sense if you think about the catalog / table definitions as application configuration. However this seems like a substantial security risk to me as well. 🤔
I'll have to mess with it to see what is happening
This is my own personal aspirations / goals for a "file based catalog"
Usecase
Usecase 1: pre-configured
EXTERNAL TABLES
I would like to be able to setup some table definitions in dft and then reuse them from session to session
For example
And then have this configuration available to any dft session
I believe this usecase is already partly handled by the
config
file feature. However, there are some other things I would like:Usecase 1: ephemeral data
Today when you run queries like this in
dft
If you start another session of
dft
foo
is gone:The issue is that the default catalog in datafusion is an ephemeral file based one so there is no place to store data such as shown above.
Desired Behavior
What I would like is for dft to operate similarly to sqlite or duckdb.
By default, an ephemeral in memory catalog is used and nothing is saved after the session quits
However a database file can be "opened" and if so then all changes made to the catalog are stored in that file. If the file is reopened on a subsequent invocation of the program all the DDL / catalog information is still present
Something like