Closed alamb closed 3 weeks ago
I had a related problems a couple of weeks ago, I wanted to distribute a CLI that had some additional udfs bundled with it.
I wonder if we need to set up a plugin system for the CLI. I am happy to make a design proposal about it @alamb
I wonder if we need to set up a plugin system for the CLI. I am happy to make a design proposal about it @alamb
That would be awesome
Another thing that comes to mind might be for you to fork this datafusion-cli code and bundle your UDFs with it.
Or we could publish most of the code for this CLI to crates.io with the idea that people could make their own CLI mashups with a little bit of configuration setup. The ideas are endless!
Love this idea. I had been working on something similar in the past but unfortunately life got in the way so wasnt able to push it as far as I wanted.
I am currently building a new terminal app with DataFusion at the core, but it is more domain specific than general purpose like I think the new proposed tool is expected to be (and if i wasnt working on that, I would probably volunteer to push this forward). That being said, and depending on the direction of this new tool, there could be some potential overlap / common functionality with what im working on. If thats the case then I would be very happy to collaborate on those pieces. I hope to open source what I'm working on in the fall around the time of v0.1 release. I'll keep my eye on this as it progresses though.
Making generic plugin system would be great, but may be a bit of a challenge with lack of stable rust ABI.
Making some kind of CliContextBuilder
where all udfs, table factories ... could be registered may be easier to do.
Maybe existing CliSessionContext
could be extended adding one liner methods for most common extensions
I think this is a great idea.
I don't know if others are interested / if this matches the need but it would be cool to make a "CLI-driven query engine frontend" that isn't coupled to any particular query engine (e.g. it could send queries as either SQL or substrait, potentially via something like adbc).
In other words, it could be something like squirrel / dbeaver for analytics.
I think this is a great idea. Happy to do the work to integrate https://github.com/datafusion-contrib/datafusion-table-providers into it, I think it would be a cool use-case to have.
Will do some homework. Maybe we only need a datafusion-cli repo that one can fork stand-alone to create a proprietary CLI distribution and the plugin system is effectively an overkill
How should we go about dfdb? Do we think we should have "flavors", maybe like the following?
Or should we aim at having a single "flavor" only?
I don't know if others are interested / if this matches the need but it would be cool to make a "CLI-driven query engine frontend" that isn't coupled to any particular query engine (e.g. it could send queries as either SQL or substrait, potentially via something like adbc).
In other words, it could be something like squirrel / dbeaver for analytics.
I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.
I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.
This sounds like "Ibis" for CLIs and I think is a neat idea
However, I am personally not likely to work on such a thing -- I am far more interested in showing off / using the DataFusion execution engine more broadly, in part as I think it will drive more use and thus more contributions back to DataFusion
So therefore I am not opposed to creating some sort of pluggable backend query engine system, but I think I would like to focus on the DataFusion one.
I am thinking what I will try to do if no one beats me to it is to sketch out what a dfdb
repo might look like
I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.
Surely this should just be a CLI that speaks arrow flight SQL? I've tried to persuade ClickHouse to adopt arrow flight SQL to get around their woeful Python client.
That sounds interesting, but I agree with @alamb that that's a different question.
I like the broad idea here, while I like datafusion-cli
, it definitely feels incomplete.
Where I disagree (I think) with @alamb is around what it lacks — my main problem with datafusion-cli
is the "UI" — all the small parts of the best CLIs which make the difference:
\d
or \dt
datafusion-functions-json
would be good enoughThere's one more thing that I've wanted from a database CLI many times and nothing (AFAIK) supports:
Anyway, that's my wish list, as you might guess I've thought about this a fair bit!
One last thing I'll say — virtually all of the above features would be independent of the query engine being used, so there is a good argument to build one great CLI user experience (or let someone else build it) and make it pluggable into any query engine that speaks Arrow Flight SQL, whether that be over a network, or just as an API within a single binary or between dynamically linked libaries.
One last thing I'll say — virtually all of the above features would be independent of the query engine being used, so there is a good argument to build one great CLI user experience (or let someone else build it)
this seems to be the goal of the https://github.com/xo/usql project. I am not saying DF CLI shouldn't try to address those goals, just acknowledging it's both ambitions and potentially not central to the project (DF considers itself a query engine).
Where I disagree (I think) with @alamb is around what it lacks — my main problem with datafusion-cli is the "UI" — all the small parts of the best CLIs which make the difference:
indeed, if the goal is to make a duckdb
replacement this would be a pretty great list for someone to work on
However, as my personal goal is not (at least not yet 😆 ) to make a duckdb
replacement, so I am not likely going to able to focus on this list. I will definitely steal it as "wishlist for dfdb" however 🚀
Surely this should just be a CLI that speaks arrow flight SQL? I've tried to persuade ClickHouse to adopt arrow flight SQL to get around their woeful Python client.
Another possibility might be to leverage ADBC (which has a flight SQL driver). DuckDB implements the ADBC interface, so between that and the other available drivers for ADBC, it might be more beneficial than just using flight SQL?
Love this idea. I had been working on something similar in the past but unfortunately life got in the way so wasnt able to push it as far as I wanted.
Just want to highlight the existing tool authored by @matthewmturner one more time -- perhaps there is no need for another terminal app as the existing one (datafusion-tui) seems to fit the goal (independent contrib app with the focus on the UI / additional plugins), and it just needs more attention/efforts.
Just want to highlight the existing tool authored by @matthewmturner one more time -- perhaps there is no need for another terminal app as the existing one (datafusion-tui) seems to fit the goal (independent contrib app with the focus on the UI / additional plugins), and it just needs more attention/efforts.
I agree -- @matthewmturner what do you think about rekindling the 🔥 around dft
(the name is sufficiently short that I like it!)? https://github.com/datafusion-contrib/datafusion-tui
I am going to be on vacation next week so may have some time to play around with this idea -- perhaps I can make some PRs to that repo?
I would definitely be happy to have work pick back up there. I do believe that it's original goal is generally aligned with the above comments / wish list.
Based on my current work on the other TUI I am working on, which has similar functionality, I think theres some cleanup that will need to be done (updating dependencies and perhaps refactoring the event loop / handler ). I would be happy to drive the efforts on that to get everything back up to date (as well as cleanup the issue backlog).
Separately, I have some code in my current project that helps converting record batches to a Table
format that ratatui understands and I think it would also be helpful here. I had plans to eventually turn this into a sub crate but it probably makes more sense in datafusion-contrib - perhaps in a new repo like datafusion-tui-tools
so that other projects which want to build TUIs with datafusion can benefit (i guess right now thats just my separate project and dft
but maybe this will inspire / enable more domain specific ones which would be cool).
One other note, I had initially made dft
compatible with both datafusion
and ballista
- I think we'll need to decide whether thats still needed. perhaps having a nice client for ballista could help drive more interest in it? @andygrove maybe you have thoughts.
@alamb it would certainly be good to get your view on the current state of the project / any PRs / new issues. An approach that might be practical in the short run is if I focus on getting the core app structure / dependencies updated and to the extent you or anyone else has time if we could get datafusion / arrow / object store updated (its still on the old datafusion-contrib object store).
I'm currently on vacation but would be able to start on this next week if this is ultimately the direction that we decide to go.
@alamb it would certainly be good to get your view on the current state of the project / any PRs / new issues. An approach that might be practical in the short run is if I focus on getting the core app structure / dependencies updated and to the extent you or anyone else has time if we could get datafusion / arrow / object store updated (its still on the old datafusion-contrib object store).
Sounds good - I'll try and update the dependencies as a way to get familiar with the project. This is going to be great
I spent some time poking around with it and it has some quite cool UI -- I think what I am most passionate about is setting up the integrations with iceberg-rs, s3, delta-rs, etc. I am not very good at UIs / interactions
As long as dft can be used to script things (as in run a file of sql commands) I think it will be good for me.
As long as dft can be used to script things (as in run a file of sql commands) I think it will be good for me.
Yes, it can already do that.
I am definitely also passionate about the integrations point (i had some ideas for that in mind from the beginning such as delta, flightsql, and excel.
I think this is a great idea. Happy to do the work to integrate https://github.com/datafusion-contrib/datafusion-table-providers into it, I think it would be a cool use-case to have.
I'd like to highlight the idea of having table providers in datafusion-table-providers
and integrating them in datafusion-tui
from there. This way the providers are also readily available for those embedding DataFusion as a library, which aligns closely with its intended use.
I'd like to highlight the idea of having table providers in datafusion-table-providers and integrating them in datafusion-tui from there. This way the providers are also readily available for those embedding DataFusion as a library, which aligns closely with its intended use.
I agree this sounds like the ideal setup to me -- having them in a separate repo would help keep the boundaries clear as well
@alamb When did you plan on starting to work on this? On my flight home i managed to get a good chunk through a clean rewrite leveraging the setup from my other app which is a much better setup for moving forward. I still have some more work to do but maybe by end of weekend it would be done.
I also may try to pause my other work for a few weeks to focus on dft to see if it can get in a decent shape for the upcoming NYC meetup.
This is the branch with the rewrite.
@alamb When did you plan on starting to work on this?
I keep telling myself "tomorrow" but then I end up getting carried away reviewing all the other good stuff going on (eg.. https://github.com/apache/datafusion/pull/12044 https://github.com/apache/datafusion/pull/12095 etc 🤣 )
THis week I am on vacation, so I have a bit more time for some fun projects (at least that is what I am telling myself)
On my flight home i managed to get a good chunk through a clean rewrite leveraging the setup from my other app which is a much better setup for moving forward. I still have some more work to do but maybe by end of weekend it would be done.
Nice! I'll go check it out now
@alamb i still need to add back the query history / context info and i am improving the ergonomics of navigating query results now.
For anyone interested, i have been on a bit of a fast and furious dev sprint making updates to datafusion-tui. ive basically done a clean rewrite to modernize it and ive gotten the following features to work
localhost:50051
but it will be configurable soon)CREATE EXTERNAL TABLE table STORED AS DELTATABLE
I plan on adding one more feature (new tab for storing query execution results / stats) and then will pause feature development to focus on cleanup / usability improvements / testing / docs / etc.
If anyone wants to help test it would be appreciated else i just keep plugging away :)
if you want all those features you can run cargo r --features=flightsql,s3,deltalake
Somewhat related to a great CLI experience, "leading FROM" duckdb syntax would make any autocomplete much more useful — https://github.com/sqlparser-rs/sqlparser-rs/issues/1400.
@samuelcolvin autocomplete and syntax highlighting are features i would like to support - but i dont expect to get around to them in the short term
I think dft
https://github.com/datafusion-contrib/datafusion-dft seems to be serving this purpose well
I think dft is pretty close, so I am claiming this is done
TLDR
datafusion-cli
in the apache/datafusion repodfdb
(ordatafusion-cli++
ordfcli
) which is purposely designed for running queries against a wide variety of pre-integrated sourcesProblem Statement
As of today,
datafusion-cli
(docs) serves two roles:duckdb
CLI toolIt is really sweet to have a CLI that lets you query a directory of parquet files
However, similarly to the discussion with have had with
datafusion-pytyhon
this dual role leads to a tension between keeping the core lean and easier to embed (e.g. fewer dependencies) and making a better CLI experienceExamples of Friction
I have recently seen some PRs that are basically integrations that would make datafusion-cli a better end user tool, but bring more dependencies and complexity to DataFusion. For example
I realize I have been partly responsible for this mess and for that I apologize.
Proposal
I propose resolving this conflict by creating a new repository for the "CLI tool people actually use"
We would keep
datafusion-cli
as it is, a relatively small and a thin wrapper around the core engine. I don't think we should remove features but we also wouldn't add them (other than what was added to the engine by default)We would add many new features / capabilitues to this
dfdb
toolExamples of new features
There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):
The providers in https://github.com/datafusion-contrib/datafusion-table-providers from @phillipleblanc et al
@philippemnoel actually referrs to the lack of built in Apache Iceberg support in his blog about switching to using duckdb. This is sad given all the code to use datafuson and delta exists, there just isn't a pre-integrated binary that shows how to hook it up and it easy to get up and use
Other cool features
There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:
CREATE EXTERNAL TABLE
definitions in a file someere (.open <filename>
style)Open questions
Should the new tool be in the
datafusion-contrib
organization or theapache
organization?The tradeoffs are that
datafusion-contrib
could move faster / has less governance overhead, but would also lose the apache communityI personally suggest we start with this tool in the
datafusion-contrib
organization and if there is interest we can discuss bringing it back to the apache organization.