apache / datafusion

Apache DataFusion SQL Query Engine
https://datafusion.apache.org/
Apache License 2.0
6.37k stars 1.2k forks source link

Proposal: Create `dfdb`, a new CLI different than `datafusion-cli` with pre-built integrations #11979

Closed alamb closed 3 weeks ago

alamb commented 3 months ago

TLDR

Problem Statement

As of today, datafusion-cli (docs) serves two roles:

  1. A debugging / testing tool for the DataFusion query engine developers
  2. A CLI tool for actually doing useful processing if files (locally and remotely using object store), similar to the duckdb CLI tool

It is really sweet to have a CLI that lets you query a directory of parquet files

DataFusion CLI v41.0.0
>
> select "WatchID", "EventDate", "URL" from hits_partitioned limit 10;
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| WatchID             | EventDate | URL                                                                                                  |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
| 6904841588848398438 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
...
| 7551542980199423249 | 15895     | 687474703a2f2f736d6573686172696b692e72752f6d616e756661637475726572363437                             |
+---------------------+-----------+------------------------------------------------------------------------------------------------------+
10 row(s) fetched.
Elapsed 0.059 seconds.

However, similarly to the discussion with have had with datafusion-pytyhon this dual role leads to a tension between keeping the core lean and easier to embed (e.g. fewer dependencies) and making a better CLI experience

Examples of Friction

I have recently seen some PRs that are basically integrations that would make datafusion-cli a better end user tool, but bring more dependencies and complexity to DataFusion. For example

  1. Hugging face https://github.com/apache/datafusion/pull/10792 from @xinlifoobar
  2. FlightSQL: https://github.com/apache/datafusion/pull/11938 from @ccciudatu

I realize I have been partly responsible for this mess and for that I apologize.

Proposal

I propose resolving this conflict by creating a new repository for the "CLI tool people actually use"

We would keep datafusion-cli as it is, a relatively small and a thin wrapper around the core engine. I don't think we should remove features but we also wouldn't add them (other than what was added to the engine by default)

We would add many new features / capabilitues to this dfdb tool

Examples of new features

There are several obvious examples of integrations that would be super useful for users of a CLI tool but not appropriate for the datafusion repo (due to circular dependencies, for example):

Other cool features

There are many other cool features I have dreamed about adding to a CLI that might be more appropriate in a separate repo. Some ideas to inspire:

  1. Local catalog support (imagine if you could store your CREATE EXTERNAL TABLE definitions in a file someere (.open <filename> style)
  2. Local parquet metadata cache (imagine being able to cache the parquet metadata for 100s of files in object store in some sort of persistence format so future queries were fast)
  3. SQL auto completion,
  4. etc.

Open questions

Should the new tool be in the datafusion-contrib organization or the apache organization?

The tradeoffs are that datafusion-contrib could move faster / has less governance overhead, but would also lose the apache community

I personally suggest we start with this tool in the datafusion-contrib organization and if there is interest we can discuss bringing it back to the apache organization.

edmondop commented 3 months ago

I had a related problems a couple of weeks ago, I wanted to distribute a CLI that had some additional udfs bundled with it.

I wonder if we need to set up a plugin system for the CLI. I am happy to make a design proposal about it @alamb

alamb commented 3 months ago

I wonder if we need to set up a plugin system for the CLI. I am happy to make a design proposal about it @alamb

That would be awesome

Another thing that comes to mind might be for you to fork this datafusion-cli code and bundle your UDFs with it.

Or we could publish most of the code for this CLI to crates.io with the idea that people could make their own CLI mashups with a little bit of configuration setup. The ideas are endless!

matthewmturner commented 3 months ago

Love this idea. I had been working on something similar in the past but unfortunately life got in the way so wasnt able to push it as far as I wanted.

I am currently building a new terminal app with DataFusion at the core, but it is more domain specific than general purpose like I think the new proposed tool is expected to be (and if i wasnt working on that, I would probably volunteer to push this forward). That being said, and depending on the direction of this new tool, there could be some potential overlap / common functionality with what im working on. If thats the case then I would be very happy to collaborate on those pieces. I hope to open source what I'm working on in the fall around the time of v0.1 release. I'll keep my eye on this as it progresses though.

milenkovicm commented 3 months ago

Making generic plugin system would be great, but may be a bit of a challenge with lack of stable rust ABI.

Making some kind of CliContextBuilder where all udfs, table factories ... could be registered may be easier to do. Maybe existing CliSessionContext could be extended adding one liner methods for most common extensions

andygrove commented 3 months ago

I think this is a great idea.

westonpace commented 3 months ago

I don't know if others are interested / if this matches the need but it would be cool to make a "CLI-driven query engine frontend" that isn't coupled to any particular query engine (e.g. it could send queries as either SQL or substrait, potentially via something like adbc).

In other words, it could be something like squirrel / dbeaver for analytics.

phillipleblanc commented 3 months ago

I think this is a great idea. Happy to do the work to integrate https://github.com/datafusion-contrib/datafusion-table-providers into it, I think it would be a cool use-case to have.

edmondop commented 3 months ago

Will do some homework. Maybe we only need a datafusion-cli repo that one can fork stand-alone to create a proprietary CLI distribution and the plugin system is effectively an overkill

How should we go about dfdb? Do we think we should have "flavors", maybe like the following?

Or should we aim at having a single "flavor" only?

jayzhan211 commented 3 months ago

I don't know if others are interested / if this matches the need but it would be cool to make a "CLI-driven query engine frontend" that isn't coupled to any particular query engine (e.g. it could send queries as either SQL or substrait, potentially via something like adbc).

In other words, it could be something like squirrel / dbeaver for analytics.

I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.

alamb commented 3 months ago

I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.

This sounds like "Ibis" for CLIs and I think is a neat idea

However, I am personally not likely to work on such a thing -- I am far more interested in showing off / using the DataFusion execution engine more broadly, in part as I think it will drive more use and thus more contributions back to DataFusion

So therefore I am not opposed to creating some sort of pluggable backend query engine system, but I think I would like to focus on the DataFusion one.

I am thinking what I will try to do if no one beats me to it is to sketch out what a dfdb repo might look like

  1. basically copy the datafusion-cli's code / docs into its own repo
  2. polish up the code / APIs perhaps to make it clearer how to register custom functions, table providers, etc
samuelcolvin commented 3 months ago

I like the idea to have a cli frontend that is query engine agnostic (datafusion, duckdb), table agnostic (iceberg, delta), file agnostic (parquet, lance), and even more.

Surely this should just be a CLI that speaks arrow flight SQL? I've tried to persuade ClickHouse to adopt arrow flight SQL to get around their woeful Python client.

That sounds interesting, but I agree with @alamb that that's a different question.


I like the broad idea here, while I like datafusion-cli, it definitely feels incomplete.

Where I disagree (I think) with @alamb is around what it lacks — my main problem with datafusion-cli is the "UI" — all the small parts of the best CLIs which make the difference:

There's one more thing that I've wanted from a database CLI many times and nothing (AFAIK) supports:

Anyway, that's my wish list, as you might guess I've thought about this a fair bit!


One last thing I'll say — virtually all of the above features would be independent of the query engine being used, so there is a good argument to build one great CLI user experience (or let someone else build it) and make it pluggable into any query engine that speaks Arrow Flight SQL, whether that be over a network, or just as an API within a single binary or between dynamically linked libaries.

findepi commented 3 months ago

One last thing I'll say — virtually all of the above features would be independent of the query engine being used, so there is a good argument to build one great CLI user experience (or let someone else build it)

this seems to be the goal of the https://github.com/xo/usql project. I am not saying DF CLI shouldn't try to address those goals, just acknowledging it's both ambitions and potentially not central to the project (DF considers itself a query engine).

alamb commented 3 months ago

Where I disagree (I think) with @alamb is around what it lacks — my main problem with datafusion-cli is the "UI" — all the small parts of the best CLIs which make the difference:

indeed, if the goal is to make a duckdb replacement this would be a pretty great list for someone to work on

However, as my personal goal is not (at least not yet 😆 ) to make a duckdb replacement, so I am not likely going to able to focus on this list. I will definitely steal it as "wishlist for dfdb" however 🚀

zeroshade commented 3 months ago

Surely this should just be a CLI that speaks arrow flight SQL? I've tried to persuade ClickHouse to adopt arrow flight SQL to get around their woeful Python client.

Another possibility might be to leverage ADBC (which has a flight SQL driver). DuckDB implements the ADBC interface, so between that and the other available drivers for ADBC, it might be more beneficial than just using flight SQL?

korowa commented 3 months ago

Love this idea. I had been working on something similar in the past but unfortunately life got in the way so wasnt able to push it as far as I wanted.

Just want to highlight the existing tool authored by @matthewmturner one more time -- perhaps there is no need for another terminal app as the existing one (datafusion-tui) seems to fit the goal (independent contrib app with the focus on the UI / additional plugins), and it just needs more attention/efforts.

alamb commented 3 months ago

Just want to highlight the existing tool authored by @matthewmturner one more time -- perhaps there is no need for another terminal app as the existing one (datafusion-tui) seems to fit the goal (independent contrib app with the focus on the UI / additional plugins), and it just needs more attention/efforts.

I agree -- @matthewmturner what do you think about rekindling the 🔥 around dft (the name is sufficiently short that I like it!)? https://github.com/datafusion-contrib/datafusion-tui

I am going to be on vacation next week so may have some time to play around with this idea -- perhaps I can make some PRs to that repo?

matthewmturner commented 3 months ago

I would definitely be happy to have work pick back up there. I do believe that it's original goal is generally aligned with the above comments / wish list.

Based on my current work on the other TUI I am working on, which has similar functionality, I think theres some cleanup that will need to be done (updating dependencies and perhaps refactoring the event loop / handler ). I would be happy to drive the efforts on that to get everything back up to date (as well as cleanup the issue backlog).

Separately, I have some code in my current project that helps converting record batches to a Table format that ratatui understands and I think it would also be helpful here. I had plans to eventually turn this into a sub crate but it probably makes more sense in datafusion-contrib - perhaps in a new repo like datafusion-tui-tools so that other projects which want to build TUIs with datafusion can benefit (i guess right now thats just my separate project and dft but maybe this will inspire / enable more domain specific ones which would be cool).

One other note, I had initially made dft compatible with both datafusion and ballista - I think we'll need to decide whether thats still needed. perhaps having a nice client for ballista could help drive more interest in it? @andygrove maybe you have thoughts.

@alamb it would certainly be good to get your view on the current state of the project / any PRs / new issues. An approach that might be practical in the short run is if I focus on getting the core app structure / dependencies updated and to the extent you or anyone else has time if we could get datafusion / arrow / object store updated (its still on the old datafusion-contrib object store).

I'm currently on vacation but would be able to start on this next week if this is ultimately the direction that we decide to go.

alamb commented 3 months ago

@alamb it would certainly be good to get your view on the current state of the project / any PRs / new issues. An approach that might be practical in the short run is if I focus on getting the core app structure / dependencies updated and to the extent you or anyone else has time if we could get datafusion / arrow / object store updated (its still on the old datafusion-contrib object store).

Sounds good - I'll try and update the dependencies as a way to get familiar with the project. This is going to be great

I spent some time poking around with it and it has some quite cool UI -- I think what I am most passionate about is setting up the integrations with iceberg-rs, s3, delta-rs, etc. I am not very good at UIs / interactions

As long as dft can be used to script things (as in run a file of sql commands) I think it will be good for me.

matthewmturner commented 3 months ago

As long as dft can be used to script things (as in run a file of sql commands) I think it will be good for me.

Yes, it can already do that.

I am definitely also passionate about the integrations point (i had some ideas for that in mind from the beginning such as delta, flightsql, and excel.

backkem commented 3 months ago

I think this is a great idea. Happy to do the work to integrate https://github.com/datafusion-contrib/datafusion-table-providers into it, I think it would be a cool use-case to have.

I'd like to highlight the idea of having table providers in datafusion-table-providers and integrating them in datafusion-tui from there. This way the providers are also readily available for those embedding DataFusion as a library, which aligns closely with its intended use.

alamb commented 3 months ago

I'd like to highlight the idea of having table providers in datafusion-table-providers and integrating them in datafusion-tui from there. This way the providers are also readily available for those embedding DataFusion as a library, which aligns closely with its intended use.

I agree this sounds like the ideal setup to me -- having them in a separate repo would help keep the boundaries clear as well

matthewmturner commented 3 months ago

@alamb When did you plan on starting to work on this? On my flight home i managed to get a good chunk through a clean rewrite leveraging the setup from my other app which is a much better setup for moving forward. I still have some more work to do but maybe by end of weekend it would be done.

I also may try to pause my other work for a few weeks to focus on dft to see if it can get in a decent shape for the upcoming NYC meetup.

matthewmturner commented 3 months ago

This is the branch with the rewrite.

alamb commented 3 months ago

@alamb When did you plan on starting to work on this?

I keep telling myself "tomorrow" but then I end up getting carried away reviewing all the other good stuff going on (eg.. https://github.com/apache/datafusion/pull/12044 https://github.com/apache/datafusion/pull/12095 etc 🤣 )

THis week I am on vacation, so I have a bit more time for some fun projects (at least that is what I am telling myself)

On my flight home i managed to get a good chunk through a clean rewrite leveraging the setup from my other app which is a much better setup for moving forward. I still have some more work to do but maybe by end of weekend it would be done.

Nice! I'll go check it out now

matthewmturner commented 3 months ago

@alamb i still need to add back the query history / context info and i am improving the ergonomics of navigating query results now.

matthewmturner commented 3 months ago

For anyone interested, i have been on a bit of a fast and furious dev sprint making updates to datafusion-tui. ive basically done a clean rewrite to modernize it and ive gotten the following features to work

I plan on adding one more feature (new tab for storing query execution results / stats) and then will pause feature development to focus on cleanup / usability improvements / testing / docs / etc.

If anyone wants to help test it would be appreciated else i just keep plugging away :)

if you want all those features you can run cargo r --features=flightsql,s3,deltalake

samuelcolvin commented 3 months ago

Somewhat related to a great CLI experience, "leading FROM" duckdb syntax would make any autocomplete much more useful — https://github.com/sqlparser-rs/sqlparser-rs/issues/1400.

matthewmturner commented 3 months ago

@samuelcolvin autocomplete and syntax highlighting are features i would like to support - but i dont expect to get around to them in the short term

alamb commented 2 months ago

I think dft https://github.com/datafusion-contrib/datafusion-dft seems to be serving this purpose well

alamb commented 3 weeks ago

I think dft is pretty close, so I am claiming this is done