Open alamb opened 11 months ago
I'm willing to lend a hand. Are there any requirements? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've had (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.
This looks like a great idea!
I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content.
As @mesejo mentioned, they've been making great contributions to the DataFusion backend so there's currently some momentum that we can take advantage of.
I'll be up front about it: there's still a lot of work to do, the DataFusion backend is missing a lot of functionality.
The good news is that we've made it really easy to see what functionality is missing from any given backend using our backend support matrix app.
Anyone can take a pass at implementing the operations that have a 🚫 in the datafusion column. Some operations will be more challenging than others, and the ibis maintainers (@kszucs, @gforsyth, @jcrist and myself) are here to help.
What do you say we ... COALESCE 😉 around ibis as the DataFrame API for DataFusion?
I propose we leave the decision of where to take this project and what to focus on to whatever hero(s) step forward. What I think datafusion-python needs is someone to invest the time to drive it forward, and the path to take, as in all open source projects, would be largely influenced by the contributors.
I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content.
Thank you @cpcloud -- this is an excellent idea and it would be awesome to see the DataFusion ibis backend become more full featured.
I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.
What do you say we ... COALESCE 😉 around ibis as the DataFrame API for DataFusion?
That is one of the cleverest summaries I have seen in a long time. Nicely done.
I'm willing to lend a hand. Are there any requirements? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've had (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.
Thank you @mesejo -- that is great. Like many projects, I think what would be most valuable in this project is
Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.
just to throw out an idea related to this:
I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.
if we agree Ibis is a delightful dataframe API and we can close the gaps in the DataFusion backend, then you could avoid a lot of work in defining a new dataframe API by wrapping Ibis so that code looks like:
[ins] In [3]: t = datafusion.read_parquet("penguins.parquet")
[ins] In [4]: t
Out[4]:
DatabaseTable: _ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse
species string
island string
bill_length_mm float64
bill_depth_mm float64
flipper_length_mm int64
body_mass_g int64
sex string
year int64
[ins] In [5]: datafusion.options.interactive = True
[ins] In [6]: t
Out[6]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
│ Adelie  │ Torgersen │            nan │           nan │              NULL │        NULL │ NULL   │  2007 │
│ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female │  2007 │
│ Adelie  │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ male   │  2007 │
│ Adelie  │ Torgersen │           38.9 │          17.8 │               181 │        3625 │ female │  2007 │
│ Adelie  │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ male   │  2007 │
│ Adelie  │ Torgersen │           34.1 │          18.1 │               193 │        3475 │ NULL   │  2007 │
│ Adelie  │ Torgersen │           42.0 │          20.2 │               190 │        4250 │ NULL   │  2007 │
│ …       │ …         │              … │             … │                 … │           … │ …      │     … │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘
[ins] In [7]: t.group_by(["species", "island"]).agg(datafusion._.count())
Out[7]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ CountStar(_ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string    │ string    │ int64                                                    │
├───────────┼───────────┼──────────────────────────────────────────────────────────┤
│ Adelie    │ Biscoe    │                                                       44 │
│ Adelie    │ Torgersen │                                                       52 │
│ Adelie    │ Dream     │                                                       56 │
│ Chinstrap │ Dream     │                                                       68 │
│ Gentoo    │ Biscoe    │                                                      124 │
└───────────┴───────────┴──────────────────────────────────────────────────────────┘
[ins] In [8]: t.group_by(["species", "island"]).agg(datafusion._.count().name("count"))
Out[8]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species   ┃ island    ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string    │ string    │ int64 │
├───────────┼───────────┼───────┤
│ Adelie    │ Biscoe    │    44 │
│ Adelie    │ Torgersen │    52 │
│ Chinstrap │ Dream     │    68 │
│ Gentoo    │ Biscoe    │   124 │
│ Adelie    │ Dream     │    56 │
└───────────┴───────────┴───────┘
@alamb Thanks for the feedback
Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.
I opened a PR with a draft for the User Guide. While I was writing the guide, I noticed two issues that have a huge impact on the UX and are simple to solve:
For solving 1. we could follow the PyO3 guide and add information in .pyi files.
For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does, see this for example).
What are your thoughts?
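To make the wrapping idea concrete, here is a minimal sketch of the pattern. Everything in it is an assumption for illustration: `_NativeDataFrame` is a hypothetical stand-in for a compiled PyO3 class, and `DataFrame`, `limit` and `collect` are not meant to mirror the real datafusion-python signatures. The point is only that the pure-Python layer is where the docstrings and type hints live.

```python
from __future__ import annotations


class _NativeDataFrame:
    """Hypothetical stand-in for a compiled (PyO3) class with no docs."""

    def __init__(self, rows: list[dict]) -> None:
        self._rows = rows

    def limit(self, n):
        return _NativeDataFrame(self._rows[:n])

    def collect(self):
        return list(self._rows)


class DataFrame:
    """Pure-Python wrapper that carries the docstrings and type hints."""

    def __init__(self, inner: _NativeDataFrame) -> None:
        self._inner = inner

    def limit(self, n: int) -> DataFrame:
        """Return a new DataFrame with at most ``n`` rows."""
        # Delegate to the native object and re-wrap the result.
        return DataFrame(self._inner.limit(n))

    def collect(self) -> list[dict]:
        """Run the query and return the rows as a list of dicts."""
        return self._inner.collect()


df = DataFrame(_NativeDataFrame([{"a": 1}, {"a": 2}, {"a": 3}]))
print(df.limit(2).collect())
print(DataFrame.limit.__doc__)
```

With this layout, `help(DataFrame.limit)` and editor tooltips show real documentation, at the cost of one hand-written wrapper per native method.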
For solving 1. we could follow the PyO3 guide and add information in .pyi files.
That seems like a very good idea to me
For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does, see this for example).
I think this is likewise a great idea
Thank you @mesejo
Just a note that with manual .pyi files you have the endless problem of ensuring that the .pyi files and code match up correctly. Wrapping every Rust function in a pure-Python function works, as polars does, but it also incurs a ton of overhead (edit: development overhead, not runtime performance overhead). The long-term solution is for pyo3 to emit Python type files automatically, as wasm-bindgen does for TypeScript, but that's likely far off.
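One way to soften the drift problem is a CI check that compares the names declared in a stub against what the module actually exposes. The sketch below is only an illustration under stated assumptions: the stub text and the `SessionContext` class are made-up stand-ins rather than the real datafusion-python API, and the check looks at method names only, not signatures. mypy ships a more thorough tool, `stubtest`, for the same job.

```python
import ast

# Hypothetical stub contents; in practice this would be read from a .pyi file.
STUB_SOURCE = '''
class SessionContext:
    def sql(self, query: str) -> "DataFrame": ...
    def read_parquet(self, path: str) -> "DataFrame": ...
'''


class SessionContext:
    """Hypothetical stand-in for the class exported by the compiled module."""

    def sql(self, query): ...

    def read_csv(self, path): ...


def stub_methods(stub_source, class_name):
    """Collect the method names declared for class_name in the stub text."""
    for node in ast.parse(stub_source).body:
        if isinstance(node, ast.ClassDef) and node.name == class_name:
            return {n.name for n in node.body if isinstance(n, ast.FunctionDef)}
    return set()


def runtime_methods(cls):
    """Collect the public callables defined directly on the runtime class."""
    return {
        name
        for name, value in vars(cls).items()
        if callable(value) and not name.startswith("_")
    }


def stub_drift(stub_source, cls):
    """Report names present on one side but not the other."""
    declared = stub_methods(stub_source, cls.__name__)
    actual = runtime_methods(cls)
    return {
        "missing_at_runtime": sorted(declared - actual),
        "missing_in_stub": sorted(actual - declared),
    }


print(stub_drift(STUB_SOURCE, SessionContext))
```

A test that asserts both lists are empty would fail as soon as someone adds a Rust method without updating the stub, or the other way round.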
I'm late to this discussion (and new to this project in general), but the contributions I've been focused on over the past month or so have been aimed at closing some of the gaps I see as a heavy Python user with a Data Science / Engineering background. Particularly for ETL use cases, it needs to be easy to move and transform data between various formats and object stores, leveraging every available core to the maximum extent possible. Default options need to be well tuned, since most of these users imo won't give DataFusion a second look if they run their job and it is much slower than polars or whatever tool they currently use.
I haven't gotten to actually looking much at the Python interface yet, but it is on my list.
I am very much on board with the vision you describe @alamb.
Hi everyone. I'm happy to help out with this. I think it might be a good idea to get a sense for what people think this should ultimately look like as well as what features they think a good DataFrame library should have. To that end, I've started this issue which hopefully will help gather ideas and fodder for documentation https://github.com/apache/arrow-datafusion-python/issues/462
Folks! I've created some issues to tackle the missing functions in the Python bindings.
These are a perfect fit for a good first issue, so contributions are more than welcome. (@alamb perhaps we could label the issues as such and promote them on Twitter to increase the involvement with the project?)
Thanks @mesejo -- I marked the tickets as good first issue and posted a tweet: https://twitter.com/andrewlamb1111/status/1699827809462440353
Is there anyone currently adding pyi files for datafusion-python? I have experience in this area and I would like to be involved in this work
Thanks @woxiaosa -- I do not know of anyone currently adding pyi files
I am also late to this, but as I am trying to evaluate datafusion (Python) I can give some of my input. I am sure that folks know that the documentation is scant, with most API functions having no more than the method name and args (auto extracted from sphinx).
My next idea was to test the SQL vs native expression filtering. Got the SQL to work, but I cannot see how to use an 'and' expr/function. As this is a reserved word I saw no way to apply it or import it. Would the native expr filtering be faster than the equivalent SQL?
So yes, complete examples (showing all the imports, etc) for all the functions and expressions would be great. Hope this is useful
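On the 'and' question: Python does not let a class overload the `and` keyword, so expression libraries generally overload the `&` operator instead. The toy sketch below shows the mechanism. The `Expr` class and `col` helper here are hypothetical stand-ins written for this example, not the datafusion classes; I believe datafusion-python's expressions combine with `&` in the same way, but treat that as an assumption and check the API docs. It says nothing about whether native expressions are faster than SQL.

```python
class Expr:
    """Toy expression node; a hypothetical stand-in, not the datafusion class."""

    def __init__(self, text: str) -> None:
        self.text = text

    def __gt__(self, other: object) -> "Expr":
        return Expr(f"({self.text} > {_render(other)})")

    def __lt__(self, other: object) -> "Expr":
        return Expr(f"({self.text} < {_render(other)})")

    def __and__(self, other: "Expr") -> "Expr":
        # `&` can be overloaded, so it can build a combined expression.
        return Expr(f"({self.text} AND {other.text})")

    def __bool__(self) -> bool:
        # `and` calls bool() on its left operand; fail loudly rather than
        # silently dropping one side of the predicate.
        raise TypeError("use & instead of 'and' to combine expressions")

    def __repr__(self) -> str:
        return self.text


def _render(value: object) -> str:
    """Render a nested expression or a plain Python literal."""
    return value.text if isinstance(value, Expr) else repr(value)


def col(name: str) -> Expr:
    """Build a column reference."""
    return Expr(name)


# Parentheses are required because & binds tighter than > and <.
predicate = (col("a") > 1) & (col("b") < 10)
print(predicate)
```

Writing `(col("a") > 1) and (col("b") < 10)` with this toy class raises a `TypeError`, which is the behaviour you want: without the `__bool__` guard, `and` would quietly return only the right-hand expression.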
I think github got a little excited about closing this
This PR does not close https://github.com/apache/datafusion-python/issues/440 but it helps to address one part of it.
somebody at GitHub is going to use this as evidence for LLM-based issue closing instead of the current rules
What this project could be
I think this project needs someone who wants to make a world-class Python dataframe library and user experience to take the helm. I will argue why I think this is a compelling opportunity to build a great piece of technology and have a wide impact across the data analytics space:
What this project could be
I think this project could be one of the most widely used data analysis libraries out there. Imagine a system that offers BOTH a fast dataframe API (à la pola.rs) and first-class SQL support (à la duckdb), both screaming fast (due to all the effort that goes into https://github.com/apache/arrow-datafusion), as well as easy to plug into the ecosystem (arrow / parquet) and extensible (UDFs, UDAs, etc).
DataFusion already posts great benchmark numbers, and I will post the datafusion 28.0.0 benchmarks when we have them.
How is this different than the mission of DataFusion?
DataFusion is a great project but is currently focused on building the core analytic engine:
This repository contains basic python bindings, but the user experience (UX) could be improved in so many ways.
The opportunity
This would be a great opportunity for someone to: