[DISCUSSION] We need a Hero for datafusion-python

alamb commented 11 months ago

I think this project needs someone who wants to build a world-class Python dataframe library and user experience to take the helm. Below I argue why this is a compelling opportunity to build a great piece of technology and have a wide impact across the data analytics space.

What this project could be

I think this project could be one of the most widely used data analysis libraries out there. Imagine a system that offers BOTH a fast dataframe API (à la pola.rs) and first-class SQL support (à la DuckDB), both screaming fast (thanks to all the effort that goes into https://github.com/apache/arrow-datafusion), easy to plug into the ecosystem (Arrow / Parquet), and extensible (UDFs, UDAFs, etc.).

DataFusion already posts great benchmark numbers, and I will post the DataFusion 28.0.0 benchmarks when we have them.

How is this different than the mission of DataFusion?

DataFusion is a great project but is currently focused on building the core analytic engine:

DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in Rust, using the Apache Arrow in-memory format.

This repository contains basic python bindings, but the user experience (UX) could be improved in so many ways.

The opportunity

This would be a great opportunity for someone to:

Build some really cool technology
Learn how to help grow an open source project and community with help and guidance from the rest of the DataFusion community
Learn about analytic database technology, Arrow, etc
Influence the direction of development in DataFusion

mesejo commented 11 months ago

I'm willing to lend a hand 😄. Are there any requirements 😅 ? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've had (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.

cpcloud commented 11 months ago

This looks like a great idea!

I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content.

As @mesejo mentioned, they've been making great contributions to the DataFusion backend so there's currently some momentum that we can take advantage of.

I'll be up front about it: there's still a lot of work to do, and the DataFusion backend is missing a lot of functionality.

The good news is that we've made it really easy to see what functionality is missing from any given backend using our backend support matrix app.

Anyone can take a pass at implementing the operations that have a 🚫 in the datafusion column. Some operations will be more challenging than others, and the ibis maintainers (@kszucs, @gforsyth, @jcrist and myself) are here to help.

What do you say we ... COALESCE 😉 around ibis as the DataFrame API for DataFusion?

alamb commented 11 months ago

I propose we leave the decision of where to take this project and what to focus on to whatever hero(s) step forward. What I think datafusion-python needs is someone to invest the time to drive it forward, and the path to take, as in all open source projects, would be largely influenced by the contributors.

I'd like to propose that ibis is the library that solves the problem of bringing a delightful and expressive DataFrame API to DataFusion with no loss of SQL functionality. You can actually mix and match ibis expressions and SQL to your heart's content.

Thank you @cpcloud -- this is an excellent idea and it would be awesome to see the DataFusion ibis backend become more full featured.

I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.

What do you say we ... COALESCE 😉 around ibis as the DataFrame API for DataFusion?

That is one of the cleverest summaries I have seen in a long time. Nicely done 👏

alamb commented 11 months ago

I'm willing to lend a hand 😄. Are there any requirements 😅 ? Recently I've been expanding the coverage of DataFusion by ibis. In the past, I've had (minor) contributions to projects such as dask, xarray, geopandas, and eland. Let me know how I can help.

Thank you @mesejo -- that is great. Like many projects, I think what would be most valuable in this project is:

Reviewing PRs and encouraging more involvement
Ensuring the project is easy both to use and to contribute to, such as #432

Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.

lostmygithubaccount commented 11 months ago

just to throw out an idea related to this:

I think a very reasonable alternative reality is that "datafusion-python" remains a thin binding on top of datafusion, and the delightful user experience comes via ibis.

if we agree Ibis is a delightful dataframe API and we can close the gaps in the DataFusion backend, then you could avoid a lot of work in defining a new dataframe API by wrapping Ibis so that code looks like:

[ins] In [3]: t = datafusion.read_parquet("penguins.parquet")

[ins] In [4]: t
Out[4]:
DatabaseTable: _ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse
  species           string
  island            string
  bill_length_mm    float64
  bill_depth_mm     float64
  flipper_length_mm int64
  body_mass_g       int64
  sex               string
  year              int64

[ins] In [5]: datafusion.options.interactive = True

[ins] In [6]: t
Out[6]:
┏━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━┓
┃ species ┃ island    ┃ bill_length_mm ┃ bill_depth_mm ┃ flipper_length_mm ┃ body_mass_g ┃ sex    ┃ year  ┃
┡━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━┩
│ string  │ string    │ float64        │ float64       │ int64             │ int64       │ string │ int64 │
├─────────┼───────────┼────────────────┼───────────────┼───────────────────┼─────────────┼────────┼───────┤
│ Adelie  │ Torgersen │           39.1 │          18.7 │               181 │        3750 │ male   │  2007 │
│ Adelie  │ Torgersen │           39.5 │          17.4 │               186 │        3800 │ female │  2007 │
│ Adelie  │ Torgersen │           40.3 │          18.0 │               195 │        3250 │ female │  2007 │
│ Adelie  │ Torgersen │            nan │           nan │              NULL │        NULL │ NULL   │  2007 │
│ Adelie  │ Torgersen │           36.7 │          19.3 │               193 │        3450 │ female │  2007 │
│ Adelie  │ Torgersen │           39.3 │          20.6 │               190 │        3650 │ male   │  2007 │
│ Adelie  │ Torgersen │           38.9 │          17.8 │               181 │        3625 │ female │  2007 │
│ Adelie  │ Torgersen │           39.2 │          19.6 │               195 │        4675 │ male   │  2007 │
│ Adelie  │ Torgersen │           34.1 │          18.1 │               193 │        3475 │ NULL   │  2007 │
│ Adelie  │ Torgersen │           42.0 │          20.2 │               190 │        4250 │ NULL   │  2007 │
│ …       │ …         │              … │             … │                 … │           … │ …      │     … │
└─────────┴───────────┴────────────────┴───────────────┴───────────────────┴─────────────┴────────┴───────┘

[ins] In [7]: t.group_by(["species", "island"]).agg(datafusion._.count())
Out[7]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┓
┃ species   ┃ island    ┃ CountStar(_ibis_read_parquet_pnfkuttmizcmjk7trfkv5bhfse) ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┩
│ string    │ string    │ int64                                                    │
├───────────┼───────────┼──────────────────────────────────────────────────────────┤
│ Adelie    │ Biscoe    │                                                       44 │
│ Adelie    │ Torgersen │                                                       52 │
│ Adelie    │ Dream     │                                                       56 │
│ Chinstrap │ Dream     │                                                       68 │
│ Gentoo    │ Biscoe    │                                                      124 │
└───────────┴───────────┴──────────────────────────────────────────────────────────┘

[ins] In [8]: t.group_by(["species", "island"]).agg(datafusion._.count().name("count"))
Out[8]:
┏━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━┓
┃ species   ┃ island    ┃ count ┃
┡━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━┩
│ string    │ string    │ int64 │
├───────────┼───────────┼───────┤
│ Adelie    │ Biscoe    │    44 │
│ Adelie    │ Torgersen │    52 │
│ Chinstrap │ Dream     │    68 │
│ Gentoo    │ Biscoe    │   124 │
│ Adelie    │ Dream     │    56 │
└───────────┴───────────┴───────┘

mesejo commented 11 months ago

@alamb Thanks for the feedback

Maybe you have time to pretend you are a first-time user and figure out what is not clear or where the rough edges are. Ideally you could turn that experience into a guide to help others.

I opened a PR with a draft of the User Guide 😃. While I was writing the guide, I noticed two issues that have a huge impact on the UX and are simple to solve:

1. The IDE cannot provide hints (or autocompletion) because there is no typing information.
2. There are no examples of how to use each method (or function).

For solving 1. we could follow the PyO3 guide and add information in .pyi files.
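To make the .pyi idea concrete, here is a minimal sketch of what such a stub could contain. The file path, the selection of methods, and the simplified signatures are assumptions for illustration and are not copied from the actual bindings. Self-references are quoted only so the snippet also runs as plain Python; a real stub file would not need the quotes.

```python
# datafusion/__init__.pyi  (hypothetical path; signatures are simplified assumptions)
from typing import Any


class DataFrame:
    # Bodies are "..." because a stub only declares types, it implements nothing.
    def select(self, *exprs: Any) -> "DataFrame": ...
    def limit(self, count: int) -> "DataFrame": ...
    def collect(self) -> list[Any]: ...


class SessionContext:
    def __init__(self) -> None: ...
    def sql(self, query: str) -> DataFrame: ...
```

With a stub like this next to the compiled module, an IDE or type checker can infer that `ctx.sql("...")` returns a `DataFrame` and offer completions for its methods.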

For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does, see this for example).
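A rough sketch of the wrapping approach, assuming a compiled class that the Python layer delegates to. `_RawDataFrame` below is a hypothetical stand-in written in Python so the snippet runs on its own; in the real project that role would be played by the Rust-backed class.

```python
class _RawDataFrame:
    """Hypothetical stand-in for the compiled (Rust) class, for illustration only."""

    def __init__(self, rows):
        self._rows = list(rows)

    def limit(self, count):
        return _RawDataFrame(self._rows[:count])

    def collect(self):
        return list(self._rows)


class DataFrame:
    """Pure-Python wrapper that carries the type hints and docstrings."""

    def __init__(self, raw: _RawDataFrame) -> None:
        self._raw = raw

    def limit(self, count: int) -> "DataFrame":
        """Return a new DataFrame with at most ``count`` rows.

        Example: ``df.limit(2).collect()``
        """
        # Delegate to the compiled object and re-wrap the result.
        return DataFrame(self._raw.limit(count))

    def collect(self) -> list:
        """Execute the plan and return the resulting rows."""
        return self._raw.collect()


df = DataFrame(_RawDataFrame([1, 2, 3]))
print(df.limit(2).collect())
```

The docstrings live in ordinary Python source, so Sphinx and `help()` pick them up without any extra tooling; the cost is one hand-written wrapper per method.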

What are your thoughts?

alamb commented 10 months ago

For solving 1. we could follow the PyO3 guide and add information in .pyi files.

That seems like a very good idea to me

For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does, see this for example).

I think this is likewise a great idea

Thank you @mesejo

kylebarron commented 10 months ago

Just a note that with manual .pyi files you have the endless problem of ensuring that the .pyi files and the code match up correctly. Wrapping every Rust function in a pure-Python function, as polars does, works but also incurs a ton of overhead (edit: development overhead, not runtime performance overhead). The long-term solution is for PyO3 to emit Python type files automatically, as wasm-bindgen does for TypeScript, but that's likely far off.

devinjdangelo commented 10 months ago

I’m late to this discussion (and new to this project in general), but the contributions I’ve been focused on over the past month or so have been aimed at solving some of the gaps I see as a heavy Python user with a Data Science / Engineering background. Particularly for ETL use cases, it needs to be easy to move and transform data between various formats and object stores, leveraging every available core to the maximum extent possible. Default options need to be well tuned, since most of these users imo won’t give DataFusion a second look if they run their job and it is much slower than polars or XYZ tool they use currently.

I haven’t gotten to actually looking much at the python interface yet, but it is on my list.

I am very much on board with the vision you describe @alamb.

magarick commented 10 months ago

Hi everyone. I'm happy to help out with this. I think it might be a good idea to get a sense for what people think this should ultimately look like as well as what features they think a good DataFrame library should have. To that end, I've started this issue which hopefully will help gather ideas and fodder for documentation https://github.com/apache/arrow-datafusion-python/issues/462

mesejo commented 9 months ago

Folks! I've created some issues to tackle the missing functions in the Python bindings.

These are a perfect fit for a good first issue, so contributions are more than welcome. (@alamb perhaps we could label the issues as such and promote them on Twitter to increase the involvement with the project?)

alamb commented 9 months ago

Thanks @mesejo -- I marked the tickets as good first issue and posted a tweet: https://twitter.com/andrewlamb1111/status/1699827809462440353

woxiaosa commented 5 months ago

For solving 1. we could follow the PyO3 guide and add information in .pyi files.

That seems like a very good idea to me

For 2. one alternative is to wrap the methods and functions in Python and add docstrings to them (similar to what polars does, see this for example).

I think this is likewise a great idea

Thank you @mesejo

Is there anyone currently adding pyi files for datafusion-python? I have experience in this area and I would like to be involved in this work

alamb commented 5 months ago

Is there anyone currently adding pyi files for datafusion-python? I have experience in this area and I would like to be involved in this work

Thanks @woxiaosa -- I do not know of anyone currently adding pyi files

dlr2 commented 5 months ago

I am also late to this, but as I am trying to evaluate datafusion (Python) I can give some of my input. I am sure that folks know that the documentation is scant, with most API functions having no more than the method name and args (auto-extracted by Sphinx).

My next idea was to test the SQL vs native expression filtering. Got the SQL to work, but I cannot see how to use an 'and' expr/function. As this is a reserved word I saw no way to apply it or import it. Would the native expr filtering be faster than the equivalent SQL?
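On the 'and' question: since `and` is a reserved word in Python and cannot be overloaded, expression libraries commonly overload the `&` operator through `__and__` instead. The sketch below shows that general pattern with a toy `Expr` class and `col` helper; both are hypothetical stand-ins and not the datafusion-python classes, so whether the bindings expose the same operator should be checked against their API.

```python
class Expr:
    """Toy expression node that only builds a SQL-like string (hypothetical)."""

    def __init__(self, sql: str) -> None:
        self.sql = sql

    def __gt__(self, other) -> "Expr":
        return Expr(f"{self.sql} > {other!r}")

    def __lt__(self, other) -> "Expr":
        return Expr(f"{self.sql} < {other!r}")

    def __and__(self, other: "Expr") -> "Expr":
        # `&` stands in for the keyword `and`, which Python does not let us overload.
        return Expr(f"({self.sql}) AND ({other.sql})")


def col(name: str) -> Expr:
    """Reference a column by name."""
    return Expr(name)


# Parentheses are required: `&` binds more tightly than `>` and `<` in Python.
predicate = (col("body_mass_g") > 4000) & (col("year") < 2009)
print(predicate.sql)
```

Forgetting the parentheses around each comparison is the usual pitfall with this style, because Python would otherwise evaluate `4000 & col("year")` first.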

So yes, complete examples (showing all the imports, etc.) for all the functions and expressions would be great. Hope this is useful.

alamb commented 1 month ago

I think github got a little excited about closing this

lostmygithubaccount commented 1 month ago

This PR does not close https://github.com/apache/datafusion-python/issues/440 but it helps to address one part of it.

somebody at GitHub is going to use this as evidence for LLM-based issue closing instead of the current rules
