apache / datafusion-python

Apache DataFusion Python Bindings
https://datafusion.apache.org/python
Apache License 2.0

User Stories for Interface / Feature Design and Documentation #462

Open magarick opened 11 months ago

magarick commented 11 months ago

As discussed in the rust arrow chat, this is a place to chart a way forward by collecting examples of both what people do in other libraries and what they want to do but can't do easily with current tools. In addition to clarifying Python interface requirements, I hope it provides fodder for lower-level functions, encourages knowledgeable folks to explain how to do things (which can then get documented), and clarifies what is important to what types of people and why. There's a daunting amount of work, but the opportunity and potential are tremendous.

For contributors, I'd like to structure this roughly as follows:

OK, now it's my turn to start. My background is in statistics, machine learning, and data science. Most of what I do is focused on modeling and analyzing data, though I've done a good bit of pipeline and data processing/cleaning too. As such, I place a premium on in-memory interactive work (notebooks, rmarkdown, etc.). Partly because of this, I think a lot of tools mistake verboseness for clarity, and striving for conciseness often helps readability rather than hurting it if done correctly.

I have the most experience with R's data.table, but I've also used dplyr, polars, pandas, and Julia. So here's a sampling of a few things I would want in a dataframe library, in no particular order.

  1. Here's an example in data.table that shows features I think are both good and bad.
    > dt1 = data.table(t = 1:5, v = 5:1)
    > dt2 = data.table(start = c(1, 4), end = c(3, 10), x = c("a", "b"))
    > dt1[dt2, x := i.x, on = .(t >= start, t < end)]
    > dt1
       t v    x
    1: 1 5    a
    2: 2 4    a
    3: 3 3 <NA>
    4: 4 2    b
    5: 5 1    b

First, I find non-equi joins, especially range joins, incredibly useful. They're common in SQL, but a lot of dataframe libraries don't have them. data.table also makes it easy to update in place with the := operator, which can be used to create new columns as well as update existing ones. As I understand it, Arrow strives for immutability, but at the same time, it won't make copies of the whole frame if it doesn't have to, so maybe this is less of an issue. However, I do like the idea of using a join like this to explicitly tag/annotate another table.
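For contrast, here's a sketch of the same range join in Python using pandas, which has no native non-equi join; the common cross-join-then-filter workaround below illustrates exactly the gap being described (column names mirror the data.table example above):

```python
import pandas as pd

dt1 = pd.DataFrame({"t": [1, 2, 3, 4, 5], "v": [5, 4, 3, 2, 1]})
dt2 = pd.DataFrame({"start": [1, 4], "end": [3, 10], "x": ["a", "b"]})

# pandas has no non-equi join, so we cross-join every row of dt1 with
# every row of dt2 and then filter -- O(n*m) in memory, unlike a real
# range join.
merged = dt1.merge(dt2, how="cross")
matched = merged[(merged["t"] >= merged["start"]) & (merged["t"] < merged["end"])]

# Left-join the matched tags back onto dt1, keeping unmatched rows as NaN.
result = dt1.merge(matched[["t", "x"]], on="t", how="left")
```

`result` reproduces the data.table output: rows with t in [1, 3) get "a", rows in [4, 10) get "b", and t = 3 stays null.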

  2. Reshaping data. These functions transform data between "wide" and "long" (sometimes known as "tidy") formats. Sometimes they go by cast/melt or pivot, and there's even a simple transpose function in a lot of packages. It doesn't seem common in database-world, but to me reshaping in-memory data is important for a lot of use cases.
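As a quick illustration of the wide/long round trip, here's a minimal sketch in pandas (the column names are invented for the example):

```python
import pandas as pd

wide = pd.DataFrame({"id": [1, 2], "jan": [10, 30], "feb": [20, 40]})

# wide -> long ("melt"): one row per (id, month) observation
long = wide.melt(id_vars="id", var_name="month", value_name="sales")

# long -> wide ("pivot"): back to one column per month
back = long.pivot(index="id", columns="month", values="sales").reset_index()
```

The long form is what most plotting and grouped-aggregation APIs want; the wide form is what humans tend to read and what many modeling inputs look like.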

  3. Rolling groupbys. Both Pandas and Polars have pretty good support for creating overlapping groups and aggregating over them. These are commonly used for time series analysis. I also think the ability to define groups not just by a number of rows, but by a potentially variable-width lookback (like, at most 1 month before the current date) is useful. Polars does a pretty good job at this, and I think Pandas might too.
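The variable-width lookback idea can be sketched with pandas' offset-based rolling windows, where the window is a time span rather than a row count (the "30D" width and the dates are just example values):

```python
import pandas as pd

ts = pd.DataFrame(
    {"value": [1.0, 2.0, 3.0, 4.0]},
    index=pd.to_datetime(["2023-01-01", "2023-01-05", "2023-01-20", "2023-02-01"]),
)

# Each window covers the preceding 30 days, so window sizes vary by row:
# densely spaced rows get bigger windows, sparse rows smaller ones.
rolled = ts.rolling("30D").sum()
```

Here the 2023-01-20 row sums three observations (everything since 2022-12-21), while the 2023-02-01 row sums only the three rows that fall within its own 30-day lookback.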

Alright, that's a few to get started and this is long enough as it is. I'm looking forward to seeing what everyone thinks is important, their thoughts on good DataFrame API design, and what is and isn't currently possible in DataFusion.

lostmygithubaccount commented 11 months ago

Hello,

My name is Cody and I'm a Technical Product Manager at Voltron Data. My background is in electrical engineering and data science. For nearly a decade I've used Python and gone through a journey of using pandas, PySpark, Dask, and more recently being interested in DataFusion/Polars. While I've never had a real data engineering or data science job, I've done work in those areas and worked with many engineers in them. I've written tons of examples with these frameworks.

Full disclosure: I primarily work on Ibis, and Voltron Data is invested in its success.

My worry here is that you and the DataFusion community are going to go through the struggles that every new Python dataframe library does, and start facing the same type of questions -- is it groupby or group_by? to_csv or write_csv? The list goes on. I've seen the fragmentation of the Python data community over the years and would be far more excited to work on a standard API that supports many backends (Ibis) than bringing another Python dataframe library to the table.

The good news, if going in this direction, is that a lot of the work is done! DataFusion in Ibis is fairly functional (thanks to lots of community contributions), though there still is a ways to go before it's up to par with the DuckDB backend. Here's a quick example notebook: https://gist.github.com/lostmygithubaccount/73414943a71ffb471605c3132eda0727

We'd love to have more collaboration on Ibis for the DataFusion backend if that'd be an interesting direction to you and others. Ibis was created by Wes McKinney (creator of pandas) and has taken an opinionated stance on most issues I suspect you'll face with a new dataframe library. Plus, it takes heavy inspiration from R and other previous tools! Let us know if this would be interesting to you.

It's already integrated with visualization frameworks (Altair, Plotly, Streamlit -- any that support the __dataframe__ protocol natively, and any others through to_pandas()) and ML frameworks (scikit-learn, XGBoost, more in this area coming soon).

One more edit -- just saw your comment on Polars about the POSE grant -- this is something we are in the process of applying for. While right now the core maintainers of Ibis are all employed by Voltron Data, it is not the company's intent to own the project -- we want the same model as Apache Arrow eventually. We're working on a governance structure and increasing involvement from other companies. Hope this is helpful!

magarick commented 11 months ago

Hi Cody! Thanks for your interest in this. I've seen a little bit of Ibis and it looks interesting. I'm also not sure improving Ibis support and making a better "native" API are conflicting goals.

> My worry here is that you and the DataFusion community are going to go through the struggles that every new Python dataframe library does, and start facing the same type of questions -- is it groupby or group_by? to_csv or write_csv? The list goes on. I've seen the fragmentation of the Python data community over the years and would be far more excited to work on a standard API that supports many backends (Ibis) than bringing another Python dataframe library to the table.

These differences, at least as you've described them here, seem more like a mild annoyance than a struggle to me. As long as there's reasonable documentation, I've never found slightly different names to be nearly as big a barrier as identical or similarly named things behaving differently, or differing capabilities across libraries.

> We'd love to have more collaboration on Ibis for the DataFusion backend if that'd be an interesting direction to you and others. Ibis was created by Wes McKinney (creator of pandas) and taken an opinionated stance on most issues I suspect you'll face with a new dataframe library. Plus, it takes heavy inspiration from R and other previous tools! Let us know if this would be interesting to you.

I'm not opposed to this at all, especially if Ibis can provide a consistent API while still exposing the full power of each underlying library. At some point, though, it seems like you'll encounter differences that preclude a uniform interface or require a specialized API for a unique feature Ibis doesn't support. However, I can see the appeal if you have people who occasionally use a large number of backends or are trying to build something that can interact with multiple systems. So I'd be surprised if there weren't value to both a native interface that exposed all of a tool's power and a universal interface since they seem to be solving different problems. If I'm wrong about Ibis' goals and capabilities, please do correct me though.

> It's already integrated with visualization frameworks (Altair, Plotly, Streamlit -- any that support the __dataframe__ protocol natively, and any others through to_pandas()) and ML frameworks (scikit-learn, XGBoost, more in this area coming soon).

Glad that you brought this up. What's the relationship between Ibis and the Python dataframe standards protocol?

gforsyth commented 11 months ago

> What's the relationship between Ibis and the Python dataframe standards protocol?

Ibis is one of many voices at that particular table. We have robust support for the dataframe interchange protocol (__dataframe__) and we're interested in adopting the dataframe standard compat stuff, the __dataframe_consortium_standard__ -- although at the moment, the latter is designed with an eye towards eager computation and it isn't clear how it will play with lazy / deferred computation.
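For concreteness, here's a minimal sketch of the interchange protocol being discussed, using pandas (which implements __dataframe__ as of 1.5) as the producer; any consumer library can convert the data without knowing which library produced it:

```python
import pandas as pd
from pandas.api.interchange import from_dataframe

# Producer: any library implementing the protocol exposes __dataframe__().
df = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
xdf = df.__dataframe__()

# Consumer: convert a protocol-supporting object into a pandas frame
# without depending on the producer's API.
roundtrip = from_dataframe(df)
```

The interchange object (`xdf`) exposes library-agnostic metadata like row and column counts, which is what lets Altair, Plotly, and friends accept "any dataframe" rather than hard-coding pandas.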

lostmygithubaccount commented 11 months ago

Edit: this article by Wes McKinney (and the linked paper) will explain far better than I can on the overall vision!

> I'm also not sure improving Ibis support and making a better "native" API are conflicting goals.

I don't disagree, but I think they are duplicative efforts. My overall concern here is that the DataFusion project will spend a lot of time and energy re-hashing a lot of these discussions about what a great Python dataframe library should be, and this is bad for the ecosystem overall.

Ibis's primary innovation is portability, decoupling the dataframe API from the execution engine. While this might not be a big deal for an individual developer who can learn pandas & Polars & Snowpark & PySpark APIs, it represents a major source of siloing and duplication of effort across the ecosystem. I've heard many cases of data scientists "throwing pandas code over the wall" for data/ML engineers to rewrite in PySpark. With the Ibis project (and other open-source projects) we (the Voltron Data we) are hoping to create a more modular and composable ecosystem like this:

[diagram: a modular, composable stack where user-facing dataframe APIs are decoupled from interchangeable execution engines]

I do think it's totally valid for DataFusion to pursue its own Python dataframe API, I just think it would be better to spend effort improving DataFusion as an execution engine and leveraging all the work already done in Ibis for the user-facing API!

Very fair concern about how Ibis achieves what it claims -- we're actually in the process of moving our documentation to Quarto, and I'd love your feedback on the "Why Ibis?" page and the "Backends" concept page that explains it. In short, SQL dialects and the tooling for them (sqlalchemy, sqlglot, etc.) are close enough and manageable enough behind the stable API Ibis exposes. For the most part, Python expressions are compiled to SQL.
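To make the "Python expressions are compiled to SQL" idea concrete, here is a deliberately tiny toy -- not Ibis's actual implementation, and every name in it is invented -- showing how a deferred expression can be built up by method chaining and only rendered to SQL text at the end:

```python
# Toy sketch of deferred expression -> SQL compilation. Real systems
# (Ibis, sqlglot) build a proper AST and handle dialects; this just
# shows the separation between building an expression and executing it.
class Table:
    def __init__(self, name, columns):
        self.name = name
        self.columns = columns
        self._filters = []          # accumulated WHERE predicates
        self._selected = list(columns)

    def filter(self, predicate):    # predicate is a raw SQL string here
        self._filters.append(predicate)
        return self                 # chainable, nothing executes yet

    def select(self, *cols):
        self._selected = list(cols)
        return self

    def compile(self):
        # Only at compile time is any SQL text produced.
        sql = f"SELECT {', '.join(self._selected)} FROM {self.name}"
        if self._filters:
            sql += " WHERE " + " AND ".join(self._filters)
        return sql

t = Table("events", ["user_id", "ts"])
query = t.filter("ts > '2023-01-01'").select("user_id").compile()
# query == "SELECT user_id FROM events WHERE ts > '2023-01-01'"
```

The key design point is that nothing touches an engine until `compile()` (or an execute step) runs, which is also why lazy/deferred semantics complicate the eager-leaning dataframe standard mentioned above.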