elixir-explorer / explorer

Series (one-dimensional) and dataframes (two-dimensional) for fast and elegant data exploration in Elixir
https://hexdocs.pm/explorer
MIT License

De-functionalize query internals #989

Closed billylanchantin closed 1 month ago

billylanchantin commented 2 months ago

Description

This is step 1 of implementing the ideas from this PR:

From @josevalim:

We can make query not return a function and that should be a relatively small change. We could even make it so accessing a lazy dataframe returns lazy series, so we can build queries without writing S.col. The semantics are well defined internally, we just chose to not expose them.

Before, if you wanted to use filter_with/2 and friends, you had to write a callback. You can still do that. But now you can also do:

alias Explorer.{DataFrame, Query, Series}

df = DataFrame.new(a: [1, 2, 3])
qf = Query.new(df)

gt_1 = Series.greater(qf["a"], 1)
lt_3 = Series.less(qf["a"], 3)

df
|> DataFrame.filter_with(gt_1)
|> DataFrame.to_columns(atom_keys: true)
#=> %{a: [2, 3]}

df
|> DataFrame.filter_with(lt_3)
|> DataFrame.to_columns(atom_keys: true)
#=> %{a: [1, 2]}

df
|> DataFrame.filter_with(Series.and(gt_1, lt_3))
|> DataFrame.to_columns(atom_keys: true)
#=> %{a: [2]}
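For comparison, here is a minimal sketch of the callback form that the description says still works, applied to the first filter. The `Mix.install` line is an assumption added only so the snippet runs as a standalone script; in a project you would list `:explorer` in your deps instead.

```elixir
# Only needed when running this as a standalone script.
Mix.install([:explorer])

alias Explorer.{DataFrame, Series}

df = DataFrame.new(a: [1, 2, 3])

# The callback receives a lazy view of the dataframe. Accessing a column on
# it yields a lazy series that describes the comparison instead of running it.
result =
  df
  |> DataFrame.filter_with(fn ldf -> Series.greater(ldf["a"], 1) end)
  |> DataFrame.to_columns(atom_keys: true)

IO.inspect(result)
#=> %{a: [2, 3]}
```

The difference with the new form is only where the lazy series gets built: inside the callback here, ahead of time from `Query.new(df)` in the examples above.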

Changes

josevalim commented 2 months ago

There is another concern here related to exposing Explorer.Backends.LazyFrame or whatever we are going to call it. We already have to_lazy, and that returns something different. Maybe we should instead have Explorer.Query.new(df) and rename Explorer.Backends.LazyFrame to QueryFrame?

Then we change filter_with and friends to accept either an anonymous function or the result of an Explorer.Query.new (which will be an Explorer.Backends.QueryFrame). This way we keep everything related to the Explorer.Query API?

If so, I'd make the following changes:

billylanchantin commented 2 months ago

@josevalim I think you're on the right track. Let me clarify my goal.

I want to make functionality like Polars expressions more first-class. I want users to be able to directly create and manipulate an expression-like data structure, much like you can with an Ecto.Queryable. And I want this data structure to replace the callbacks that are the current go-between for our verb/verb_with pairs.

We kind of have expressions already in the form of Backend.LazySeries. But our API is designed such that they're only an implementation detail. I want to either expose them or replace them.

I like the idea of an Explorer.Query.new or similar that returns this new data structure. But I'm not sure renaming Backend.LazyFrame as Backend.QueryFrame is quite right. I think they serve different purposes.

Here's my understanding of the concepts at play:

| Polars     | Explorer   |
| ---------- | ---------- |
| DataFrame  | DataFrame  |
| LazyFrame  | LazyFrame  |
| Series     | Series     |
| Expression | LazySeries |

All that said, I'm pretty open to other suggestions on how to make this work! As I found out in the other PR, it's a bit of a tricky needle to thread.

josevalim commented 2 months ago

I think the table is missing one entry, which is that we need a QueryFrame that, when accessed, returns LazySeries. That’s different from a lazy frame (the result of DF.to_lazy).
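To make the distinction concrete, here is a rough, self-contained sketch of a frame whose only capability is being accessed, with each access returning a description of a column rather than its data. `Sketch.QueryFrame` and `Sketch.LazySeries` are hypothetical stand-ins written for illustration; they are not Explorer's modules and carry none of its backend logic.

```elixir
defmodule Sketch.LazySeries do
  # Hypothetical stand-in for a lazy series: an operation name plus its
  # arguments, with no data attached.
  defstruct [:op, :args]
end

defmodule Sketch.QueryFrame do
  # Hypothetical stand-in for the proposed QueryFrame: it knows the column
  # names only, never the column data.
  @behaviour Access

  defstruct [:names]

  def new(names) when is_list(names), do: %__MODULE__{names: names}

  # Accessing a known column returns a lazy node that refers to it.
  @impl Access
  def fetch(%__MODULE__{names: names}, name) when is_binary(name) do
    if name in names do
      {:ok, %Sketch.LazySeries{op: :column, args: [name]}}
    else
      :error
    end
  end

  # The frame can only be queried, so the write-oriented callbacks raise.
  @impl Access
  def get_and_update(_qf, _key, _fun),
    do: raise(ArgumentError, "a query frame is read-only")

  @impl Access
  def pop(_qf, _key),
    do: raise(ArgumentError, "a query frame is read-only")
end

qf = Sketch.QueryFrame.new(["a"])

# Returns a Sketch.LazySeries describing column "a", not a list of values.
IO.inspect(qf["a"])

# Unknown columns fall back to the Access default.
IO.inspect(qf["missing"])
#=> nil
```

A lazy frame from `DF.to_lazy`, by contrast, is something you keep applying dataframe verbs to; this sketch deliberately supports nothing but access.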

billylanchantin commented 2 months ago

Ok thanks this is good food for thought.

I think I need to play with a few options. I want to make sure QueryFrames feel ergonomic. And I'm not quite sure how they interact with the other 4 concepts yet.

Once I have something I'll write it up. I'll also maybe move this to an issue. I'm realizing I'm still too much in the designing phase.

josevalim commented 2 months ago

I think this PR is almost there. To get started, you could:

  1. Add Explorer.Query.new as an alias to LazyFrame.new
  2. Add new clauses to the _with functions
  3. Add the access behavior

And if you change nothing else, it should be what you want. Everything else I mentioned is cleanup/refactoring. It is just that I tend to think bottom-up. :)
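Step 2 of the list above can be sketched as one extra function clause. This is a simplified illustration under loud assumptions: the "dataframe" is a plain map of column name to list, an "expression" is a tuple such as `{:greater, {:col, "a"}, 1}`, and `Sketch.Verbs` is a hypothetical module. None of this is Explorer's actual implementation.

```elixir
defmodule Sketch.Verbs do
  # Callback clause (existing style): build the expression by calling the
  # function with a column accessor, then delegate to the expression clause.
  def filter_with(df, fun) when is_function(fun, 1) do
    filter_with(df, fun.(fn name -> {:col, name} end))
  end

  # New clause: accept an expression that was built ahead of time.
  def filter_with(df, {_op, _left, _right} = expr) do
    mask = eval(df, expr)

    Map.new(df, fn {name, values} ->
      # Keep only the values whose mask entry is `true`.
      {name, for({value, true} <- Enum.zip(values, mask), do: value)}
    end)
  end

  # A tiny evaluator covering just the two comparisons used in this thread.
  defp eval(df, {:greater, {:col, name}, n}),
    do: Enum.map(Map.fetch!(df, name), &(&1 > n))

  defp eval(df, {:less, {:col, name}, n}),
    do: Enum.map(Map.fetch!(df, name), &(&1 < n))
end

df = %{"a" => [1, 2, 3]}

from_callback = Sketch.Verbs.filter_with(df, fn col -> {:greater, col.("a"), 1} end)
from_expr = Sketch.Verbs.filter_with(df, {:greater, {:col, "a"}, 1})

IO.inspect(from_callback == from_expr)
#=> true
```

Because the callback clause delegates to the expression clause, both entry points share one code path, which is why the change can stay small.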

billylanchantin commented 1 month ago

I did the naive thing but then I realized that this is now possible:

alias Explorer.{DataFrame, Query, Series}

df1 = DataFrame.new(a: [1, 2, 3])
df2 = DataFrame.new(b: [4, 5, 6])

qf1 = Query.new(df1)
qf2 = Query.new(df2)

c_lazy = Series.add(qf1["a"], qf2["b"])

DataFrame.mutate_with(df1, c: c_lazy)
** (RuntimeError) Polars Error: not found: b: 'with_columns' failed
    (explorer 0.10.0-dev) lib/explorer/polars_backend/shared.ex:53: Explorer.PolarsBackend.Shared.apply_dataframe/4
    (explorer 0.10.0-dev) lib/explorer/polars_backend/data_frame.ex:659: Explorer.PolarsBackend.DataFrame.mutate_with/3
    (explorer 0.10.0-dev) lib/explorer/data_frame.ex:2983: Explorer.DataFrame.mutate_with/3
    iex:13: (file)

I'm not sure this is what we want. We'll probably at least want better error handling.
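One possible shape for that error handling, sketched under assumptions: walk the expression, collect the column names it references, and compare them with the target dataframe's columns before anything reaches the backend. The tuple-based expression format and the `Sketch.Validate` module are hypothetical and exist only for this illustration; Explorer's lazy series are structured differently.

```elixir
defmodule Sketch.Validate do
  # Hypothetical expression shape: {:col, name} leaves, {op, left, right}
  # nodes, and anything else treated as a literal.
  def columns({:col, name}), do: [name]
  def columns({_op, left, right}), do: columns(left) ++ columns(right)
  def columns(_literal), do: []

  # Raises a descriptive error when the expression mentions a column that
  # the target dataframe does not have.
  def check!(df_names, expr) do
    case Enum.uniq(columns(expr)) -- df_names do
      [] ->
        :ok

      missing ->
        raise ArgumentError,
              "expression references columns not in the dataframe: " <>
                inspect(missing) <> ". Available columns: " <> inspect(df_names)
    end
  end
end

# Mirrors the failing case above: column "a" from one frame and column "b"
# from another, applied to a frame that only has "a".
expr = {:add, {:col, "a"}, {:col, "b"}}

try do
  Sketch.Validate.check!(["a"], expr)
rescue
  e in ArgumentError -> IO.puts(Exception.message(e))
end
```

A check like this would turn the Polars "not found" failure into an error raised on the Elixir side, naming the offending column up front.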

billylanchantin commented 1 month ago

Sorry for the delay!

Notes:

josevalim commented 1 month ago

Great @billylanchantin! I have added some feedback to the docs. In a nutshell, the implementation detail docs are going on for too long and we have no documentation on how and why to use the Explorer.Query.new feature. I would focus on the user-facing API and minimize the changes to the implementation details. :)

billylanchantin commented 1 month ago

Ok I think that's better. Thanks for the review :)

Do we still like the LazyFrame -> QueryFrame change? Or shall I revert that?

billylanchantin commented 1 month ago

My only remaining question is whether we want to rename LazySeries to QuerySeries, but I think they are not equivalent to QueryFrame (as in, you can actually use a LazySeries in most series operations, but you can't do that with a QueryFrame).

Yes that's my take too.

"Query" is good in QueryFrame because all you can do is access, aka query, it. LazySeries represent computations, so "query" feels wrong to me.