Closed billylanchantin closed 1 month ago
There is another concern here related to exposing Explorer.Backend.LazyFrame or whatever we are going to call it. We already have to_lazy, and that returns something different. Maybe we should instead have Explorer.Query.new(df) and rename Explorer.Backend.LazyFrame to QueryFrame?
Then we change filter_with and friends to accept either an anonymous function or the result of Explorer.Query.new (which will be an Explorer.Backend.QueryFrame). This way we keep everything related to the Explorer.Query API?
If so, I'd do the following changes:
filter_with to accept it as an argument beyond functions
@josevalim I think you're on the right track. Let me clarify my goal.
I want to make functionality like Polars expressions more first-class. I want users to be able to directly create/manipulate an expression-like data structure, much like you can do with an Ecto.Queryable. And I want this data structure to replace the callbacks that are the current go-between for our verb/verb_with pairs.
We kind of have expressions already in the form of Backend.LazySeries. But our API is designed such that they're only an implementation detail. I want to either expose them or replace them.
I like the idea of an Explorer.Query.new or similar that returns this new data structure. But I'm not sure renaming Backend.LazyFrame as Backend.QueryFrame is quite right. I think they serve different purposes.
Here's my understanding of the concepts at play:
Polars | Explorer |
---|---|
DataFrame | DataFrame |
LazyFrame | LazyFrame |
Series | Series |
Expression | LazySeries |
All that said, I'm pretty open to other suggestions on how to make this work! As I found out in the other PR, it's a bit of a tricky needle to thread.
I think the table is missing one entry, which is that we need a QueryFrame that, when accessed, returns LazySeries. That's different from a lazy frame (the result of DF.to_lazy).
Ok thanks this is good food for thought.
I think I need to play with a few options. I want to make sure QueryFrames feel ergonomic. And I'm not quite sure how they interact with the other 4 concepts yet.
Once I have something I'll write it up. I'll also maybe move this to an issue. I'm realizing I'm still too much in the designing phase.
I think this PR is almost there. To get started, you could:
And if you change nothing else, it should be what you want. Everything else I mentioned is cleanup/refactoring. It is just that I tend to think bottom-up. :)
I did the naive thing but then I realized that this is now possible:
alias Explorer.{DataFrame, Query, Series}
df1 = DataFrame.new(a: [1, 2, 3])
df2 = DataFrame.new(b: [4, 5, 6])
qf1 = Query.new(df1)
qf2 = Query.new(df2)
c_lazy = Series.add(qf1["a"], qf2["b"])
DataFrame.mutate_with(df1, c: c_lazy)
** (RuntimeError) Polars Error: not found: b: 'with_columns' failed
(explorer 0.10.0-dev) lib/explorer/polars_backend/shared.ex:53: Explorer.PolarsBackend.Shared.apply_dataframe/4
(explorer 0.10.0-dev) lib/explorer/polars_backend/data_frame.ex:659: Explorer.PolarsBackend.DataFrame.mutate_with/3
(explorer 0.10.0-dev) lib/explorer/data_frame.ex:2983: Explorer.DataFrame.mutate_with/3
iex:13: (file)
I'm not sure this is what we want. We'll probably at least want better error handling.
Sorry for the delay!
Notes:
Renamed Explorer.Backend.LazyFrame as Explorer.Backend.QueryFrame as discussed.
Great @billylanchantin! I have added some feedback to the docs. In a nutshell, the implementation detail docs go on for too long, and we have no documentation on how and why to use the Explorer.Query.new feature. I would focus on the user-facing API and minimize the changes to the implementation details. :)
Ok I think that's better. Thanks for the review :)
Do we still like the LazyFrame -> QueryFrame change? Or shall I revert that?
My only remaining question is whether we want to rename LazySeries to QuerySeries, but I think they are not equivalent to QueryFrame (as in, you can actually use a LazySeries in most series operations, but you can't do that with a QueryFrame).
Yes that's my take too.
"Query" is good in QueryFrame because all you can do is access, aka query, it. LazySeries represent computations, so "query" feels wrong to me.
Description
This is step 1 of implementing the ideas from this PR:
From @josevalim:
Before, if you wanted to use filter_with/2 and friends, you had to write a callback. You can still do that. But now you can also pass the callback's output directly.
Changes
_with functions now accept the outputs of their callbacks too
Explorer.Query.new/1
Renamed Backend.LazyFrame as Backend.QueryFrame
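For illustration, the two call styles mentioned in the description might look roughly like the sketch below. It assumes an Explorer build that includes this PR's changes (Explorer.Query.new/1 returning a query frame whose columns are lazy series, and filter_with/2 accepting a lazy series directly), as discussed in this thread. The column name and the comparison are made up for the example, and this is not taken from the PR's own docs.

```elixir
# Sketch only: assumes the changes proposed in this PR are available
# (Explorer.Query.new/1 and `_with` functions accepting lazy series).
alias Explorer.{DataFrame, Query, Series}

df = DataFrame.new(a: [1, 2, 3])

# Existing style: the callback receives a frame whose columns are lazy series.
with_callback =
  DataFrame.filter_with(df, fn ldf -> Series.greater(ldf["a"], 1) end)

# Proposed style: build the lazy series up front from a query frame,
# then pass it where the callback used to go.
qf = Query.new(df)
with_query = DataFrame.filter_with(df, Series.greater(qf["a"], 1))

# Both results are expected to contain the same rows.
IO.inspect(DataFrame.to_columns(with_callback))
IO.inspect(DataFrame.to_columns(with_query))
```

As the earlier example with two data frames shows, a lazy series built from one frame's query frame only makes sense when passed back to a verb on that same frame.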