Closed billylanchantin closed 2 months ago
Personally, I consider this to be something more like dynamic
in Ecto. I wouldn't promote it as the default way of building queries, rather as an alternative. In this case, I would say it is ok to request the dtype and have something like S.col("a", :string)
. Or even S.string("a")
to make it a bit more convenient, so we always have the dtype available. Do you think these would be fair compromises?
I really like :dtype
as an arg to col
. Same with the S.dtype(name)
API, didn't even think of that!
But having it be required might be cumbersome. For one, the macros would be awkward:
normalized = Query.new((a - mean(a)) / standard_deviation(a))
# vs.
normalized = Query.new((col("a", {:f, 64}) - mean(col("a", {:f, 64}))) / standard_deviation(col("a", {:f, 64})))
# or
a = S.col("a", {:f, 64})
normalized = Query.new((^a - mean(^a)) / standard_deviation(^a))
For another, I don't know how much it buys us. True, it prevents me from building a bad lazy-series in the first place:
# works
S.col("string_column") |> S.add(5)
# raises
S.col("string_column", dtype: :string) |> S.add(5)
But the bad column would never be executed because the first thing we'd do is validate the schema:
df = DF.new(string_column: ["a", "b", "c"])
bad_sum = S.col("string_column") |> S.add(5)
# fails dtype validation
DF.mutate(sum: bad_sum)
We should definitely still have that option. But the question is: is it required?
In my mind we would keep both modes. The S.col/2
would be a convenience that you reach for in certain cases only. because, even if the dtype was not required, S.col is not cleaner than Explorer.Query, given we can use operators, across, cond, and more in queries.
In my mind we would keep both modes.
Absolutely. The real benefit of the approach IMO is the return value of Explorer.Query.query/2
. Before it returned an anonymous function. Now it returns a data structure that users can introspect/manipulate. It makes the Series
functions properly "monadic" over eager/lazy backings.
The S.col
aspect is a sorta side benefit. Since you're now working with lazy structures, you can also make them out of band instead of having to inline everything.
I may be missing something then.
We can make query
not return a function and that should be a relatively small change. We could even make it so accessing a lazy dataframe returns lazy series, so we can build queries without writing S.col. The semantics are well defined internally, we just chose to not expose them.
To me, the big change in this PR is all about :unknown
, and if we don't need to do go down that route, I would recommend separating these discussions.
No I don't think you're missing anything!
If any version of this PR is to go forward, it will have to be broken up into manageable pieces. Altering the return of Query.query
is a great place to start since it's the core change that sets up the other stuff.
I think the S.col/1
discussion is worthwhile to have later. I've found it to be convenient to work with in Polars and potentially worth the tradeoff. But TBH, I don't think I'd want to even consider tackling it for real without a refactor of our dtype handling. Tacking on K.or(dtype == :unknown)
100's of times is not tenable :) And I think there are wins to be had with a refactor regardless of what we do after.
To clarify though: altering the return of Query.query
is the most breaking aspect. No *_with
verbs will work anymore since today they expect callbacks.
It should be easy to extend the _with callbacks to deal with both functions and direct arguments by using guards. So we should start there.
Overview
This PR is an experiment. I wanted to see how hard it would be to remove the callbacks in our lazy implementations. It turned out to be more work than I expected.
Example
The gist can be found in
test/explorer/new_approach_test.exs
. Here's the"mutate"
test from that file:The Good, the Bad, and the Ugly
The Good
The Bad
Explorer.Backend.LazySeries
is now a "pseudo-backend", but it's not part of theExplorer.Series
behaviour. It feels awkward.The Ugly
:unknown
dtype.:unknown
dtype everywhere.Final Thoughts
This PR is covered in hacks and TODOs. There are a few types of tests that won't pass until I refactor some error-handling. And there are some conflicts with
main
.But this PR is close enough that I wanted some more eyes on it. It's time to see if the idea has merit.