Defunctionalize `%LazySeries{}`

billylanchantin commented 4 months ago

Overview

This PR is an experiment. I wanted to see how hard it would be to remove the callbacks in our lazy implementations. It turned out to be more work than I expected.

Example

The gist can be found in test/explorer/new_approach_test.exs. Here's the "mutate" test from that file:

df0 = DF.new(a: [1, 2, 3])

# Mutate inline
df1 = DF.mutate(df0, b: a + 3)

# Mutate by building a lazy series with functions
b = S.col("a") |> S.add(3)
df2 = DF.mutate_with(df0, b: b)

# Mutate by building a lazy series with a macro
b = Query.new(a + 3)
df3 = DF.mutate_with(df0, b: b)

for df <- [df1, df2, df3] do
  assert DF.to_columns(df, atom_keys: true) == %{a: [1, 2, 3], b: [4, 5, 6]}
end

The Good, the Bad, and the Ugly

The Good

It removes the need for callbacks, which can be inflexible and hard to debug:

# Old
DF.filter_with(df, fn ldf -> S.greater(ldf["a"], 2) end)
# New
DF.filter_with(df, S.greater(S.col("a"), 2))

Now you can pass lazy-backed serires to functions for further manipulation like you do with Ecto queries.
No more unhygenic macros.

The Bad

You cannot know a lazy-backed series has valid dtypes until it's executed.
Explorer.Backend.LazySeries is now a "pseudo-backend", but it's not part of the Explorer.Series behaviour. It feels awkward.

The Ugly

:unknown dtype. :unknown dtype everywhere.
Massively, massively breaking.

Final Thoughts

This PR is covered in hacks and TODOs. There are a few types of tests that won't pass until I refactor some error-handling. And there are some conflicts with main.

But this PR is close enough that I wanted some more eyes on it. It's time to see if the idea has merit.

josevalim commented 4 months ago

Personally, I consider this to be something more like dynamic in Ecto. I wouldn't promote it as the default way of building queries, rather as an alternative. In this case, I would say it is ok to request the dtype and have something like S.col("a", :string). Or even S.string("a") to make it a bit more convenient, so we always have the dtype available. Do you think these would be fair compromises?

billylanchantin commented 4 months ago

I really like :dtype as an arg to col. Same with the S.dtype(name) API, didn't even think of that!

But having it be required might be cumbersome. For one, the macros would be awkward:

normalized = Query.new((a - mean(a)) / standard_deviation(a))
# vs.
normalized = Query.new((col("a", {:f, 64}) - mean(col("a", {:f, 64}))) / standard_deviation(col("a", {:f, 64})))
# or
a = S.col("a", {:f, 64})
normalized = Query.new((^a - mean(^a)) / standard_deviation(^a))

For another, I don't know how much it buys us. True, it prevents me from building a bad lazy-series in the first place:

# works
S.col("string_column") |> S.add(5)
# raises
S.col("string_column", dtype: :string) |> S.add(5)

But the bad column would never be executed because the first thing we'd do is validate the schema:

df = DF.new(string_column: ["a", "b", "c"])
bad_sum = S.col("string_column") |> S.add(5)
# fails dtype validation
DF.mutate(sum: bad_sum)

We should definitely still have that option. But the question is: is it required?

josevalim commented 4 months ago

In my mind we would keep both modes. The S.col/2 would be a convenience that you reach for in certain cases only. because, even if the dtype was not required, S.col is not cleaner than Explorer.Query, given we can use operators, across, cond, and more in queries.

billylanchantin commented 4 months ago

In my mind we would keep both modes.

Absolutely. The real benefit of the approach IMO is the return value of Explorer.Query.query/2. Before it returned an anonymous function. Now it returns a data structure that users can introspect/manipulate. It makes the Series functions properly "monadic" over eager/lazy backings.

The S.col aspect is a sorta side benefit. Since you're now working with lazy structures, you can also make them out of band instead of having to inline everything.

josevalim commented 4 months ago

I may be missing something then.

We can make query not return a function and that should be a relatively small change. We could even make it so accessing a lazy dataframe returns lazy series, so we can build queries without writing S.col. The semantics are well defined internally, we just chose to not expose them.

To me, the big change in this PR is all about :unknown, and if we don't need to do go down that route, I would recommend separating these discussions.

billylanchantin commented 4 months ago

No I don't think you're missing anything!

If any version of this PR is to go forward, it will have to be broken up into manageable pieces. Altering the return of Query.query is a great place to start since it's the core change that sets up the other stuff.

I think the S.col/1 discussion is worthwhile to have later. I've found it to be convenient to work with in Polars and potentially worth the tradeoff. But TBH, I don't think I'd want to even consider tackling it for real without a refactor of our dtype handling. Tacking on K.or(dtype == :unknown) 100's of times is not tenable :) And I think there are wins to be had with a refactor regardless of what we do after.

To clarify though: altering the return of Query.query is the most breaking aspect. No *_with verbs will work anymore since today they expect callbacks.

josevalim commented 4 months ago

It should be easy to extend the _with callbacks to deal with both functions and direct arguments by using guards. So we should start there.

elixir-explorer / explorer