JuliaData / JuliaDBMeta.jl

Metaprogramming tools for JuliaDB

Merge column-wise and row-wise macros #29

Open piever opened 6 years ago

piever commented 6 years ago

@nalimilan had a beautiful suggestion here: https://github.com/JuliaData/DataFrames.jl/issues/1514#issuecomment-423222435.

There may actually be very little need to have separate row-wise and column-wise macros. The row-wise macro could simply also accept columns (as regular vectors) with a different syntax.

For example, now if we need to filter values for which :SepalLength is greater than 5 in the dataset iris we'd do:

@where iris :SepalLength > 5

Whereas if we need to compare with something that requires the whole column, we'd need to switch to @where_vec and add a . for broadcasting:

@where_vec iris :SepalLength .> mean(:SepalLength)

The idea would be to find a syntax that lets us use only the row-wise macro while still being able to refer to whole columns (at macro expansion time the symbol is replaced with the corresponding column):

@where iris :SepalLength > mean($SepalLength)

This would be mostly non-breaking but at the same time would make column-wise macros redundant.
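To make the difference between the three forms concrete, here is a minimal sketch in plain Julia with no JuliaDB dependency. The table is assumed to be a NamedTuple of vectors with made-up values, and the last form spells out by hand what the proposed syntax might expand to (an assumption, not the actual macro output):

```julia
using Statistics

# Stand-in for the iris table: a NamedTuple of columns with made-up values.
iris = (SepalLength = [4.9, 5.1, 6.3, 5.0],)

# Row-wise predicate, in the spirit of `@where iris :SepalLength > 5`.
rowwise = [v > 5 for v in iris.SepalLength]

# Column-wise predicate, in the spirit of
# `@where_vec iris :SepalLength .> mean(:SepalLength)`.
colwise = iris.SepalLength .> mean(iris.SepalLength)

# Proposed mixed form: the column expression is evaluated once,
# then the comparison runs row by row.
m = mean(iris.SepalLength)
mixed = [v > m for v in iris.SepalLength]
```

Under these assumptions `mixed` and `colwise` select the same rows, which is why the column-wise macro would become redundant.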

I like the idea a lot but am unsure about the syntax. As of now in row-wise macros _ refers to the row, symbols refer to fields and cols(c) can be used to instruct the macro that c is a variable that evaluates to a symbol, so it should be replaced with the field (consistent with DataFramesMeta and StatPlots). In column-wise macros _ refers to the table and symbols correspond to columns, and cols(c) has the corresponding role.

What would be an extra syntax to use in row macros?

Candidates:

In the first two cases I'm also a bit confused about what one would do if the column is passed programmatically (by, say, c=:SepalLength)

nalimilan commented 6 years ago

I agree something like this would really help simplify the API (no need for two versions of each macro). $ and _I_ sound like better options than col to me (though _I_ is indeed ugly). Another possibility would be to wrap the full expressions in $ to indicate that they are to be substituted with a value evaluated in a different scope: $(mean(:SepalLength)).

piever commented 6 years ago

The "change to colwise scope" is a very nice solution as I don't have to double the notation.

@where iris :SepalLength > $(mean(:SepalLength))

makes it clear that I'm using one or more columns to compute a scalar.
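One way such a scope switch might work at macro expansion time is sketched below. This is a toy, not JuliaDBMeta's implementation: the table is assumed to be a NamedTuple of equal-length vectors, and `@where_rows`, `eachrow_nt`, `in_scope` and `rewrite` are hypothetical names introduced only for illustration. Symbols outside `$(...)` become row fields, while each `$(...)` is hoisted out of the row loop and evaluated once against the columns.

```julia
using Statistics

# Hypothetical helper: yields one NamedTuple per row of a NamedTuple of vectors.
eachrow_nt(cols) = (map(c -> c[i], cols) for i in eachindex(first(cols)))

# Replace every `:name` literal in `ex` with `getproperty(src, :name)`.
in_scope(ex, src) = ex
in_scope(ex::QuoteNode, src) = ex.value isa Symbol ? :(getproperty($src, $ex)) : ex
in_scope(ex::Expr, src) = Expr(ex.head, map(a -> in_scope(a, src), ex.args)...)

# Row scope everywhere, except inside `$(...)`, which is hoisted out of the
# row loop and resolved against the whole columns.
function rewrite(ex, row, cols, hoisted)
    if ex isa Expr && ex.head === :$
        var = gensym(:hoisted)
        push!(hoisted, :($var = $(in_scope(ex.args[1], cols))))
        return var
    elseif ex isa Expr
        return Expr(ex.head, map(a -> rewrite(a, row, cols, hoisted), ex.args)...)
    else
        return in_scope(ex, row)
    end
end

# Hypothetical row-wise filter macro accepting `$(...)` column expressions.
macro where_rows(table, ex)
    row, cols, keep = gensym(:row), gensym(:cols), gensym(:keep)
    hoisted = Expr[]
    body = rewrite(ex, row, cols, hoisted)
    esc(quote
        let $cols = $table
            $(hoisted...)                                  # evaluated once
            $keep = [$body for $row in eachrow_nt($cols)]  # evaluated per row
            map(col -> col[$keep], $cols)
        end
    end)
end

t = (a = [1.0, 2.0, 3.0, 6.0], b = ["w", "x", "y", "z"])
filtered = @where_rows t :a > $(mean(:a))
```

Here `mean(:a)` is computed a single time before the loop, so the per-row predicate only compares two scalars.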

The only thing that this is not covering is the so-called "window functions" (functions that take a column of length n and return a column of length n but are not element-wise) like lead, lag, rank, sort etc. Meaning, how would I write

@where_vec iris  :SepalLength .- lag(:SepalLength) .> 1

?

The only thing that comes to mind is that if the expression inside the dollar evaluates to a vector, then I should iterate on it, but I'm not fully convinced.

Do you have some suggestions for this as well? I may have to check how dplyr handles this.

nalimilan commented 6 years ago

Good point. Maybe iterating is a good rule. More precisely, you could broadcast operations, as if the full expression was wrapped in @.. Expressions inside $() would be protected from broadcasting.
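For comparison, this is how the existing `@.` macro in Base behaves on plain vectors: calls are dotted, except those spliced in with `$`, which run once on the whole array. The sketch below uses made-up values, and `lag` is a hypothetical helper (here padded with the first value rather than a missing value) standing in for a real window function:

```julia
using Statistics

x = [4.9, 5.1, 6.3, 5.0]

# Hypothetical window function: shift by one, padding with the first value.
lag(v) = [first(v); v[1:end-1]]

# `$mean(x)` is protected from dotting, so this is `x .> mean(x)`.
above_mean = @. x > $mean(x)

# `$lag(x)` returns a vector of the same length, which then takes part
# in the broadcast: this is `x .- lag(x) .> 1`.
big_jump = @. x - $lag(x) > 1
```

A scalar result and a vector result inside `$` are handled by the same rule, which is the property the proposal would borrow.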

dplyr's window functions are documented here: https://dplyr.tidyverse.org/articles/window-functions.html. AFAICT, everything is vectorized there, but in R there's no difference between > and .> so it doesn't matter (they mention automatic recycling).

piever commented 6 years ago

The similarity with @. (esp. "things inside $() are protected from broadcasting") is quite beautiful. It's also consistent with the fact that broadcast works already for tables:

julia> t = table((a = 1:10, b = rand(10)))
Table with 10 rows, 2 columns:
a   b
────────────
1   0.515873
2   0.930648
3   0.402888
4   0.801836
5   0.600595
6   0.801115
7   0.774909
8   0.731416
9   0.572505
10  0.371466

julia> f(row) = row.a*row.b
f (generic function with 2 methods)

julia> f.(t)
10-element Array{Float64,1}:
 0.5158734752863601
 1.8612967088054502
 1.2086637555023616
 3.2073426685165565
 3.0029751634912216
 4.80669118912654  
 5.424360649636144 
 5.851328616870312 
 5.1525439684036645
 3.714664066202098 

In terms of implementation, I'm still a bit confused. I almost want broadcast but there are two impediments:

Is there a simple way to get the iterator that would result from broadcasting without collecting it?

In terms of meaning, however, I'm not sure that I would want to iterate over something other than a vector (and I definitely do not want to get errors from the broadcasting machinery if things return a custom struct, for example), so I'm still not sure whether the broadcasting API is a better rule than "iterate if isa AbstractVector". What rule is used exactly for DataFrames to add a new column when the user passes a scalar? Say:

julia> df = DataFrame(x = 1:10);

julia> df.y = 3;
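The "iterate if isa AbstractVector" rule could be sketched with two methods, as below. `per_row` and `Threshold` are hypothetical names, and the data is made up; the point is only that a custom struct is repeated as a scalar instead of being handed to the broadcasting machinery:

```julia
using Statistics

# Hypothetical rule: vectors are iterated row by row (after a length check),
# anything else is treated as a scalar and repeated for each of the n rows.
per_row(x::AbstractVector, n) = (length(x) == n || throw(DimensionMismatch("expected $n rows")); x)
per_row(x, n) = Iterators.repeated(x, n)

x = [4.9, 5.1, 6.3, 5.0]
n = length(x)

# Scalar result of a column expression.
scalar_case = [xi > m for (xi, m) in zip(x, per_row(mean(x), n))]

# Vector result of a column expression (a shifted copy of the column).
vector_case = [xi > yi for (xi, yi) in zip(x, per_row(reverse(x), n))]

# A custom struct is simply repeated; no broadcast-related methods are needed.
struct Threshold
    value::Float64
end
custom_case = [xi > th.value for (xi, th) in zip(x, per_row(Threshold(5.0), n))]
```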

~Btw, this could even be the implementation: to create a new table with extra columns and if $(..) returns a scalar I use a FillArray for the new column.~ Maybe not, I'd still lose some performance as JuliaDB materializes the NamedTuple when iterating.

nalimilan commented 6 years ago

I didn't really mean that broadcast should actually be used, just that it's nice to be able to reason in a similar way, even if the implemented behavior is more restrictive.

> Is there a simple way to get the iterator that would result from broadcasting without collecting it?

I think that's the point of Broadcast.materialize to collect such iterators, but I'm not completely sure how it works.
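As a rough illustration of the distinction, `Base.Broadcast.broadcasted` builds a lazy `Broadcasted` object, and `Base.Broadcast.materialize` turns it into an array. The sketch below assumes that iterating the instantiated lazy object directly is acceptable for this use case; `count_true` is a hypothetical helper and the values are made up:

```julia
x = [4.9, 5.1, 6.3, 5.0]

# Lazy representation of `x .> 5`; nothing is computed or allocated yet.
# `instantiate` fixes the axes so that the object can be iterated.
lazy = Base.Broadcast.instantiate(Base.Broadcast.broadcasted(>, x, 5))

# Hypothetical consumer that walks the lazy object without collecting it.
function count_true(itr)
    c = 0
    for v in itr
        c += v
    end
    return c
end

n_hits = count_true(lazy)

# Collecting is a separate, explicit step.
collected = Base.Broadcast.materialize(lazy)
```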

Currently df.y = 3 creates a new column filled with 3, but it's not clear whether we want to keep it (this code predates the .= syntax).