Merge column-wise and row-wise macros

piever commented 6 years ago

@nalimilan had a beautiful suggestion here: https://github.com/JuliaData/DataFrames.jl/issues/1514#issuecomment-423222435.

There may actually be very little need to have separate row-wise and column-wise macros. The row-wise macro could simply also accept columns (as regular vectors) with a different syntax.

For example, to keep the rows of the iris dataset for which :SepalLength is greater than 5, we currently write:

@where iris :SepalLength > 5

Whereas if we need to compare with something that requires the whole column, we need to switch to @where_vec and add a . for broadcasting:

@where_vec iris :SepalLength .> mean(:SepalLength)

The idea would be to find a syntax so that we'd only use the row-wise macro but find a way to refer to columns (at macro expand time the symbol is replaced with the corresponding column):

@where iris :SepalLength > mean($SepalLength)

This would be mostly non-breaking but at the same time would make column-wise macros redundant.
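To make the intended semantics concrete, here is a minimal sketch in plain Julia, without any macro. The toy `iris` table (a NamedTuple of vectors) and the variable names are assumptions made for illustration, not JuliaDBMeta code. The column reference is resolved once, outside the row loop, while the bare symbol is resolved for each row:

```julia
using Statistics  # standard library, provides `mean`

# Toy stand-in for the `iris` table: a NamedTuple of column vectors (assumption).
iris = (SepalLength = [4.9, 5.1, 6.3, 5.8, 4.7],
        Species     = ["a", "a", "b", "b", "a"])

# `mean($SepalLength)` would see the whole column and be computed once.
m = mean(iris.SepalLength)

# `:SepalLength > ...` would be evaluated row by row.
keep = [i for i in eachindex(iris.SepalLength) if iris.SepalLength[i] > m]

# Apply the row selection to every column.
filtered = map(col -> col[keep], iris)
```

Here `filtered` is again a NamedTuple of vectors, containing only the rows whose sepal length exceeds the column mean.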

I like the idea a lot but am unsure about the syntax. As of now, in row-wise macros _ refers to the row, symbols refer to fields, and cols(c) can be used to tell the macro that c is a variable that evaluates to a symbol and so should be replaced with the corresponding field (consistent with DataFramesMeta and StatPlots). In column-wise macros _ refers to the table, symbols correspond to columns, and cols(c) has the corresponding role.

What would be an extra syntax to use in row macros?

Candidates:

$SepalLength
col(:SepalLength), but this could be too confusing given cols
Some sort of dot overloading, like _I_.SepalLength where _I_ would be replaced by a table like object with dot overloading to extract columns? It does look a bit ugly though.

In the first two cases I'm also a bit unsure how one would proceed if the column is passed programmatically (by, say, c=:SepalLength).

nalimilan commented 6 years ago

I agree something like this would really help simplify the API (no need for two versions of each macro). $ and _I_ sound like better options than col to me (though _I_ is indeed ugly). Another possibility would be to wrap the full expressions in $ to indicate that they are to be substituted with a value evaluated in a different scope: $(mean(:SepalLength)).

piever commented 6 years ago

The "change to colwise scope" is a very nice solution as I don't have to double the notation.

@where iris :SepalLength > $(mean(:SepalLength))

makes it clear that I'm using one or more columns to compute a scalar.
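As a rough illustration of what "switching to column-wise scope" could mean at macro-expansion time, the sketch below walks an expression and hoists every `$(...)` block out of the row-level code. The helpers `split_scopes!` and `in_columns`, and the variable names `row` and `tbl`, are hypothetical and only show the idea; this is not how JuliaDBMeta is implemented, and it does not handle `cols(c)`.

```julia
# Inside `$(...)`: rewrite `:name` into `tbl.name` (a whole column).
function in_columns(ex)
    if ex isa QuoteNode && ex.value isa Symbol
        return Expr(:., :tbl, ex)
    elseif ex isa Expr
        return Expr(ex.head, map(in_columns, ex.args)...)
    else
        return ex
    end
end

# Outside `$(...)`: rewrite `:name` into `row.name` (a single field).
# Each `$(...)` block is replaced by a fresh variable and recorded in
# `hoisted`, so that it can be evaluated once, before iterating over rows.
function split_scopes!(hoisted, ex)
    if ex isa QuoteNode && ex.value isa Symbol
        return Expr(:., :row, ex)
    elseif ex isa Expr && ex.head === :$
        name = gensym("col")
        push!(hoisted, name => in_columns(ex.args[1]))
        return name
    elseif ex isa Expr
        return Expr(ex.head, map(a -> split_scopes!(hoisted, a), ex.args)...)
    else
        return ex
    end
end

# The expression a macro would receive as its argument.
ex = Meta.parse(raw":SepalLength > $(mean(:SepalLength))")

hoisted = Pair{Symbol,Any}[]
body = split_scopes!(hoisted, ex)
```

After this pass, `hoisted` holds one entry whose expression is `mean(tbl.SepalLength)`, and `body` compares `row.SepalLength` with the generated variable, so the mean would be computed only once.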

The only thing this does not cover is the so-called "window functions" (functions that take a column of length n and return a column of length n but are not element-wise) like lead, lag, rank, sort, etc. Meaning, how would I write

@where_vec iris  :SepalLength .- lag(:SepalLength) .> 1

?

The only thing that comes to mind is that if the expression inside the dollar evaluates to a vector, then I should iterate over it, but I'm not fully convinced.

Do you have some suggestions for this as well? I may have to check how dplyr handles this.

nalimilan commented 6 years ago

Good point. Maybe iterating is a good rule. More precisely, you could broadcast operations, as if the full expression was wrapped in @.. Expressions inside $() would be protected from broadcasting.
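For reference, this is how Base's `@.` already behaves on plain vectors: a call prefixed with `$` is left undotted, while everything else is broadcast. The `lag` function below is a simplified stand-in written for this example (it pads with `NaN`), not the function of any particular package.

```julia
using Statistics  # standard library, provides `mean`

x = [5.0, 7.0, 7.5, 9.0]

# Simplified, hypothetical lag: shift by one position, pad with NaN.
lag(v) = vcat(NaN, v[1:end-1])

# Window function: `$lag(x)` is called once on the whole vector and the
# resulting vector takes part in the broadcast; same as x .- lag(x) .> 1.
mask_window = @. x - $lag(x) > 1

# Scalar summary: `$mean(x)` is called once and its result is reused for
# every element; same as x .> mean(x).
mask_scalar = @. x > $mean(x)
```

The same rule covers both cases discussed above: a protected call that returns a vector is iterated together with the rows, and one that returns a scalar is reused for every row.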

dplyr's window functions are documented here: https://dplyr.tidyverse.org/articles/window-functions.html AFAICT, everything is vectorized there, but in R there's no difference between > and .> so it doesn't matter (they mention automatic recycling).

piever commented 6 years ago

The similarity with @. (esp. "things inside $() are protected from broadcasting") is quite beautiful. It's also consistent with the fact that broadcast works already for tables:

julia> t = table((a = 1:10, b = rand(10)))
Table with 10 rows, 2 columns:
a   b
────────────
1   0.515873
2   0.930648
3   0.402888
4   0.801836
5   0.600595
6   0.801115
7   0.774909
8   0.731416
9   0.572505
10  0.371466

julia> f(row) = row.a*row.b
f (generic function with 2 methods)

julia> f.(t)
10-element Array{Float64,1}:
 0.5158734752863601
 1.8612967088054502
 1.2086637555023616
 3.2073426685165565
 3.0029751634912216
 4.80669118912654  
 5.424360649636144 
 5.851328616870312 
 5.1525439684036645
 3.714664066202098

In terms of implementations, I'm still a bit confused. I almost want broadcast but there are two impediments:

sometimes JuliaDB tables are distributed, which may require some extra care (being broadcastable is not enough; the array needs to be split in memory across the processors in the correct way, though maybe I should worry about this later)
broadcast would collect the wrong way (meaning, it wouldn't perform the "array of struct to struct of arrays" transformation when collecting).
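To illustrate the second point: collecting row results the way broadcast does gives a vector of NamedTuples, whereas a columnar table wants a NamedTuple of vectors. A minimal sketch of that transformation follows; `to_columns` is a hypothetical helper written for this example, not a JuliaDB function.

```julia
# What a plain broadcast over rows would collect: an "array of structs".
rows = [(a = 1, b = 0.5), (a = 2, b = 1.5), (a = 3, b = 2.5)]

# Hypothetical helper: turn a vector of NamedTuples into a NamedTuple of
# vectors (a "struct of arrays"), assuming all rows share the same fields.
function to_columns(rows)
    names = keys(first(rows))
    return NamedTuple{names}(Tuple([r[n] for r in rows] for n in names))
end

columns = to_columns(rows)
```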

Is there a simple way to get the iterator that would result from broadcasting without collecting it?

In terms of meaning, however, I'm not sure that I would want to iterate over something other than a vector (and I definitely do not want to get errors from the broadcasting machinery if things return a custom struct, for example), so I'm still not sure whether the broadcasting API is a better rule than "iterate if isa AbstractVector". What rule exactly does DataFrames use to add a new column when the user passes a scalar? Say:

julia> df = DataFrame(x = 1:10);

julia> df.y = 3;
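The "iterate if isa AbstractVector" rule mentioned above could be sketched with two methods. The helper name `as_column` and the `Summary` struct are made up for this example, and this is not the rule DataFrames actually implements:

```julia
# Hypothetical rule: vectors are used as-is (after a length check);
# anything else, including custom structs, is repeated once per row.
function as_column(x::AbstractVector, n::Integer)
    length(x) == n || throw(DimensionMismatch("expected a vector of length $n"))
    return x
end
as_column(x, n::Integer) = fill(x, n)

# A custom struct is treated like a scalar and never iterated over.
struct Summary
    value::Float64
end

scalar_col = as_column(3, 4)
vector_col = as_column([1, 2, 3, 4], 4)
struct_col = as_column(Summary(1.5), 2)
```

Under this rule a custom struct never reaches the broadcasting machinery, which avoids the errors mentioned above at the cost of being more restrictive than broadcast.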

~Btw, this could even be the implementation: to create a new table with extra columns and if $(..) returns a scalar I use a FillArray for the new column.~ Maybe not, I'd still lose some performance as JuliaDB materializes the NamedTuple when iterating.

nalimilan commented 6 years ago

I didn't really mean that broadcast should actually be used, just that it's nice to be able to reason in a similar way, even if the implemented behavior is more restrictive.

> Is there a simple way to get the iterator that would result from broadcasting without collecting it?

I think the point of Broadcast.materialize is precisely to collect such iterators, but I'm not completely sure how it works.
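For what it's worth, a small sketch of the lazy side of Base's broadcast machinery: `Broadcast.broadcasted` builds a lazy `Broadcasted` object and `Broadcast.materialize` collects it. Iterating the lazy object directly, as done below, is assumed to be supported by the Julia version in use (it is on recent 1.x releases); the helper `lazy_sum` is made up for this example.

```julia
a = 1:10
b = fill(0.5, 10)
f(x, y) = x * y

# Lazy representation of f.(a, b); nothing is computed or allocated yet.
bc = Broadcast.instantiate(Broadcast.broadcasted(f, a, b))

# Consume the elements one at a time without building the result array.
function lazy_sum(itr)
    s = 0.0
    for v in itr
        s += v
    end
    return s
end

total = lazy_sum(bc)

# Collecting is what `materialize` does; it matches the dotted call.
result = Broadcast.materialize(bc)
```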

Currently df.y = 3 creates a new column filled with 3, but it's not clear whether we want to keep it (this code predates the .= syntax).

JuliaData / JuliaDBMeta.jl

Merge column-wise and row-wise macros #29