Open piever opened 6 years ago
I agree something like this would really help simplify the API (no need for two versions of each macro). `$` and `_I_` sound like better options than `col` to me (though `_I_` is indeed ugly). Another possibility would be to wrap the full expressions in `$` to indicate that they are to be substituted with a value evaluated in a different scope: `$(mean(:SepalLength))`.
The "change to colwise scope" is a very nice solution as I don't have to double the notation. `@where iris :SepalLength > $(mean(:SepalLength))` makes it clear that I'm using one or more columns to compute a scalar.
The only thing that this is not covering is the so-called "window functions" (functions that take a column of length `n` and return a column of length `n` but are not element-wise) like `lead`, `lag`, `rank`, `sort`, etc. Meaning, how would I write `@where_vec iris :SepalLength .- lag(:SepalLength) .> 1`?
The only thing that comes to mind is that if the expression inside the dollar evaluates to a vector, then I should iterate on it, but I'm not fully convinced.
Do you have some suggestions for this as well? I may have to check how dplyr handles this.
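To make the window-function case concrete, here is a minimal sketch in plain Julia, with no macros involved. The `lag` helper is hypothetical (written for this example, not taken from JuliaDBMeta or any other package) and the data are made up. It only illustrates the column-wise computation that a call like the `@where_vec` one above would have to perform:

```julia
# Hypothetical helper: shift a column down by one position,
# padding the first slot with `missing`.
lag(v::AbstractVector) = [missing; v[1:end-1]]

sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]  # made-up values

# `lag` needs the whole column (it is not element-wise), while `.-` and `.>`
# are broadcast. The first comparison is `missing`, so treat it as `false`.
mask = coalesce.(sepal_length .- lag(sepal_length) .> 1, false)

selected = sepal_length[mask]
```

The point is that `lag(...)` has to see the full vector, but its result is still consumed element by element, which is what makes this case awkward for a purely row-wise macro.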
Good point. Maybe iterating is a good rule. More precisely, you could broadcast operations, as if the full expression was wrapped in `@.`. Expressions inside `$()` would be protected from broadcasting.
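For comparison, Base Julia's `@.` macro already has this kind of escape: a function call prefixed with `$` is left undotted. A minimal sketch on a plain vector (the data are made up, and this illustrates only the broadcasting rule, not any JuliaDBMeta macro):

```julia
using Statistics  # for `mean`

sepal_length = [5.1, 4.9, 6.3, 5.8, 7.1]  # made-up values

# `@.` dots every call and operator, except calls spliced in with `$`.
# This expands to `sepal_length .> mean(sepal_length)`.
above_mean = @. sepal_length > $mean(sepal_length)
```

So the mental model "everything is broadcast unless it is inside `$`" has a direct precedent in the language.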
dplyr's window functions are documented here: https://dplyr.tidyverse.org/articles/window-functions.html AFAICT, everything is vectorized there, but in R there's no difference between `>` and `.>` so it doesn't matter (they mention automatic recycling).
The similarity with `@.` (esp. "things inside `$()` are protected from broadcasting") is quite beautiful. It's also consistent with the fact that broadcast works already for tables:
julia> t = table((a = 1:10, b = rand(10)))
Table with 10 rows, 2 columns:
a b
────────────
1 0.515873
2 0.930648
3 0.402888
4 0.801836
5 0.600595
6 0.801115
7 0.774909
8 0.731416
9 0.572505
10 0.371466
julia> f(row) = row.a*row.b
f (generic function with 2 methods)
julia> f.(t)
10-element Array{Float64,1}:
0.5158734752863601
1.8612967088054502
1.2086637555023616
3.2073426685165565
3.0029751634912216
4.80669118912654
5.424360649636144
5.851328616870312
5.1525439684036645
3.714664066202098
In terms of implementation, I'm still a bit confused. I almost want `broadcast`, but there are two impediments:
- `broadcast` would collect the wrong way (meaning, it wouldn't perform the "array of structs to struct of arrays" transformation when collecting). Is there a simple way to get the iterator that would result from broadcasting without collecting it?
- In terms of meaning, however, I'm not sure that I would want to iterate over something other than a vector (and I definitely do not want to get errors from the broadcasting machinery if things return a custom struct, for example), so I'm still not sure whether the broadcasting API is a better rule than "iterate if `isa AbstractVector`".
What rule is used exactly for `DataFrames` to add a new column when the user passes a scalar? Say:
julia> df = DataFrame(x = 1:10);
julia> df.y = 3;
~Btw, this could even be the implementation: to create a new table with extra columns and if `$(..)` returns a scalar I use a `FillArray` for the new column.~ Maybe not, I'd still lose some performance as JuliaDB materializes the `NamedTuple` when iterating.
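The "iterate if `isa AbstractVector`" rule could be sketched roughly as below. `as_column` is a hypothetical helper invented for this example, and `fill` is used instead of a lazy constant array only to keep the sketch dependency-free:

```julia
# Hypothetical helper: turn the result of a `$(...)` expression into a column
# of length `n`. Vectors are used as-is; anything else is treated as a scalar.
function as_column(val, n::Integer)
    if val isa AbstractVector
        length(val) == n || throw(DimensionMismatch("expected length $n, got $(length(val))"))
        return val
    end
    return fill(val, n)  # a lazy constant array would avoid this allocation
end

as_column(3, 4)             # scalar repeated for every row
as_column([1, 2, 3, 4], 4)  # vector passed through unchanged
as_column((1, 2), 2)        # a tuple is not a vector, so it is repeated
```

Unlike full broadcasting, this rule never tries to iterate over custom structs or tuples, which is the restriction discussed above.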
I didn't really mean that `broadcast` should actually be used, just that it's nice to be able to reason in a similar way, even if the implemented behavior is more restrictive.
> Is there a simple way to get the iterator that would result from broadcasting without collecting it?
I think that's the point of `Broadcast.materialize`, to collect such iterators, but I'm not completely sure how it works.
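A minimal sketch of how this looks in Base, assuming plain vectors rather than a table (the function `f` and the data are made up): `Broadcast.broadcasted` builds a lazy object without computing anything, and `Broadcast.materialize` turns it into an array.

```julia
f(a, b) = a * b  # made-up element-wise function

xs = [1, 2, 3]
ys = [10.0, 20.0, 30.0]

# Lazy representation of `f.(xs, ys)`; nothing is computed yet.
lazy = Broadcast.instantiate(Broadcast.broadcasted(f, xs, ys))

# The lazy object can be reduced without allocating the full result...
total = sum(lazy)

# ...and `materialize` is the step that `f.(xs, ys)` uses to collect it.
result = Broadcast.materialize(lazy)
```

Whether a table type could plug its own "struct of arrays" collection into this step is a separate question that the sketch does not answer.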
Currently `df.y = 3` creates a new column filled with `3`, but it's not clear whether we want to keep it (this code predates the `.=` syntax).
@nalimilan had a beautiful suggestion here: https://github.com/JuliaData/DataFrames.jl/issues/1514#issuecomment-423222435.
There may actually be very little need to have separate row-wise and column-wise macros. The row-wise macro could simply also accept columns (as regular vectors) with a different syntax.
For example, now if we need to filter values for which `:SepalLength` is greater than `5` in the dataset `iris` we'd do:
Whereas if we need to compare with something that requires the whole column, we'd need to switch to `@where_vec` and add a `.` for broadcasting:
The idea would be to find a syntax so that we'd only use the row-wise macro but find a way to refer to columns (at macro expand time the symbol is replaced with the corresponding column):
This would be mostly non-breaking but at the same time would make column-wise macros redundant.
I like the idea a lot but am unsure about the syntax. As of now, in row-wise macros `_` refers to the row, symbols refer to fields, and `cols(c)` can be used to instruct the macro that `c` is a variable that evaluates to a symbol, so it should be replaced with the field (consistent with DataFramesMeta and StatPlots). In column-wise macros `_` refers to the table and symbols correspond to columns, and `cols(c)` has the corresponding role.
What would be an extra syntax to use in row macros?
Candidates:
- `$SepalLength`
- `col(:SepalLength)`, but it could be too confusing given `cols`
- `_I_.SepalLength`, where `_I_` would be replaced by a table-like object with dot overloading to extract columns? It does look a bit ugly though.
In the first two cases I'm also a bit confused about what one would do if the column is passed programmatically (by, say, `c = :SepalLength`).
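To make the candidates easier to compare, here is a rough sketch in plain Julia of what the expanded code would need to do, whichever syntax is chosen. A `NamedTuple` of vectors stands in for the table-like object `_I_` (it already supports dot access to columns); the data are made up and none of this is JuliaDBMeta's actual macro expansion:

```julia
using Statistics  # for `mean`

# Stand-in for `_I_`: a NamedTuple of columns (made-up data).
columns = (SepalLength = [5.1, 4.9, 6.3, 5.8, 7.1],
           Species = ["a", "a", "b", "b", "c"])

# Row iterator: one NamedTuple per row.
rows = [(SepalLength = s, Species = sp)
        for (s, sp) in zip(columns.SepalLength, columns.Species)]

# Row-wise predicate that also refers to a whole column, computed once.
threshold = mean(columns.SepalLength)
kept = filter(row -> row.SepalLength > threshold, rows)

# Programmatic column: the symbol is only known at run time.
c = :SepalLength
threshold_c = mean(getproperty(columns, c))
kept_c = filter(row -> getproperty(row, c) > threshold_c, rows)
```

With a table-like object, the programmatic case reduces to `getproperty(columns, c)`, which may be one argument in favor of the third candidate despite its looks.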