stevengj closed this issue 7 years ago.
Label created and applied here.
But does the label mean syntax vectorization or SIMD vectorization?
How about a SIMD label?
My thinking was for vectorization at the syntax and library level, not at the instruction level (SIMD).
For interest, here is Chapel's solution to loop fusion of element-wise operations, which is apparently based on zippered iterators. It's not clear to me whether they actually do fusion, though; I'm not sure whether they handle more than calling a single function like `+` with multiple arguments, as opposed to fusing multiple element-wise function calls.
And here is ZPL's approach, though it sounds like it is limited to built-in vectorized operations in the language. (It is a compile-time optimization, so it is limited to statements that the compiler can understand and prove things about.) We may eventually want something like this optimization so that a few important linear-algebra operations like `array + array` and `array * scalar` can fuse without requiring the user to add `.` explicitly, but that is a very different beast to implement than syntax/parser-level "dot" fusion.
In Haskell, similar fusion optimizations were implemented for operations like `map`, `reduce`, etcetera, defined in terms of a `loopR` primitive. (But it relies on function-purity analysis and some inlining to succeed.) Another paper describes loop fusion and array operations in SAC (a language which apparently combines the pedantry of strict functional languages with the low-level laboriousness of C).
There is a classic 1995 paper on loop fusion in C++ via template-metaprogramming tricks (expression templates), but it lacks generality: it only works for objects of a certain vector type, with operations on that type that were implemented in advance to use the template-fusion technique. This kind of thing is used in Boost and various other C++ libraries.
This 2003 paper on array programming in F-Script explicitly discusses the difficulty of loop fusion (viewed as a compiler optimization) in an object-oriented context: it is hard to verify that fusion does not change the result when confronted with user-defined types and operations.
@bramtayl, the more I think about it, the more I think that you're right: it would be a good idea to support syntactic fusion for more operations than just `broadcast` — the same semantics, just a way to substitute another function for `broadcast` in the lowered expression. Most other people looking at fusion in array languages seem to worry about reduction operations, for example. And in some ways it seems contrary to Julia's spirit to have a single "blessed" function, `broadcast`, baked into the language in a manner that cannot be replicated by a user-defined function.

It would be nice to use some kind of dots, as a syntactic reminder of the connection to dot calls, and `foo..(args...)` seems available. E.g. `sum..(sin.(3 .+ x.^2))` would be equivalent to `sum(_ -> sin(3 + _^2), x)`.
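For concreteness, the proposed rewrite can be checked against its hand-written equivalent in current Julia (`sum..` itself is hypothetical syntax; only the rewritten calls below actually run):

```julia
x = [0.1, 0.2, 0.3]

# sum..(sin.(3 .+ x.^2)) would lower to something like this:
fused = sum(t -> sin(3 + t^2), x)   # single pass, no temporary array

# versus the unfused version, which allocates an intermediate array:
unfused = sum(sin.(3 .+ x.^2))

fused ≈ unfused   # true
```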
Questions:

- What about functions that take additional arguments, like `reduce`? Maybe the fusion would only occur on the last argument? So e.g. `reduce..(+, sin.(3 .+ x.^2))` would be equivalent to `reduce(_ -> sin(3 + _^2), +, x)`? Is this flexible enough?
- How would this interact with `.=`, if at all? Would `y .= foo..(f.(x))` call `foo!(f, y, x)`?
- Would we lose `broadcast`'s flexibility about combining arrays of different shapes? Or is this a feature, e.g. so that you could call `map..(f.(g.(x)))` and get `map(_ -> f(g(_)), x)` if you want `map` rather than `broadcast` semantics (e.g. for some arbitrary iterable container)?
- Another possible syntax would be `:`, as in `sum:(sin.(3 .+ x.^2))`, but I'm worried about the ambiguity with the ternary operator `... ? ... : ...`.

Given a choice of syntax etcetera, the implementation of something like `sum..(f.(args...))` from above seems like it should be pretty trivial ... I don't think it will require more than 10 lines in `julia-syntax.scm`.
I'm just happy to have had a reasonable idea for once! I think it would be helpful to follow the original syntax. So

`reduce..(sin.(3 .+ x.^2), +)`

would turn into

`reduce(_ -> sin(3 + _^2), +, x)`

More generally,

`f..(args...; kwargs...)`

would turn into

`f(args..., mapped_over_objects...; kwargs...)`

where a dotted expression hiding in `args...` or `kwargs...` would be replaced with its function equivalent. A `dotted_expression` by itself would be sugar for `broadcast..(dotted_expression)`.
Edit: dotted expressions could also lead to the creation of a `ParsedVectorization` type, containing an anonymous function and a tuple of containers. Then functional-programming functions could have special methods for `ParsedVectorization`.
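As a sanity check on the general rewrite rule above, here are a couple of instances written out in today's Julia (the `..` left-hand sides are hypothetical; the right-hand sides run as-is):

```julia
x = [1.0, 2.0, 3.0]

# reduce..(sin.(x), +)  ⇒  pass the function and container separately:
mapreduce(sin, +, x)

# maximum..(abs.(x .- 2))  ⇒
maximum(t -> abs(t - 2), x)   # == 1.0
```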
The biggest problem in my mind is the broadcast semantics. E.g. if `x` is a row vector and `y` is a column vector, then `x .+ y` produces a matrix, but if `sum..(x .+ y)` turns into `sum(+, x, y)`, it seems like the `sum` function will have to implement the whole `broadcast` machinery in order to give the result `== sum(x .+ y)` that most people would expect.

On the other hand, if `sum..(x .+ y) --> sum(+, x, y)` just throws an error (if `sum` can't handle arguments of different shape), it's not the end of the world. (Right now, `sum` can't even handle multiple arguments to its function, but that could easily be fixed.)
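The shape concern is easy to demonstrate in current Julia: a fused `sum(+, x, y)` that simply zipped its arguments would not match the broadcasted result:

```julia
x = [1 2 3]        # 1×3 row vector
y = [10; 20; 30]   # 3-element column vector

sum(x .+ y)                           # broadcasts to a 3×3 matrix first: 198

# a naive fused sum that just zips the arguments gives a different answer:
sum(a + b for (a, b) in zip(x, y))    # 66
```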
Hmm, well, there could also be a `lazy_broadcast` function applied automatically (or as part of the `sum` call), which would take all the args in what I called `mapped_over_objects` above and replace them with non-traditionally indexed views, such that, if `x` is a row vector and `y` is a column vector, `x[2, 2]` would be `x[2]` and `y[2, 2]` would be `y[2]`. I guess this would require faster views.
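A minimal sketch of what such a `lazy_broadcast` view could look like: a wrapper whose `getindex` applies broadcast's rule that length-1 dimensions repeat. Every name here (`LazyBC`, `bcindex`) is made up for illustration:

```julia
# wrapper: a function plus two arguments, indexed with broadcast semantics
struct LazyBC{F,X,Y}
    f::F
    x::X
    y::Y
end

# clamp indices to 1 along singleton dimensions, as broadcast does
bcindex(a::AbstractMatrix, i, j) = a[size(a, 1) == 1 ? 1 : i, size(a, 2) == 1 ? 1 : j]
bcindex(a::AbstractVector, i, j) = a[i]   # a column: ignore the second index

Base.getindex(b::LazyBC, i, j) = b.f(bcindex(b.x, i, j), bcindex(b.y, i, j))

x = [1 2 3]        # row vector (1×3)
y = [10, 20, 30]   # column vector
b = LazyBC(+, x, y)
b[2, 2]            # x[2] + y[2] == 22, without materializing x .+ y
```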
There is also the possibility of trying to get multiple `..` calls to work together, like `reduce..(broadcast..(mapslices..()))`. Here, I think, internal `mapslices` calls could be replaced with `lazy_mapslices` views, and internal `broadcast` functions could be replaced with `lazy_broadcast` views. More generally, functional-programming functions would be replaced by combinable lazy reshapes/views, and evaluation would be delayed until assignment. Or this could be done with metaprogramming, defining `@reduce`, `@broadcast`, `@mapslices`, etc., which could iteratively re-optimize some sort of query of the underlying data.
> it would be a good idea to support syntactic fusion for more operations than just `broadcast`
How about a macro (in `Base`), like `@reduce`, so that `@reduce(+, A .+ B)` will replace `broadcast` with `mapreduce` (we possibly need a `broadcastreduce()` function?). Seems simplest to me, and it should more-or-less preserve the meaning of `.op`.
There could be another macro which changes the meaning of the `.op`, like `@fuse(f, expr)`.
@andyferris, unfortunately macros alone can't do this, since they only see the syntax before lowering, whereas the conversion of dot calls to `broadcast` happens during lowering.
@stevengj Right, but why does that matter? The macro would just have to do a little more work: by-parsing (pun intended...) lowering by implementing the exact same rules at the macro level, transforming the `.+` and `.()` etc. the correct way. Or by doing something a bit different from `broadcast`, but similar enough that it makes sense (like the `@reduce` macro I suggested above).
(On a side note, this is making me wonder whether fusing at the language level is strictly necessary, since a macro (in `Base`) can do it. The important bit is defining the meaning of `.()` and `.op` to use `broadcast` and of `.=` to use `broadcast!` - we could easily add a `@fuse` macro to v0.5 (to get the fusing behaviour for `.op` that we expect to see in v0.6) with zero breakage...)
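To make the macro-level suggestion concrete, here is a rough sketch of a `@reduce`-style macro that re-implements the single easiest lowering rule itself: it pattern-matches one dotted binary call like `A .+ B` and emits a `mapreduce` over zipped arguments (so it assumes equal shapes, sidestepping the broadcast-shape question). The macro name and its scope are hypothetical:

```julia
macro reduce(op, ex)
    # expect a dotted binary operator call; (A .+ B) parses as a call to the symbol .+
    ex isa Expr && ex.head == :call && ex.args[1] isa Symbol &&
        startswith(String(ex.args[1]), ".") ||
        error("@reduce expects a dotted binary call like A .+ B")
    f = esc(Symbol(String(ex.args[1])[2:end]))   # .+  →  +
    a, b = esc(ex.args[2]), esc(ex.args[3])
    # fuse: never materialize A .+ B, reduce over the zipped elements directly
    :(mapreduce(t -> $f(t[1], t[2]), $(esc(op)), zip($a, $b)))
end

A = [1, 2, 3]; B = [10, 20, 30]
@reduce(+, A .+ B)   # == sum(A .+ B) == 66
```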
I'm finding this all very confusing. I've followed this thread from a request for some way to infix the map function, so we can do something like

`[1,2,3,4] .|> times2 |> sum`

Would this mean it is `[1,2,3,4] |> times2. |> sum`? Or something else?
@blairn I have reasonable ways to do things like that, I think, in ChainMap.jl. Given the ideas in this thread, I'll try to include support for non-broadcast functions in `@over` in the next release.
I end up with something that looks like `remap(f) = x -> map(f, x)` and `repmap(f) = x -> pmap(f, x)`, and then `[1,2,3,4] |> remap(times2) |> sum`. Except the chains are about 10 functions long, and have a bunch of geospatial transforms all through them.
I'll have a look at ChainMap, but I was hoping people would remember where a lot of the requests for this stuff came from. Things seem to have gone well away from the original requests - not in a bad way, but I am worried that we will end up with something that causes a bunch of old issues to be reopened.
@andyferris, I'd hate to see the requirement for `@fuse` macros sprinkled all over the place. And since the semantics depend on whether `f.(g.(x))` is fusing or non-fusing, it's hard to change our minds and make it fusing later if it were non-fusing now.
> I'd hate to see the requirement for `@fuse` macros sprinkled all over the place
@stevengj Very true - I'd prefer it to be automatic. My primary comment was that things other than `broadcast` can be achieved with a macro rather than `..` syntax (it was just a side note that we could use a fusion macro, though as you point out there may be semantic challenges).
Another cool thing would be to convert

`x .= reduce..(args...)`

to

`reduce!(x, dot_parsed_args...)`

for any functional-programming function. To do that would probably require solving the syntax mess around `!`. Maybe

`x .= reduce..(args...)`

would go to

`inplace(x, reduce, dot_parsed_args...)`

where

`inplace(x, f::typeof(reduce), args...) = reduce!(x, args...)`

That would enable things like `x .= push..(2)`, maybe. If I'm not just confusing myself.
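A sketch of what that `inplace` indirection could look like with today's functions; `inplace` and its methods are hypothetical, and `map!` stands in for the in-place variants (a `reduce!` does not exist in Base):

```julia
# generic fallback: compute out-of-place, then copy into the destination
inplace(x, f, args...) = (x .= f(args...); x)

# functions with a real in-place variant add a specialized method:
inplace(x, ::typeof(map), f, v) = map!(f, x, v)

x = zeros(3)
inplace(x, map, abs, [-1.0, 2.0, -3.0])   # fills x via map!, no temporary
```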
After further thought,

`reduce..(broadcast..(mapslices..(args1...), args2...), args3...)`

could do something like call

`function_fuse(reduce, broadcast, mapslices)(args1, args2, args3)`

where, for example,

`function_fuse(f1::typeof(reduce), f2::typeof(map)) = mapreduce`
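The dispatch table sketched there already works with existing functions; only the `function_fuse` name (and its composition fallback) is invented here:

```julia
# known fused kernels
function_fuse(::typeof(reduce), ::typeof(map)) = mapreduce

# fallback: no fused kernel registered, so just nest the calls
function_fuse(f1, f2) = (g, op, v) -> f1(op, f2(g, v))

fused = function_fuse(reduce, map)   # picks the mapreduce kernel
fused(abs, +, [-1, 2, -3])           # == mapreduce(abs, +, [-1, 2, -3]) == 6
```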
The parser could just as easily add a `!` rather than transforming to a call to `inplace`. Either way, however, it is assuming that the requisite method exists.
Note that the master version of ChainMap.jl, which only passes tests on >= 0.5, has experimental support for "woven" functional programming.
OK, I've gotten ChainMap into a place where it's relatively stable, and I'd like to propose an alternative method of dot vectorization based on the work there.

Here's how the syntax would change:

- `map( ~ sin(~x + y) )` goes to `map( LazyCall(x -> sin(x + y), x) )`
- `broadcast( ~ sin(~x + ~y) )` goes to `broadcast( LazyCall( (x, y) -> sin(x + y), x, y) )`
- `mapreduce( ( ~ sin(~x + y) ), +)` goes to `mapreduce( LazyCall( x -> sin(x + y), x), +)`

Splats (needs more thought, I think):

- `~ f(~(x...) )` goes to `LazyCall( (x...) -> f(x...), x...)`
- `~ f(~(;x...) )` goes to `LazyCall( (;x...) -> f(;x...); x...)`

Advantages over the current system:

1. No ambiguity over what counts as a scalar and what doesn't
2. Reduction in the amount of dots necessary for a complicated expression
3. Support for non-broadcast operations
4. Support for passing multiple `LazyCall`s to one function
5. Partial support for lazy evaluation
6. No need for fusion

Disadvantages:

1. `~` is already taken as bit negation, and `@~` is taken for formulas. I like the symbol because it reminds me of a woven thread, but maybe another ASCII character will do?
2. Not backwards compatible; people will need to learn a new system
3. Less verbose
4. Not sure about performance impacts

All of this is implemented in ChainMap at the moment, aside from the new methods for `map` and friends. It can probably stay in a package for now if people aren't interested in adding it to base.
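A minimal sketch of the `LazyCall` container (the real one lived in ChainMap.jl; this toy version just shows the shape of the idea):

```julia
struct LazyCall{F,A<:Tuple}
    f::F       # the woven anonymous function
    args::A    # the containers it maps over
end
LazyCall(f, args...) = LazyCall(f, args)

# higher-order functions get methods that unpack the LazyCall:
Base.map(l::LazyCall) = map(l.f, l.args...)
Base.broadcast(l::LazyCall) = broadcast(l.f, l.args...)

x = [1, 2, 3]; y = 10
map(LazyCall(t -> sin(t + y), x))   # same as map(t -> sin(t + y), x)
```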
> `map( ~ sin(~x + y) )` goes to `map( LazyCall(x -> sin(x + y), x) )`
> `broadcast( ~ sin(~x + ~y) )` goes to `broadcast( LazyCall( (x, y) -> sin(x + y), x, y) )`
> `mapreduce( ( ~ sin(~x + y) ), +)` goes to `mapreduce( LazyCall( x -> sin(x + y), x), +)`
I'm sorry, but I'm not seeing how this proposal would clean up code at all. What would `~x + ~y` be? And

> 2) Reduction in the amount of dots necessary for a complicated expression

would be true, but there wouldn't be fewer `~` than there were `.`.
While we are on the topic of new ideas, yesterday @c42f and I had some fun playing with the idea of "destructuring" an array. Sometimes you have a vector of composite types ("structs"), and you might want to refer to all the values of a certain field. Consider this operation and the potential syntax we could implement here with the dot-call idea:
```julia
v::Vector{Complex{Float64}}

# get the real components using current syntax
broadcast(x -> x.re, v)
broadcast(x -> getfield(x, :re), v)  # alternative syntax

# Kind-of works, but need to add scalar-size methods to `Symbol`
# Also is slow because it doesn't inline the :re properly
getfield.(v, :re)

# Potential new convenience syntax (dot call on dot reference):
v..re
```
There is a similar situation where you might have a vector of tuples or a vector of vectors, and we can currently do something like
```julia
v::Vector{Vector{Int}}

# get the first elements
getindex.(v, 1)
# potential new syntax
v.[1]

v_tuple::Vector{Tuple}
# Doesn't seem to be optimal speed (tuple indexing with Const is special)
getindex.(v_tuple, 1)
```
To conclude: as a performance fix we could (probably should) support faster `getfield.()` for vectors of structs and faster `getindex.()` for vectors of vectors/tuples. As a new syntax proposal, we might consider supporting the operators `..` and `.[]` as broadcasted syntax for `getfield` and `getindex`, respectively.
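The performance-fix half of this already runs today; a quick check of the broadcasting forms discussed above (everything here is existing syntax, with the proposed sugar noted in comments):

```julia
v = [1 + 2im, 3 + 4im]
getfield.(v, :re)   # == [1, 3]   (proposed sugar: v..re)
real.(v)            # == [1, 3]   the idiomatic spelling for this particular field

vv = [[1, 2], [3, 4]]
getindex.(vv, 1)    # == [1, 3]   (proposed sugar: vv.[1])
first.(vv)          # == [1, 3]
```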
@ChrisRackauckas Agreed, less verbose, more flexible.

`~x + ~y` would just be `~x + ~y` (it doesn't pass through a macro); `~ ~x + ~y` would be `LazyCall( (x, y) -> x + y, x, y)`.
In some cases, the number of dots compared to `~` is cut down on, e.g. `sin.(cos.(tan.(x .+ 1)))` vs `broadcast( ~ sin(cos(tan( ~x + 1) ) ) )`.
But `broadcast( ~ sin(cos(tan( ~x + 1) ) ) )` is still a much worse option, because `broadcast` should not be in the mathematical expression. The point of the syntax is to separate the "programming" from the "math": I like to be able to glance at the LaTeX and then glance at the code and see the correctness, while getting performance.

Also, I don't see the reason for the `LazyCall` in this case. Making the type and the tuple container for the arguments adds to the number of allocations, when what we really want is simple syntax for allocation-free code.
Agreed that the pure-math aspect of things gets lost. There are allocation-free ways of going about this:

- `map( ~ sin(~x + y) )` goes to `map( x -> sin(x + y), x)`
- `broadcast( ~ sin(~x + ~y) )` goes to `broadcast( (x, y) -> sin(x + y), x, y)`
- `mapreduce( ( ~ sin(~x + y) ), +)` goes to `mapreduce( x -> sin(x + y), +, x)`

And in fact, ChainMap has some convenience macros for doing this:

- `@map sin(_ + y)` goes to `map( _ -> sin(_ + y), _)` (`_` is used as part of the chaining)
- `@broadcast sin(~x + ~y)` goes to `broadcast( (x, y) -> sin(x + y), x, y)`

But to do this you need to make assumptions about argument order (put the anonymous function first, the woven arguments last).
`LazyCall`s do have other advantages: they can be merged, pushed to, revised, etc. This is particularly useful for building plot calls.
@bramtayl: if you've got a counter-proposal, you should open an issue or start a julia-dev thread.
Ok, see #18915
I have a couple of questions about the third goal in this vectorization roadmap, which is to "parse operators like `a .+ b` as `broadcast(+, a, b)`".

- Because `broadcast()` works for tuples, will we have `a .+ b` for tuples `a` and `b` when the third goal is achieved?
- `broadcast(+, a, b)` is ~4 times slower than `a .+ b` for vectors `a` and `b` (see the benchmark result below). Is this going to be fixed before the third goal is achieved?
```
julia> using BenchmarkTools

# a .+ b:
samples: 10000  evals/sample: 916  time tolerance: 5.00%  memory tolerance: 1.00%

# broadcast(+, a, b):
samples: 10000  evals/sample: 179  time tolerance: 5.00%  memory tolerance: 1.00%
```

The above benchmark results were obtained with the following Julia version:

```
julia> versioninfo()
Julia Version 0.6.0-dev.1565
Commit 0408aa2* (2016-12-15 03:02 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.0.0)
  CPU: Intel(R) Core(TM)2 Duo CPU T9900 @ 3.06GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NOAFFINITY Penryn)
  LAPACK: libopenblas64
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, penryn)
```
@wsshin, yes, `.+` etcetera now (pending merge) works for tuples, but it probably needs some performance work in that case.

```julia
julia> (3,4,5) .+ 7
(10,11,12)

julia> (3,4,5) .+ (8,9,10)
(11,13,15)
```
As for your benchmark, there is currently an inlining failure that is slowing down `broadcast` (#19608), which should hopefully be fixed soon. As a result, for very small vectors, the overhead is significant.

(Of course, for very small vectors like `[1,2,3]` you will always get better performance from StaticArrays.jl, since in StaticArrays the length of the vector is known at compile time too, so the entire loop can be unrolled and inlined, not to mention that the result does not need to be heap-allocated, unlike `[1,2,3]`.)
Closing, since all items are now merged.

Further improvements are possible, such as a generic syntax like `sum:(x .+ 5)` for fusing any function, making `x[x .> 5]` fusing, and so on, and of course performance improvements (e.g. @Sacha0's planned work on improving sparsity preservation in `broadcast`), but they can have their own issues marked with the vectorization label.

(By the way, the performance problems noted by @wsshin were fixed by #19639.)
Now that #15032 is merged, here are the main remaining steps discussed in #8450, roughly in the order that they should be implemented:

- Fix the performance of `broadcast` to be at least as good as `map` (#4883). The trickiest part is the ongoing discussion of what to do in the empty-array case (see #11034).
- Deprecate `@vectorized` functions like `sin(x)` in favor of `sin.(x)`. #17302
- Parse operators like `a .+ b` as `broadcast(+, a, b)`. (See also https://github.com/JuliaLang/julia/pull/6929#issuecomment-44099346, #14544, and #17393.) Existing overloaded functions `.+(a, b) = ...` can be deprecated in favor of overloading `broadcast(::typeof(+), a, b)`. #17623

More speculative proposals, probably for the 0.6 timeframe (suggested by @yuyichao):

- Treat `f.(args...)` calls as "fusing" broadcast operations at the syntax level. For example, `sin.(x .+ cos.(x .^ sum(x.^2)))` would turn (in `julia-syntax.scm`) into `broadcast((x, _s_) -> sin(x + cos(x^_s_)), x, sum(broadcast(^, x, 2)))`. Notice that the `sum` function (or any non-dot call) would be a "fusion boundary." The caller would be responsible for not using `f.(args...)` in cases where fusion would screw up side effects. #17300
- Treat `x .= ...` as "fusing" calls to `broadcast!`, e.g. `x .= sin.(x .+ y)` would act in-place with a single loop. Again, this would occur at the syntax level, so the caller would be responsible for avoiding function calls with problematic side effects. (See #7052, but the lack of loop fusion at that time made `.+=` much less attractive.) #17510
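The two speculative items correspond to equivalences that can be checked directly (the explicit `broadcast`/`broadcast!` calls below are what the sugar expands to):

```julia
x = [0.1, 0.2]; y = [0.3, 0.4]

# item 1: a fused dot expression is a single broadcast call
sin.(x .+ y) == broadcast((a, b) -> sin(a + b), x, y)   # true: one loop, one result array

# item 2: .= fuses into broadcast!, writing the destination in one loop with no temporary
z = similar(x)
z .= sin.(x .+ y)   # same as broadcast!((a, b) -> sin(a + b), z, x, y)
```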