JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Vectorization Roadmap #16285

Closed stevengj closed 7 years ago

stevengj commented 8 years ago

Now that #15032 is merged, here are the main remaining steps discussed in #8450, roughly in order that they should be implemented:

More speculative proposals probably for the 0.6 timeframe (suggested by @yuyichao):

ViralBShah commented 8 years ago

Label created and applied here.

tkelman commented 8 years ago

but does the label mean syntax vectorization or SIMD vectorization?

quinnj commented 8 years ago

How about a SIMD label?

stevengj commented 8 years ago

My thinking was for vectorization at the syntax and library level, not at the instruction level (SIMD).

stevengj commented 8 years ago

For interest, here is Chapel's solution to loop fusion of element-wise operations, which is apparently based on zippered iterators. Though it's not clear to me whether they actually do fusion ... I'm not sure whether they handle more than calling a single function like + with multiple arguments, as opposed to fusing multiple element-wise function calls.

stevengj commented 8 years ago

And here is ZPL's approach, though it sounds like it is limited to built-in vectorized operations in the language. (It is a compile-time optimization, so it is limited to statements that the compiler can understand and prove things about.) We may eventually want to have something like this optimization so that a few important linear-algebra operations like array + array and array * scalar can fuse without requiring the user to add . explicitly, but it is a very different beast to implement than syntax/parser-level "dot" fusion.

In Haskell, similar fusion optimizations were implemented for operations like map, reduce, etcetera defined in terms of a loopR primitive. (But it relies on function purity analysis and some inlining to succeed.) And another paper describes loop fusion and array operations in SAC (a language which apparently combines the pedantry of strict functional languages with the low-level laboriousness of C).

There is a classic 1995 paper on loop fusion in C++ via template-based metaprogramming tricks (expression templates), but it lacks generality: it only works for objects of a certain vector type, with certain operations on that type that were implemented in advance to use the template fusion technique. This kind of thing is used in Boost and various other C++ libraries.
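For flavor, the same trick is easy to sketch in Julia: a lazy vector type whose getindex fuses the loop. (This is a hypothetical illustration, with the element type pinned to Float64 for brevity; as with the C++ version, it only works for operations written to build the lazy nodes.)

struct LazySum{A,B} <: AbstractVector{Float64}
    a::A
    b::B
end
Base.size(s::LazySum) = size(s.a)
Base.getindex(s::LazySum, i::Int) = s.a[i] + s.b[i]   # fusion happens here
⊕(a, b) = LazySum(a, b)   # stand-in operator, avoids piracy on Base.:+

a, b, c = rand(3), rand(3), rand(3)
s = (a ⊕ b) ⊕ c    # builds a tree of lazy nodes; no temporary arrays
collect(s)         # one loop evaluates the whole chain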

This 2003 paper on array programming in F-script explicitly discusses the difficulty of loop fusion (viewed as a compiler optimization) in an object-oriented context, because of the difficulty of verifying that fusion does not change the result when confronted with user-defined types and operations.

stevengj commented 8 years ago

@bramtayl, the more I think about it, the more I think that you're right and it would be a good idea to support syntactic fusion for more operations than just broadcast — the same semantics, just a way to substitute another function for broadcast in the lowered expression. Most other people looking at fusion in array languages seem to worry about reduction operations, for example. And in some ways it seems contrary to Julia's spirit to have a single "blessed" function, broadcast, "baked" into the language in a manner that cannot be replicated by a user-defined function.

It would be nice to use some kind of dots, as a syntactic reminder of the connection to dot calls, and foo..(args...) seems available. e.g. sum..(sin.(3 .+ x.^2)) would be equivalent to sum(_ -> sin(3 + _^2), x).
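For comparison, what the proposed lowering would buy can be written by hand in today's Julia (sum..(...) itself is hypothetical syntax):

x = rand(5)

sum(sin.(3 .+ x.^2))       # today: the dot calls fuse, but sum still consumes a materialized temporary array
sum(t -> sin(3 + t^2), x)  # what sum..(sin.(3 .+ x.^2)) would lower to: no intermediate array at all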

Questions:

Given a choice of syntax etcetera, the implementation of something like sum..(f.(args...)) from above seems like it should be pretty trivial ... I don't think it will require more than 10 lines in julia-syntax.scm.

bramtayl commented 8 years ago

I'm just happy to have had a reasonable idea for once! I think it would be helpful to follow the original syntax. So

reduce..(sin.(3 .+ x.^2), +) would turn into reduce(x -> sin(3 + x^2), +, x)

more generally, f..(args...; kwargs...) would turn into f(args..., mapped_over_objects...; kwargs...)

where a dotted expression hiding in args... or kwargs... would be replaced with its function equivalent.

dotted_expression by itself would be sugar for broadcast..(dotted_expression)

Edit: dotted expressions could also lead to the creation of a ParsedVectorization type, containing an anonymous function and a tuple of containers. Then functional programming functions can have special methods for ParsedVectorization.

stevengj commented 8 years ago

The biggest problem in my mind is the broadcast semantics. E.g. if x is a row vector and y is a column vector, then x .+ y produces a matrix, but if sum..(x .+ y) turns into sum(+, x, y), it seems like the sum function would have to implement the whole broadcast machinery in order to give the result == sum(x .+ y) that most people would expect.

On the other hand, if sum..(x .+ y) --> sum(+, x, y) just throws an error (if sum can't handle arguments of different shape), it's not the end of the world. (Right now, sum can't even handle multiple arguments to its function, but that could easily be fixed.)
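A concrete illustration of the shape problem (sum..(...) remains hypothetical syntax):

x = [1 2 3]        # 1×3 row "vector"
y = [10, 20, 30]   # length-3 column vector

sum(x .+ y)        # broadcasts to a 3×3 matrix first; == 198
sum(+, x, y)       # MethodError today; a naive sum..(x .+ y) lowering would hit this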

bramtayl commented 8 years ago

Hmm, well, there could also be a lazy_broadcast function applied automatically (or as part of the sum call), which would take all the args in what I called mapped_over_objects above and replace them with non-traditionally indexed views, such that, if x is a row vector and y is a column vector, x[2, 2] would be x[2] and y[2, 2] would be y[1, 2]. I guess this would require faster views.
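A toy sketch of what such a broadcast-shaped view might look like (all names hypothetical, and performance is ignored):

struct LazyBroadcastView{T,N,A} <: AbstractArray{T,N}
    a::A
    sz::NTuple{N,Int}
end
LazyBroadcastView(a::AbstractArray{T}, sz::NTuple{N,Int}) where {T,N} =
    LazyBroadcastView{T,N,typeof(a)}(a, sz)
Base.size(v::LazyBroadcastView) = v.sz
# index as if `a` had been broadcast up to `sz`, without materializing it
Base.getindex(v::LazyBroadcastView{T,N}, I::Vararg{Int,N}) where {T,N} =
    v.a[ntuple(d -> size(v.a, d) == 1 ? 1 : I[d], ndims(v.a))...]

y = [10, 20, 30]                   # column vector
vy = LazyBroadcastView(y, (3, 3))  # acts like repeat(y, 1, 3)
vy[2, 2]                           # == y[2] == 20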

bramtayl commented 8 years ago

There is also the possibility of trying to get multiple .. calls to work together. Like reduce..(broadcast..(mapslices..())). Here, I think, internal mapslices could be replaced with lazy_mapslices views, and internal broadcast functions could be replaced with lazy_broadcast views. More generally, functional programming functions would be replaced by combinable lazy reshapes/views, and evaluation would be delayed until assignment. Or this could be done with metaprogramming, defining @reduce, @broadcast, @mapslices etc. which could iteratively reoptimize some sort of query of the underlying data.

andyferris commented 8 years ago

it would be a good idea to support syntactic fusion for more operations than just broadcast

How about a macro (in Base), like @reduce, so that @reduce(+, A .+ B) will replace broadcast with mapreduce (we possibly need a broadcastreduce() function?). Seems simplest to me, and it should more-or-less preserve the meaning of .op.

There could be another macro which changes the meaning of the .op, like @fuse(f, expr).
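The rule duplication described here is mechanical enough to sketch. Below is a hypothetical @reduce (not in Base) that handles only the single pattern a .+ b:

macro reduce(op, ex)
    # re-implement the relevant lowering rule at the macro level:
    # `a .+ b` parses as Expr(:call, Symbol(".+"), a, b)
    if ex isa Expr && ex.head == :call && ex.args[1] == Symbol(".+")
        a, b = esc(ex.args[2]), esc(ex.args[3])
        # zip sidesteps broadcast's shape semantics: same-shape args only
        return :(mapreduce(t -> t[1] + t[2], $(esc(op)), zip($a, $b)))
    end
    error("@reduce: unsupported expression")
end

A, B = [1, 2, 3], [4, 5, 6]
@reduce(+, A .+ B)   # == sum(A .+ B) == 21, without allocating A .+ B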

stevengj commented 8 years ago

@andyferris, unfortunately macros alone can't do this, since they only see the syntax before lowering, whereas the conversion of dot calls to broadcast happens during lowering.

andyferris commented 8 years ago

@stevengj Right, but why does that matter? The macro would just have to do a little more work: by-parsing (pun intended...) lowering by implementing the exact same rules but at the macro level, transforming the .+ and .() etc the correct way. Or by doing something a bit different to broadcast, but similar enough that it makes sense (like the @reduce macro I suggested above).

(On a side note, this is making me wonder whether fusing at the language level is strictly necessary, since a macro (in Base) can do it. The important bit is defining the meaning of .() and .op to use broadcast and .= to use broadcast! - we can easily add a @fuse macro to v0.5 (to get the fusing behaviour for .op that we expect to see in v0.6) with zero breakage...).

blairn commented 8 years ago

I'm finding this all very confusing. I came to this thread from a request for some way to infix the map function, so we can do something like

[1,2,3,4] .|> times2 |> sum

would this mean it is...

[1,2,3,4] |> times2. |> sum? or something else?

bramtayl commented 8 years ago

@blairn I have reasonable ways to do things like that, I think, in ChainMap.jl. Given the ideas in this thread, I'll try to include support for non-broadcast functions in @over in the next release.

blairn commented 8 years ago

I end up with something that looks like:

remap(f) = x -> map(f, x)
repmap(f) = x -> pmap(f, x)

then [1,2,3,4] |> remap(times2) |> sum

Except the chains are about 10 functions long and have a bunch of geospatial transforms all through them.

I'll have a look at ChainMap, but I was hoping people would remember where a lot of the requests for this stuff came from. Things seem to have drifted well away from the original requests (not in a bad way), but I am worried that we will end up with something that causes a bunch of old issues to be reopened.

stevengj commented 8 years ago

@andyferris, I'd hate to see the requirement for @fuse macros sprinkled all over the place. And since the semantics depend on whether f.(g.(x)) is fusing or non-fusing, it's hard to change our minds and make it fusing later if it were non-fusing now.

andyferris commented 8 years ago

I'd hate to see the requirement for @fuse macros sprinkled all over the place

@stevengj Very true - I'd prefer it to be automatic. My primary comment was that things other than broadcast can be achieved with a macro rather than .. syntax (it was just a side note that we could use a fusion macro, though as you point out there may be semantic challenges).

bramtayl commented 8 years ago

Another cool thing to do would be to convert

x .= reduce..(args...) to reduce!(x, dot_parsed_args...)

for any functional programming function. To do that would probably require solving the syntax mess around !. Maybe

x .= reduce..(args...) would go to inplace(x, reduce, dot_parsed_args...)

where

inplace(x, f::typeof(reduce), args...) = reduce!(x, args...)

that would enable things like

x .= push..(2)

maybe. If I'm not just confusing myself
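The dispatch part of the inplace idea is already expressible; a runnable toy (the names, and the parser emitting the call, are hypothetical):

inplace(x, ::typeof(map), f, src) = map!(f, x, src)
inplace(x, ::typeof(broadcast), f, srcs...) = broadcast!(f, x, srcs...)

x = zeros(3)
inplace(x, map, sin, [1.0, 2.0, 3.0])   # fills x in place, as x .= sin.(...) would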

bramtayl commented 8 years ago

After further thought,

reduce..(broadcast..(mapslices..(args1...), args2...), arg3...)

could do something like call

function_fuse(reduce, broadcast, mapslices)(args1, args2, args3)

Where, for example,

function_fuse(f1::typeof(reduce), f2::typeof(map)) = mapreduce
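
With the one method above, function_fuse is already runnable as a toy (a sketch; the argument plumbing for the general case is glossed over):

mapreduce(abs2, +, [1, 2, 3])                    # == 14
function_fuse(reduce, map)(abs2, +, [1, 2, 3])   # same result via the fused lookup, with no intermediate array from map
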
stevengj commented 8 years ago

The parser can just as easily add a ! rather than transforming to a call to inplace. Either way, however, it is assuming that the requisite method exists.

bramtayl commented 8 years ago

Note that the master version of ChainMap.jl, which only passes tests on Julia 0.5 and above, has experimental support for "woven" functional programming.

bramtayl commented 8 years ago

OK, I've gotten ChainMap into a place where it's relatively stable, and I'd like to propose an alternative method of dot vectorization based on the work there.

Here's how the syntax would change:

map( ~ sin(~x + y) ) goes to map( LazyCall(x -> sin(x + y), x) )
broadcast( ~ sin(~x + ~y) ) goes to broadcast( LazyCall( (x, y) -> sin(x + y), x, y) )
mapreduce( ( ~ sin(~x + y) ), +) goes to mapreduce( LazyCall( x -> sin(x + y), x), +)

Splats (needs more thought, I think):

~ f(~(x...)) goes to LazyCall( (x...) -> f(x...), x...)
~ f(~(;x...)) goes to LazyCall( (;x...) -> f(;x...); x...)

Advantages over the current system:

1) No ambiguity over what counts as a scalar and what doesn't
2) Reduction in the amount of dots necessary for a complicated expression
3) Support for non-broadcast operations
4) Support for passing multiple LazyCalls to one function
5) Partial support for lazy evaluation
6) No need for fusion

Disadvantages:

1) ~ is already taken as bit negation, and @~ is taken for formulas. I like the symbol because it reminds me of a woven thread, but maybe another ASCII character will do?
2) Not backwards compatible; people will need to learn a new system
3) More verbose in some cases
4) Not sure about performance impacts

All of this is implemented in ChainMap at the moment, aside from the new methods for map and friends. It can probably stay in a package for now if people aren't interested in adding it to base.
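For concreteness, a minimal sketch of what the LazyCall container might look like (hypothetical code, modeled loosely on the description above; ChainMap.jl's actual implementation may differ):

struct LazyCall{F,A<:Tuple}
    f::F
    args::A
end
LazyCall(f, args...) = LazyCall(f, args)

# one way map and friends could consume it:
Base.map(lc::LazyCall) = map(lc.f, lc.args...)

lc = LazyCall((x, y) -> sin(x + y), [1.0, 2.0], [3.0, 4.0])
map(lc)   # == [sin(4.0), sin(6.0)]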

ChrisRackauckas commented 8 years ago

map( ~ sin(~x + y) ) goes to map( LazyCall(x -> sin(x + y), x) )
broadcast( ~ sin(~x + ~y) ) goes to broadcast( LazyCall( (x, y) -> sin(x + y), x, y) )
mapreduce( ( ~ sin(~x + y) ), +) goes to mapreduce( LazyCall( x -> sin(x + y), x), +)

I'm sorry, but I'm not seeing how this proposal would clean up code at all. What would ~x + ~y be? And

2) Reduction in the amount of dots necessary for a complicated expression

Would be true, but there wouldn't be fewer ~s than there were dots.

andyferris commented 8 years ago

While we are on the topic of new ideas, yesterday @c42f and I had some fun playing with the idea of "destructuring" an array.

Sometimes you have a vector of composite types ("structs"), and you might want to refer to all the values of a certain field. Consider this operation and potential syntax we could implement here with the dot-call idea:

v::Vector{Complex{Float64}}

# get the real components using current syntax
broadcast(x -> x.re, v) 
broadcast(x -> getfield(x, :re), v)  # alternative syntax

# Kind-of works, but need to add scalar-size methods to `Symbol` 
# Also is slow because it doesn't inline the :re properly
getfield.(v, :re) 

# Potential new convenience syntax (dot call on dot reference):
v..re

There is also a similar thing where you might have a vector of tuples or a vector of vectors, and we can currently do something like

v::Vector{Vector{Int}}

# get the first elements
getindex.(v, 1)

# potential new syntax
v.[1]

v_tuple::Vector{Tuple}

# Doesn't seem to be optimal speed (tuple indexing with Const is special)
getindex.(v_tuple, 1)

To conclude: as a performance fix we could (probably should) support faster getfield.() for vectors of structs and faster getindex.() of vectors of vectors/tuples.

As a new syntax proposal, we might consider supporting operators .. and .[] as broadcasted syntax of getfield and getindex respectively.

bramtayl commented 8 years ago

@ChrisRackauckas

Agreed, less verbose, more flexible.

~x + ~y would just be ~x + ~y (it doesn't pass through a macro)
~ ~x + ~y would be LazyCall( (x, y) -> x + y, x, y)

In some cases, ~ cuts down on the number of markers compared to dots. E.g. sin.(cos.(tan.(x .+ 1))) vs broadcast( ~ sin(cos(tan( ~x + 1) ) ) )

ChrisRackauckas commented 8 years ago

But broadcast( ~ sin(cos(tan( ~x + 1) ) ) ) is still a much worse option because broadcast should not be in the mathematical expression. The syntax is to get rid of the "programming" from the "math". I like to be able to glance at the LaTeX and then glance at the code and see the correctness, while getting performance.

ChrisRackauckas commented 8 years ago

Also, I don't see the reason for the LazyCall in this case. Making the type and the tuple container for the arguments adds to the amount of allocations, when what we really want is simple syntax for allocation-free code.

bramtayl commented 8 years ago

Agreed that the pure math aspect of things gets lost.

There are allocation free ways of going about this:

map( ~ sin(~x + y) ) goes to map( x -> sin(x + y), x)
broadcast( ~ sin(~x + ~y) ) goes to broadcast( (x, y) -> sin(x + y), x, y)
mapreduce( ( ~ sin(~x + y) ), +) goes to mapreduce( x -> sin(x + y), +, x)

And in fact, ChainMap has some convenience macros for doing this:

@map sin(_ + y) goes to map( _ -> sin(_ + y), _) (_ is used as part of the chaining)
@broadcast sin(~x + ~y) goes to broadcast( (x, y) -> sin(x + y), x, y)

But to do this you need to make assumptions about argument order (put the anonymous function first, the woven arguments last).

LazyCalls do have other advantages. They can be merged, pushed to, revised, etc. This is particularly useful for making plot calls.

StefanKarpinski commented 8 years ago

@bramtayl: if you've got a counter-proposal, you should open an issue or start a julia-dev thread.

bramtayl commented 8 years ago

Ok, see #18915

wsshin commented 7 years ago

I have a couple of questions about the third goal in this vectorization roadmap, which is to "parse operators like a .+ b as broadcast(+, a, b)". If a .+ b is now parsed as broadcast(+, a, b), why do the two calls below perform so differently?

julia> @benchmark [1,2,3] .+ 1
BenchmarkTools.Trial:
  memory estimate:  224.00 bytes
  allocs estimate:  2
  minimum time:     109.144 ns (0.00% GC)
  median time:      152.579 ns (0.00% GC)
  mean time:        203.520 ns (19.59% GC)
  maximum time:     5.679 μs (94.30% GC)
  samples:          10000
  evals/sample:     916
  time tolerance:   5.00%
  memory tolerance: 1.00%

julia> @benchmark broadcast(+, [1,2,3], 1)
BenchmarkTools.Trial:
  memory estimate:  320.00 bytes
  allocs estimate:  6
  minimum time:     583.749 ns (0.00% GC)
  median time:      599.648 ns (0.00% GC)
  mean time:        784.025 ns (14.38% GC)
  maximum time:     74.077 μs (98.31% GC)
  samples:          10000
  evals/sample:     179
  time tolerance:   5.00%
  memory tolerance: 1.00%


The above benchmark results were obtained with the following Julia version:

julia> versioninfo()
Julia Version 0.6.0-dev.1565
Commit 0408aa2* (2016-12-15 03:02 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin16.0.0)
  CPU: Intel(R) Core(TM)2 Duo CPU T9900 @ 3.06GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NOAFFINITY Penryn)
  LAPACK: libopenblas64
  LIBM: libopenlibm
  LLVM: libLLVM-3.7.1 (ORCJIT, penryn)

stevengj commented 7 years ago

@wsshin, yes, .+ etcetera now (pending merge) works for tuples, but it probably needs some performance work in that case.

julia> (3,4,5) .+ 7
(10,11,12)

julia> (3,4,5) .+ (8,9,10)
(11,13,15)

As for your benchmark, there is currently an inlining failure (#19608) that is slowing down broadcast and should hopefully be fixed soon. Because of it, the overhead is significant for very small vectors.

(Of course, for very small vectors like [1,2,3] you will always get better performance from StaticArrays.jl: in StaticArrays the length of the vector is known at compile time too, so the entire loop can be unrolled and inlined, and, unlike [1,2,3], the result does not need to be heap-allocated.)
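To see the StaticArrays point concretely (assuming the StaticArrays.jl and BenchmarkTools.jl packages):

using StaticArrays, BenchmarkTools

v  = [1, 2, 3]
sv = SVector(1, 2, 3)   # length is part of the type

@btime $v .+ 1    # heap-allocates the result
@btime $sv .+ 1   # unrolled and inlined; no heap allocation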

stevengj commented 7 years ago

Closing, since all items are now merged.

Further improvements are possible, such as a generic syntax like sum:(x .+ 5) for fusing any function, making x[x .> 5] fusing, and so on, and of course performance improvements (e.g. @Sacha0's planned work on improving sparsity preservation in broadcast), but they can have their own issues marked with the vectorization label.
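As one concrete example of the x[x .> 5] item, the fused version can be written by hand today (sum:(x .+ 5) remains a syntax sketch):

x = collect(1:10)
x[x .> 5]              # allocates a temporary Bool mask, then the result
filter(t -> t > 5, x)  # single pass, no mask: what a fusing x[x .> 5] could do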

stevengj commented 7 years ago

(By the way, the performance problems noted by @wsshin were fixed by #19639.)