JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

Vectorization Roadmap #16285

Closed stevengj closed 7 years ago

stevengj commented 8 years ago

Now that #15032 is merged, here are the main remaining steps discussed in #8450, roughly in order that they should be implemented:

More speculative proposals probably for the 0.6 timeframe (suggested by @yuyichao):

andyferris commented 8 years ago

Regarding the third point, can we do .= as broadcast! without worrying overly about loop fusion?

As in, A .= B would be broadcast!(x -> x, A, B), i.e. an element-wise copy. Of course, more complex expressions would require more complex loop fusion, but this gives some moderate gain. .+= and so on are very low-hanging fruit for removing at least one layer of allocation very simply.

(Hmm... that last one makes me wonder whether .sin= would be good syntax? As in, A .sin= B being like Julia 0.4's A[:] = sin(B[:]). I'm being more than a bit silly here, but it works similarly to the . prefix version like .sin(x) rather than the suffix sin.(x). It even kinda nests: A .exp.sin= x. Quite ugly though, and it messes with scoping/field syntax.)
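Sketched in the eventual dot syntax (a minimal illustration in later-Julia notation, not the 0.4-era code under discussion), the "step zero" and .+= proposals read:

```julia
# A .= B behaves like broadcast!(identity, A, B): an in-place element-wise copy.
A = zeros(3)
B = [1.0, 2.0, 3.0]
A .= B        # writes into A; no new array is allocated
A .+= 1       # behaves like A .= A .+ 1, again updating A in place
```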

stevengj commented 8 years ago

@andyferris, without loop fusion, an in-place .= is pretty useless. An expression like x .= 4y .+ 5y.^2 .- sin.(y) will still allocate lots of temporary arrays. If you just want element-wise copy, you can already do A[:] = B.

andyferris commented 8 years ago

@stevengj I agree that it is (almost) useless, except as an alternative copy syntax. It's just "step zero": it's dirt simple and sets the precedent for "step one", doing the same for .+= and all the other op= operators with a prefix ., which then starts to be useful, e.g. for x .+= 1 to be non-allocating where x is an Array. In fact, the number of op= operators currently defined is relatively small... so it shouldn't be too hard to define them all.

"Step 2" would be the more complex loop fusion or similar, and perhaps generalizations to generic functions.

On the whole, I do have to support what Jeff said in #14544 (comment). Perhaps the cleaner approach is to have some neat syntax for map/broadcast (and hopefully where loop fusion is clear to the user or compiler). I dunno.

stevengj commented 8 years ago

@andyferris, that discussion is out of date. We now do have a neat syntax for broadcast, and the possibility of loop fusion is now exposed to the compiler at the syntax level. And since that syntax is ., generalized dot operators now make a lot more sense.

andyferris commented 8 years ago

@stevengj Sorry, I missed the last week or so of #8450 (it's really hard to keep up with all the threads!).

In any case, the progress seems very cool! The parser-based loop fusion (e.g. your #8450 (comment)) seems like a great option to me. Anything more complex can still be done explicitly with loops, comprehensions, map, and broadcast.

diegozea commented 8 years ago

Could a @fuse macro be safer than doing the loop fusion at the parser level?

yuyichao commented 8 years ago

What's the advantage? We already have that in Devectorize.jl, and I don't think it will be more flexible or give you more control.

diegozea commented 8 years ago

I'm only worried about doing loop fusion everywhere. "The caller would be responsible for avoiding function calls with problematic side effects", but how can the user use the dot notation and avoid the fusion at the same time? Also, I think that @fuse would be more explicit to write and to read in the code. @fuse can do the same proposed loop fusion, but it can also check the functions to be called (i.e. annotated pure functions). The latter isn't possible at the syntax level.

yuyichao commented 8 years ago

The caller would be responsible for avoiding function calls with problematic side effects

I think this is not an issue for the majority of cases.

How can the user use the point notation and avoid the fusion at the same time?

Split the line

but it can also check the functions to be called (i.e. annotated pure functions).

This is impossible. Or at least it won't be better at doing that.

stevengj commented 8 years ago

@diegozea, macros operate at the syntax level, without knowing types, so they can't check purity. The same goes for the proposed f.(args...) fusing. (In practice, though, vectorized operations are virtually never used in cases with side effects, much less side effects that would be affected by fusion. And if f.(args...) is defined as always fusing, then you can reliably know what to expect.)

Think of f.(args...) as a much more compact and discoverable syntax for @fuse. (Once we deprecate e.g. sin(x) in favor of sin.(x) etcetera, people doing Matlab-like computations will end up using fused loops automatically in many cases due to the deprecation warning, whereas they might never learn about @fuse.)

stevengj commented 8 years ago

Another way of saying it is that fusing the loops is a good idea in the vast majority of cases. Not wanting to fuse is the rare exception. That makes it sensible to make fusing the default, and in the rare cases where you don't want to fuse you can either split the line or just call broadcast explicitly.

(Note that this whole discussion was not even possible before the . notation. If you write f(g(x)) in 0.4, there is no practical generic way to look "inside" f and g and discover that they are broadcast operations that ought to be fused, whereas f.(g.(x)) makes the user's intent clear at the syntax level, enabling a syntax-level fusing transformation.)
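For concreteness, here are the two ways of opting out of fusion mentioned above, with hypothetical pure f and g:

```julia
f(x) = x + 1
g(x) = 2x
x = [1, 2, 3]

y_fused = f.(g.(x))                         # one fused loop over x
tmp = g.(x); y_split = f.(tmp)              # split the line: two loops, one temporary
y_explicit = broadcast(f, broadcast(g, x))  # explicit broadcast calls: also unfused
```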

diegozea commented 8 years ago

Sorry, I meant the parse, not syntax, level in my last sentence. However, I believed that a macro could access a function's metadata.

yuyichao commented 8 years ago

However I believed that a macro could access a function metadata.

No, it can't. For a start, it doesn't even have any idea about the binding: one is allowed to do sin = cos; sin(1) (or a much more reasonable form of this). @pure is also per method, and you can't check that without knowing all the type info.
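A small demonstration of why the binding matters (using nothing beyond Base):

```julia
# A macro sees only the symbol `sin`; the binding can be changed locally,
# so no syntax-level tool can know which function (or its purity) is meant.
result = let sin = cos   # legal local rebinding, as in the comment above
    sin(0.0)             # actually calls cos
end
```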

stevengj commented 8 years ago

@diegozea, macros are called at the parse level, which is what we mean by the "syntax" level here. i.e. at the point the macro is called, all that is known is the abstract syntax tree (AST); there is no information about types or bindings.

martinholters commented 8 years ago

While fusion by default sounds like a good idea, I find it a bit unsettling that

y = g.(f.(x))

and

a = f.(x)
y = g.(a)

might give different results. Is that just me?

nalimilan commented 8 years ago

Not different results, just different ways of returning the same result. That's fine IMHO. In the long-term, the compiler might become smarter and detect more complex cases.

martinholters commented 8 years ago

@nalimilan You are assuming pure f and g.
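A hypothetical illustration of that caveat: with impure f and g, the fused and unfused forms return the same values but interleave their side effects differently.

```julia
# Hypothetical impure f and g that record their call order:
calls = String[]
f(x) = (push!(calls, "f($x)"); x + 1)
g(x) = (push!(calls, "g($x)"); 2x)

empty!(calls)
y1 = g.(f.([1, 2]))          # fused: evaluates f(1), g(2), f(2), g(3)
fused_order = copy(calls)

empty!(calls)
a = f.([1, 2])               # unfused: f(1), f(2), ...
y2 = g.(a)                   # ... then g(2), g(3)
unfused_order = copy(calls)

# Same values, different side-effect ordering:
# y1 == y2, but fused_order != unfused_order.
```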

johnmyleswhite commented 8 years ago

FWIW, I suspect most functions people are vectorizing are pure.

quinnj commented 8 years ago

Maybe we need to do some more exploration around function traits (purity, guaranteed return types, boolean, etc.) and how they could be incorporated into some of these murkier discussions (vectorization, return types, etc.). It seems like having some explicit declarations around function traits would avoid the need to rely on inference or the compiler.

andyferris commented 8 years ago

I'm worried that overly complex rules for defining how loop fusion works will just become too confusing for users. The suggestion that A .= B .+ C ... becomes syntactic sugar for map/broadcast means that, once we are all used to it, we will easily be able to reason about code we see and know how to write a simple fused loop expression for vectors of data.

If it is a simple parsing-level rule, then the differences in @martinholters's comment will be obvious. If it is compile-time magic, we will be spending a lot of time trying to figure out if the compiler is really doing what we want it to do.

But coming back to nested . functions/operators, would we be able to avoid allocation when matrix multiplication is in the middle, e.g. v1 .= v2 .+ m1*v3 (where m1 is a matrix, and v* are vectors)? Correct me if I'm wrong, but isn't this BLAS's daxpy? We would want that to be non-allocating, so hopefully whatever syntax we come up with would allow for this kind of thing (where the order of multiplications and additions for the matrix multiplication is cache-optimized as in BLAS, not direct as in map and broadcast).
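For reference, in later Julia versions this particular update can be written allocation-free with LinearAlgebra.mul! plus an in-place broadcast, though not as one fused expression (a sketch, not the syntax being proposed here):

```julia
using LinearAlgebra

m1 = [1.0 2.0; 3.0 4.0]
v2 = [10.0, 20.0]
v3 = [1.0, 1.0]
v1 = similar(v2)

mul!(v1, m1, v3)   # v1 = m1 * v3 computed in place; no temporary for the product
v1 .+= v2          # fused in-place add
```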

yuyichao commented 8 years ago

@andyferris Yes, the hope is that the parser-level transformation can cover most use cases with well-defined semantics. It is in principle possible to add support for more complicated constructs (I very briefly talked about this with @andreasnoack), but I personally feel it is hard to come up with a syntax that covers all the cases beyond what can currently be achieved with broadcast (or a mutating version of it).

Thinking out loud: maybe it could be achieved by having a lazy array type (so the computation is done on the fly within broadcast)? That would be an orthogonal change, though. It would also be hard to recognize that and call BLAS functions.

stevengj commented 8 years ago

@andyferris, the whole point is that the proposed loop fusion becomes a simple parsing-level guarantee, not compile-time at all. If you see f.(g.(x)), then you know that the loops are always fused into a single broadcast, regardless of the bindings of f, g, or x. This allows you to reason simply about the code consistently. It is indeed just syntactic sugar.

This is very different from a compile-time optimization that may or may not occur.

andyferris commented 8 years ago

@stevengj Yes, I understand and agree completely (my first two paragraphs were directed at @quinnj).

the whole point is that the proposed loop fusion becomes a simple parsing-level guarantee,

To my taste, the keyword is "guarantee". I really do like it.

I was more thinking along the lines of what @yuyichao was thinking out loud about. If matrix-vector multiplication were lazy (returning an iterable over i of sum(M[i,:] .* v) or similar), then it would "just work" with no temporaries, given the combination of your suggested parser changes and the introduction of the lazy multiplication type (and thus no "compile-time optimizations" are necessary beyond what already exists in Julia). It would be interesting to compare performance to daxpy.
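A minimal sketch of that lazy-multiplication idea (LazyMatVec is a hypothetical type, not anything in Base): indexing computes sum(M[i,:] .* v) on demand, so broadcasting over it never materializes the product.

```julia
struct LazyMatVec{T} <: AbstractVector{T}
    M::Matrix{T}
    v::Vector{T}
end
Base.size(L::LazyMatVec) = (size(L.M, 1),)
# Each element is computed on the fly: no M*v temporary is ever allocated.
Base.getindex(L::LazyMatVec, i::Int) = sum(L.M[i, j] * L.v[j] for j in 1:size(L.M, 2))

m1 = [1.0 2.0; 3.0 4.0]
v2 = [10.0, 20.0]
v3 = [1.0, 1.0]
v1 = similar(v2)
v1 .= v2 .+ LazyMatVec(m1, v3)   # one loop over i; the lazy product is read per element
```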

(of course, when you say "compile-time" optimization I interpret that as changes to compilation after lowering, not changes to definitions in Base.LinAlg.)

A similar thing for matrix-matrix multiplication is much, much harder. Although we could have arbitrarily clever iterators, they might or might not be the correct way to reimplement a somewhat efficient dgemm in native Julia, or to call out to it "automagically". But this is definitely something worth considering for the future, and for the development of syntax, because superfluous temporaries after multiplying matrices (along with adding them, etc., which hopefully will be fixed in this roadmap) are IMHO currently one of Julia's numerical performance bottlenecks when working with very large matrices/tensors/etc. (On that note, does an interface to dgemm! currently exist for memory-constrained users who want to extract the last bit of efficiency out of Julia?)

StefanKarpinski commented 8 years ago

@stevengj: are you planning on tackling this or should we figure out who else can tackle it?

stevengj commented 8 years ago

I'm planning on tackling the loop fusion. The other parts require someone to improve the type computation in broadcast (similar to #16622), and I was hoping someone else would tackle that. 😊

StefanKarpinski commented 8 years ago

If we don't implement "syntactic fusing" for 0.5 (status, @stevengj?) and plan on implementing it in a future version, we should amply document that this will change in the future and note that people should only use nested broadcasting in cases where such a transformation would not change the meaning.

Otherwise the only change above that seems to be slated to make it into 0.5 is the improvement of the output type computation.

JeffBezanson commented 8 years ago

We will only do #4883 for 0.5. That issue is part of the milestone, so moving this to 0.6

stevengj commented 8 years ago

Sorry, I've been traveling for multiple weeks; just got back yesterday.

ViralBShah commented 8 years ago

We missed you at JuliaCon.

s-broda commented 8 years ago

I wonder whether, rather than just deprecating log(a::Array) and friends (#17302), a better use of the syntax freed up by #17300 wouldn't be to make it syntactic sugar for calling a vector math library, leveraging the unified call syntax from @rprechelt's vectorize.jl. Wouldn't it be a bit silly to have to write @vectorize log.(a::Array) to "re-vectorize" the call?

I suppose the main drawback would be that it would make the log.(a::Array) syntax less discoverable, but I submit that it's even less likely that a Matlab convert would discover the @vectorize macro on their own, leaving a ~7x speed boost on the table.

martinholters commented 8 years ago

@s-broda If there are better (faster) implementations for certain operations, one can always add a specialized method, like broadcast(::typeof(log), a::Array{Float64})=... to use that for log.(A).
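Under the 0.5-era lowering, where log.(a) became a plain broadcast(log, a) call, such a specialization might look like the following (fastlog is a hypothetical stand-in for an Accelerate/Yeppp!/VML kernel; later Julia versions route dot calls through Broadcast.broadcasted instead, so the hook point differs there):

```julia
fastlog(x::Float64) = log(x)   # placeholder for a vector-math-library call
# Specialize broadcast for log on Float64 arrays. (Note: extending a Base
# function on Base types like this is "type piracy" and is discouraged
# outside Base itself; this is purely illustrative.)
Base.broadcast(::typeof(log), a::Array{Float64}) = map(fastlog, a)
```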

s-broda commented 8 years ago

@martinholters Fair enough. Is there any advantage to special casing it there versus defining a new method for log(a::Array) that I'm missing?

I'm not sure that this is simply a faster implementation, as there may be a 1 or 2 ULP difference between the current implementation and a vector math library. That's why I thought that having a special (convenient) syntax reflecting the difference in semantics might be useful.

stevengj commented 8 years ago

@s-broda, the plan is for log(a::Array) to go away (or at least to be deprecated quasi-permanently for the benefit of users coming from other languages). Besides the advantage of a completely general, automatic vectorization syntax, it's extremely useful to expose the user's intention of an element-wise broadcast at the syntax level, with log.(a), because that enables loop fusion, in-place operations, and other transformations that are very difficult to do at later compilation stages if the compiler has to figure things out on its own. And there is no real disadvantage of special-casing broadcast vs. special-casing log.

s-broda commented 8 years ago

@stevengj I think there is a misunderstanding here. I couldn't agree more about the usefulness of the sin.(a) syntax, for precisely the reasons that you mention - indeed it's the one feature that I've been crossing my fingers would make it into 0.5, and I couldn't be more thrilled that you tackled it so quickly. My point was rather the opposite: namely, that because this effectively frees up the sin(a::Array) syntax, the now redundant sin(a::Array) could now be made syntactic sugar for calling into Accelerate/Yeppp!/VML in cases where no loop fusion is required. This, too, could thus happen at the syntax level, rather than relying on the compiler or calling Accelerate/Yeppp!/VML explicitly.

I suppose I should have raised this on the mailing list rather than here, sorry.

stevengj commented 8 years ago

@s-broda, that's not syntactic sugar, that is just an ordinary method, and requires nothing new in the language; you'll still be able to define foo(a::Array) methods if you want, regardless of the foo.(a) support.

davidanthoff commented 8 years ago

@s-broda And that could be done easily in a package.

toivoh commented 8 years ago

I would like to argue that once we stop supporting sin(a::Array), it should give an error, period (with an error message that tells you to use sin.(a), if possible).

Just because we stop supporting it doesn't mean that people will stop trying to use it, and we want them to get the error message to know that is not the intended way. Also, I really don't think that loading a package should change the behavior of code that doesn't use it.

tkelman commented 8 years ago

Agreed, the standard "type piracy" guideline is that a package should not extend base methods except on types defined by that package.

stevengj commented 8 years ago

Funnily enough, it turns out that the "more speculative" 0.6 proposals were actually easier to implement. Or more fun, at least, since they don't affect backwards-compatibility much.

StefanKarpinski commented 8 years ago

Life is so breezy when you don't have to worry about breaking people's code. Those were the days!

ViralBShah commented 8 years ago

I guess deprecating the old behaviours is something that should still wait.

stevengj commented 8 years ago

@ViralBShah, yeah, it's way too late in the 0.5 cycle for a massive deprecation.

bramtayl commented 8 years ago

It's pretty easy to write hijackable versions of broadcast. This would allow using the loop-fusion mechanism for any function that would benefit from it, like filter or mapslices. I think(?) all that would be required is using the two functions below instead of the regular versions when parsing the dot syntax.

function maybe_broadcast(args...; _f = broadcast, kwargs...)
  _f(args...; kwargs...)
end

function maybe_broadcast!(args...; _f = broadcast!, kwargs...)
  _f(args...; kwargs...)
end

stevengj commented 8 years ago

How exactly would filter(f, g.(args...)) be able to use loop fusion? It seems like the parser would have to know about the filter function.

bramtayl commented 8 years ago

Maybe I'm missing something. Wouldn't

a.(b.(c.(A)), _f = filter)

turn into

maybe_broadcast(x -> a(b(c(x))), A, _f = filter)

turn into

filter(x -> a(b(c(x))), A)

?

Edit: Is the issue that it would turn into

maybe_broadcast((x, f) -> a(b(c(x)), _f = filter), A, filter)

stevengj commented 8 years ago

@bramtayl, no, it would turn into maybe_broadcast(x -> a(b(c(x)); _f = filter), A). Keyword arguments get passed to the respective function, not to broadcast.

stevengj commented 8 years ago

It seems like we'd need a new syntax, like filter..(a.(b.(c.(A)))).

bramtayl commented 8 years ago

Darn. Well, that syntax seems nice?

stevengj commented 8 years ago

Don't take my offhand syntax suggestions too seriously! It would take a while and a lot of thought to hash out what the implications and semantics would be, and to decide whether a new syntax is really worth it vs. just typing filter(x -> a(b(c(x))), A). See #8450 for how long it took to settle on f.(args...).

stevengj commented 8 years ago

I feel like maybe we should have a "vectorization" label to group issues and PRs related to this stuff?