Keno commented 7 years ago

We are missing significant optimization opportunity for small constant-size loops. For such loops it is not profitable to generate run-time alias checks, but we still loose significant performance due to lack of vectorization. I'm proposing introducing a macro to annotate variables that are known not to alias, similar to restrict in C. My initial thought is something like this.

@assert_nolias a::Array # Implies that `a` does not share memory with any other variable in the current scope

For generic structs:

@assert_noalias b::Foo # morally equivalent to
for i in nfields(typeof(b))
@assert_noalias getfield(b, i)
end

I think these semantics are pretty strong and may be weakened over time, but it seems like an ok place to start. Eventually we may want to introduce more first class support for non-aliasing arrays (which array almost are, were it not for sharing by reshaping etc).

Motivating benchmark is https://github.com/jeff-regier/Celeste.jl/issues/483

Keno commented 7 years ago

Motivating example that does not require loading Celeste:

function foo(a,b)
    @inbounds for i = 1:3, j=1:3
         a[i] += b[i,j]
    end
end
a = [1,2,3]
b = [1 2 3; 4 5 6; 7 8 9]

Looking at the IR:

define void @julia_foo_71319(%jl_value_t*, %jl_value_t*) #0 !dbg !5 {
top:
  %2 = bitcast %jl_value_t* %1 to i64**
  %3 = load i64*, i64** %2, align 8
  %4 = getelementptr inbounds %jl_value_t, %jl_value_t* %1, i64 3, i32 0
  %5 = bitcast %jl_value_t** %4 to i64*
  %6 = load i64, i64* %5, align 8
  %7 = bitcast %jl_value_t* %0 to i64**
  %8 = load i64*, i64** %7, align 8
  %9 = load i64, i64* %8, align 8
  %10 = load i64, i64* %3, align 8
  %11 = add i64 %10, %9
  store i64 %11, i64* %8, align 8
  %12 = getelementptr i64, i64* %3, i64 %6
  %13 = load i64, i64* %12, align 8
  %14 = add i64 %13, %11
  store i64 %14, i64* %8, align 8
  %15 = shl i64 %6, 1
  %16 = getelementptr i64, i64* %3, i64 %15
  %17 = load i64, i64* %16, align 8
  %18 = add i64 %17, %14
  store i64 %18, i64* %8, align 8
  %19 = getelementptr i64, i64* %8, i64 1
  %20 = load i64, i64* %19, align 8
  %21 = getelementptr i64, i64* %3, i64 1
  %22 = load i64, i64* %21, align 8
  %23 = add i64 %22, %20
  store i64 %23, i64* %19, align 8
  %24 = add i64 %6, 1
  %25 = getelementptr i64, i64* %3, i64 %24
  %26 = load i64, i64* %25, align 8
  %27 = add i64 %26, %23
  store i64 %27, i64* %19, align 8
  %28 = or i64 %15, 1
  %29 = getelementptr i64, i64* %3, i64 %28
  %30 = load i64, i64* %29, align 8
  %31 = add i64 %30, %27
  store i64 %31, i64* %19, align 8
  %32 = getelementptr i64, i64* %8, i64 2
  %33 = load i64, i64* %32, align 8
  %34 = getelementptr i64, i64* %3, i64 2
  %35 = load i64, i64* %34, align 8
  %36 = add i64 %35, %33
  store i64 %36, i64* %32, align 8
  %37 = add i64 %6, 2
  %38 = getelementptr i64, i64* %3, i64 %37
  %39 = load i64, i64* %38, align 8
  %40 = add i64 %39, %36
  store i64 %40, i64* %32, align 8
  %41 = add i64 %15, 2
  %42 = getelementptr i64, i64* %3, i64 %41
  %43 = load i64, i64* %42, align 8
  %44 = add i64 %43, %40
  store i64 %44, i64* %32, align 8
  ret void
}

We see a number of redundant stores to %19, because LLVM doesn't know the arrays don't alias. Removing those stores and thus enabling vectorization is the goal.

Keno commented 7 years ago

I should clarify that I intend the assertion to apply to the entire scope of the local variable, not just after the assertion.

vtjnash commented 7 years ago

Keno commented 7 years ago

@vtjnash feels strongly that these semantics are too strong and that instead, the alias set should be more directly described and properly scoped wrt to normal control flow. Thus, let me suggest the following:

@assert_noalias a, b, c # a,b,c don't alias. If they have fields, no two fields alias, etc.

With the assertion being an actual assertion in that it throws an error. After this assertion the compiler can rely on regular alias analysis (assuming it can figure out the check ok if it should). I still feel like there should a command line switch to elide the check, but that feature could then be implemented using an llvm pass that looks at the generated alias check and transforms it into metadata.

Keno commented 7 years ago

The obvious problem with this is of course that the check is at least n log n time and probably an equal amount in both code and data space.

Keno commented 7 years ago

In particular, the problem I have with mandating this is the following: It's a common pattern to have a type like:

immutable Preallocated
vec::NTuple{20,Vector{Float64}}
end

Now, I feel like it would be perfectly reasonable to want to assert none of those vectors alias, and then do things with the vectors without having to manually specify exactly which alias checks are needed. However, if we actually do make sure that there is no pairwise aliasing, we very quickly end up in a situation where the check dominates the actual work to be done. Consider for example the Celeste workload above. It has precisely this situation. The fastest we can do the alias check would be go generate a size 20 sort network and do the comparison. I haven't actually written it out, but I suspect that would take ~200 vector instructions. However, each loop in that example is only ~20 vector instructions and there are five of them, so even in that case, we're looking at a 2x overhead for the check.

Ideally the semantics would allow only emitting the check if the assumption was actually used, but I think that would be difficult, both semantically and from an implementation standpoint.

Keno commented 7 years ago

I suppose at least for the array case, we could assert that their buffers are not shared (since we do keep track of that). That would only cause a linear number of checks (though still has the problem of unnecessary checks).

yuyichao commented 7 years ago

I suppose at least for the array case, we could assert that their buffers are not shared

Doesn't prevent the array to be used in multiple places?

Keno commented 7 years ago

Yeah, I suppose that's true.

tkoolen commented 7 years ago

Eigen does this with a noalias() method, called on the destination variable, as follows:

C.noalias() = A * B; // where A, B and C are Eigen Matrices

Note that with Eigen, the result of A * B is just an unevaluated Eigen expression type. The evaluation of the expression happens in the assignment operator, so effectively the noalias() applies to the evaluation of the expression as a whole. By calling noalias(), the user provides a guarantee that evaluating the expression will not alias; whether that's actually true is not verified in any way.

Maybe this idea also makes sense in Julia. Instead of saying that several variables don't alias as in https://github.com/JuliaLang/julia/issues/19658#issuecomment-268131197 (including all their fields), one could instead tell the compiler that a certain expression doesn't alias. Applied to the example from https://github.com/JuliaLang/julia/issues/19658#issuecomment-268102261, perhaps something like @noalias @inbounds for i = 1:3, j=1:3 [...] would be nice.

Here's an example where annotating sets of variables with @assert_noalias, at least in the sense of https://github.com/JuliaLang/julia/issues/19658#issuecomment-268131197, wouldn't make sense, but annotating the expression would:

data = rand(3, 9)
A = view(data, :, 1:3)
B = view(data, :, 4:6)
C = view(data, :, 7:9)
A[:] = B * C # doesn't alias, but the 'parent' field of the SubArrays A, B, and C is the same Array

I'm also not sure whether having the compiler check that aliasing really doesn't occur is really necessary, or feasible. It feels very similar to @inbounds, so why treat them differently?

Keno commented 7 years ago

The problem I see with that is that's not entirely clear what such an annotation would mean when applied to an expression. There's not really an obvious semantic to apply here, because the operators can be arbitrary functions. Eigen can get away with it, because the operations and operands are limited, but that's not the case in Julia.

Keno commented 7 years ago

@vtjnash and I have discussed this some more and we discussed that perhaps a lexically scoped const annotation would be easier to reason about and implement. The options are pretty much as follows:

Option 1

function foo(a,b)
    @inbounds @const b for i = 1:3, j=1:3
        a[i] += b[i,j]
    end
end

lowers to something like

function foo(a,b)
    @inbounds let b = Const(b)
        $(Expr(:aliasscope))
        for i = 1:3, j=1:3
            a[i] += b[i,j]
        end
        $(Expr(:popaliasscope))
    end
end

Inside an alias scope, all stores do not alias any loads from Const arrays (and it would be illegal to write to the underlying memory).

Option 2

Have functions implicitly be an alias scope, and you can manually create Const wrappers, which are valid with respect to their scoped:

function foo(a,b)
    let b = Const(b)
        @inbounds for i = 1:3, j=1:3
            a[i] += b[i,j]
        end
    end
end

Though I suppose this would mean the same thing.

function foo(a,b)
    @inbounds for i = 1:3, j=1:3
        a[i] += Const(b)[i,j]
    end
end

The second option seems somewhat nice, but I fear it will get confusing. The first option quite clearly delineates in what region the const applies, while the second option is rather implicit.

KristofferC commented 7 years ago

The first option seems to fit in with existing performance macro annotations very nicely.

Keno commented 7 years ago

It was implicit, but I should perhaps mention it explicitly. One major advantage of the Const approach is that it's enforceable in a linear amount of code if we want to do so in the future.

StefanKarpinski commented 7 years ago

Even though it's a departure from the way we've done a lot of performance annotations, I like the explicitness of the Const approach. Does it also mean that we could define methods for Const types explicitly and pass Const values through to functions that are called, or would the Const wrapper not be that first class?

tkoolen commented 7 years ago

I think the Const approach would cover a lot of use cases. Something like the numerical integration example in this blog post ('Non-aliased memory windows' section) would not be covered though. I suppose the Const approach does require something like https://github.com/JuliaLang/julia/pull/12205; otherwise I'd expect GC/allocation overhead to dominate.

Keno commented 7 years ago

Yes const would be relatively first class. Functions would be an implicit alias scope, but you could also declare alias scopes directly with a macro (which would cover the numerical integration example). I'm very much concerned about the nonlocality of the semantics though.

StefanKarpinski commented 7 years ago

I'm very much concerned about the nonlocality of the semantics though.

This is the real reason to favor the macro approach – to avoid exposing anything non-lexical.

Keno commented 7 years ago

Having played with this a whole bunch, I find the Const approach lacking. You do also want to express no-alias between different arrays you're writing to (e.g. if update them using +=).

vtjnash commented 5 years ago

I’m a bit concerned by this when I see Hal saying that LLVM doesn’t support doing this for C due to observed correctness and performance issues (i.e., it sounds like they currently get the wrong result (wrong answer, slower) in the presence of inlining or operator overloading occurring, and that fixing that gave uniformly even worse performance (because the correct annotation seems it must prohibit many more important optimizations). Ref http://llvm.1065342.n5.nabble.com/llvm-dev-alias-scope-and-local-restricted-C-pointers-tp121373p121385.html for the latest comments I’ve seen about this.)

He says there that they’re working on improving it, and interested in new ideas though. It sounds like he expects that it will never really work for C++ though, due to the ability to define constructors and operators—something that’s obviously very relevant to Julia!

Keno commented 5 years ago

I'm increasingly in favor of a model where you can freeze a mutable array into an immutable one. If the compiler can figure out that your mutable array doesn't escape, it can immediately allocate the immutable one, otherwise it'll make a copy. That would solve 99% of the cases where you want aliasing. Most of the time arrays are immutable after construction and the aliasing problems arise because the compiler doesn't know which those kinds of arrays are. Making that super performant would require more sophisticated escape analysis, but I do think we can do that along with type inference (since your code needs to be well-inferred to be fast anyway).

ChrisRackauckas commented 5 years ago

That would solve 99% of the cases where you want aliasing. Most of the time arrays are immutable after construction and the aliasing problems arise because the compiler doesn't know which those kinds of arrays are.

99% seems like a high over-estimate. It won't cover anything differential equation related. For example, with issue https://github.com/JuliaLang/julia/issues/28126 it wouldn't be applicable since the arrays are built iteratively: a is defined by b, c, d, and e, but then it's added as the next array for the linear combination to create f, and this triangle grows. In optimization you have something similar as well: you're updating mutable cache arrays using other caches you just updated. With (iterative) linear algebra you're working on a bunch of mutable cache arrays with linear combinations between them.

Maybe that thought holds for more computer science applications, but the general workflow in scientific computing seems to be:

Update cache array with newest f call (differential equation function, optimization gradient, etc.
Update internal arrays using said cache
Decide how to call f again using the internal arrays.

For step 2 to be correct we have to setup the cache arrays so that they don't alias. But the compiler doesn't know that, so what we would need is some way to tell the compiler that.

StefanKarpinski commented 5 years ago

@Keno: what you're talking about reminds me of John Rose's "larval objects" idea:

https://blogs.oracle.com/jrose/larval-objects-in-the-vm

Implementation-wise, if we could convert a mutable value to a corresponding immutable one in a usually-zero-copy fashion, that would be what we want. Another way to approach that might be to be able to mark the mutable version is "illegal to use" and reassign the memory to the immutable version. Unfortunately, I can't think of a way to do that without either forcing a flag check on each access or using memory protections and faults.

Keno commented 5 years ago

Hadn't seen that before, but yeah, it's the same concept. Implementation to be worked out, but I think something like it is promising. Note that I do think it makes sense to be able to go both ways (with the compiler eliding if possible, copies if not).

JuliaLang / julia

RFC: Macro for expression noalias hints #19658

Option 1

Option 2