IntelLabs / ParallelAccelerator.jl

The ParallelAccelerator package, part of the High Performance Scripting project at Intel Labs

ParallelAccelerator.embed() breakage #56

Closed lkuper closed 8 years ago

lkuper commented 8 years ago

The code inserted into userimg.jl by embed() seems to have stopped working recently:

/mnt/home/lkuper/pse-hpc/julia/base/precompile.jl
LoadError("sysimg.jl",319,LoadError("/mnt/home/lkuper/pse-hpc/julia/base/userimg.jl",4,AssertionError("CGen: GenSyms (generated symbols) cannot have Any (unresolved) type")))

It still uses accelerate -- is there any reason why it needs to? Could we change it to something that uses @acc? I no longer trust accelerate...

lkuper commented 8 years ago

Hmm. I can change userimg.jl to something like

Base.reinit_stdio()
include("/home/lkuper/.julia/v0.4/ParallelAccelerator/src/ParallelAccelerator.jl")
using ParallelAccelerator
@acc tmp_f(A,B)=begin runStencil((a, b) -> a[0,0] = b[0,0], A, B, 1, :oob_skip); A.*B.+2 end
tmp_f([1,2],[3,4])

and this runs fine, but it doesn't actually appear to speed up compilation. What exactly needs to be in userimg.jl for it to work?

lkuper commented 8 years ago

I spent a while trying to sort this issue out today, using the following code:

include("/home/lkuper/.julia/v0.4/ParallelAccelerator/src/ParallelAccelerator.jl")

ParallelAccelerator.set_debug_level(3)

tmp_f(A,B) = begin runStencil((a, b) -> a[0,0] = b[0,0], A, B, 1, :oob_skip); A.*B.+2 end

ParallelAccelerator.accelerate(tmp_f,(Array{Float64,1},Array{Float64,1},))

The error message is:

ERROR: LoadError: AssertionError: CGen: variable GenSym(2) cannot have Any (unresolved) type
 in from_lambda at /home/lkuper/.julia/v0.4/ParallelAccelerator/src/cgen.jl:446
 in from_expr at /home/lkuper/.julia/v0.4/ParallelAccelerator/src/cgen.jl:1974
 in from_root_entry at /home/lkuper/.julia/v0.4/ParallelAccelerator/src/cgen.jl:2410
 in toCGen at /home/lkuper/.julia/v0.4/ParallelAccelerator/src/driver.jl:210
 in accelerate at /home/lkuper/.julia/v0.4/ParallelAccelerator/src/driver.jl:409
 in accelerate at /home/lkuper/.julia/v0.4/ParallelAccelerator/src/driver.jl:363
 in include at ./boot.jl:261
 in include_from_node1 at ./loading.jl:304
while loading /home/lkuper/.julia/v0.4/ParallelAccelerator/example.jl, in expression starting on line 24

This gets as far as toFlatParfors, the "flattened code" stage. At that stage, GenSym(2) is:

GenSym(2) = $(Expr(:lambda, Any[:(a::Any),:(b::Any)], Any[Any[Any[:a,Any,0],Any[:b,Any,0]],Any[],1,Any[]], :(begin 
        (Main.getindex)(b,0,0)
        (Main.setindex!)(a,GenSym(0),0,0)
        return GenSym(0)
    end)))

I guess the Any[:(a::Any),:(b::Any)] is indicative of the problem.

This is pretty clearly the generated code for the (a, b) -> a[0,0] = b[0,0] function that's the first argument to runStencil. Adding type annotations on a and b in the source means that some type assertions also appear in the generated code:

        GenSym(2) = $(Expr(:lambda, Any[:(a::(top(apply_type))(Array,Float64,1)),:(b::(top(apply_type))(Array,Float64,1))], Any[Any[Any[:a,Any,18],Any[:b,Any,18]],Any[],1,Any[]], :(begin 
        (top(typeassert))(a,(top(apply_type))(Main.Array,Main.Float64,1))
        (top(typeassert))(b,(top(apply_type))(Main.Array,Main.Float64,1))
        (Main.getindex)(b,0,0)
        (Main.setindex!)(a,GenSym(0),0,0)
        return GenSym(0)
    end)))

But doing so results in the same Any error as the untyped version.

The code works fine with @acc instead of accelerate. (In the @acc version, there's nothing that's the equivalent of GenSym(2). The @acc version won't go through ParallelAccelerator unless we actually call the function with arguments, and in that case, the generated code is quite different -- the only occurrence of lambda in the generated code is the top-level $(Expr(:lambda, ...)).)
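
To make the difference between the two entry points concrete, here's a minimal sketch of both call patterns side by side (the toy function body is just an illustration, not the stencil code from above):

using ParallelAccelerator

# @acc path: the macro rewrites the definition, but ParallelAccelerator's
# optimization pipeline only runs once the function is called with concrete
# argument types.
@acc f(A, B) = A .* B .+ 2
f([1.0, 2.0], [3.0, 4.0])   # pipeline runs here, for (Array{Float64,1}, Array{Float64,1})

# accelerate path: the pipeline is invoked eagerly for an explicit signature,
# with no call site involved (and it's this path that hits the CGen assertion
# above when a runStencil lambda is involved).
g(A, B) = A .* B .+ 2
ParallelAccelerator.accelerate(g, (Array{Float64,1}, Array{Float64,1}))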

By the way, using the function tmp_f(A,B) ... syntax instead of the tmp_f(A, B) = ... shorthand doesn't change anything substantial. Using the do-block syntax for runStencil instead of the -> lambda syntax also doesn't.

There are really two problems here: making embed work again (whether it uses accelerate or not, I don't care), and figuring out what's going wrong with accelerate. I've been focusing on the latter, but I'm about ready to just give up and see if we can make embed work just using @acc.

lkuper commented 8 years ago

Trying again using @acc: if the contents of userimg.jl are

Base.reinit_stdio()

include("/home/lkuper/.julia/v0.4/ParallelAccelerator/src/ParallelAccelerator.jl")

using ParallelAccelerator

@acc tmp_f(A,B) = begin runStencil((a, b) -> a[0,0] = b[0,0], A, B, 1, :oob_skip); A.*B.+2 end

tmp_f([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])

then embed() seems to run fine, but if I quit and run my recompiled Julia again, I get results like this for black-scholes, for example:

iterations = 10000000
SELFPRIMED 16.835934927
checksum: 2.0954821257116848e8
rate = 1.6082537124446222e8 opts/sec
SELFTIMED 0.062179244

which show that it's not working (SELFPRIMED should be under a second).
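
(For context, SELFPRIMED and SELFTIMED in that output come from a measurement pattern roughly like the sketch below; this is not the actual black-scholes driver, just the shape of it. The first timed call pays ParallelAccelerator's compilation cost, which is why SELFPRIMED should be tiny when that work is already baked into the system image.)

kernel(x) = x .* 2.0                            # stand-in for the accelerated kernel

tic(); kernel([1.0]); primed = toq()            # first call: includes compilation overhead
tic(); kernel(rand(10000000)); timed = toq()    # later call: steady-state run time
println("SELFPRIMED ", primed)
println("SELFTIMED ", timed)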

Also, if I try to include, say, black-scholes.jl in the same REPL session right after the embed call, then I get various warnings and errors to do with precompilation.

lkuper commented 8 years ago

Further note: the userimg.jl is certainly running, because tmp_f is defined in a fresh REPL session, and furthermore, it runs with no compilation pause. However, @acc isn't defined until we run using ParallelAccelerator.
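
(In other words, in a fresh REPL something like this is the state of things; a hypothetical check rather than a transcript:)

isdefined(:tmp_f)           # true: the sysimage build definitely ran userimg.jl
using ParallelAccelerator   # still required before @acc is usable at the REPL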

lkuper commented 8 years ago

@ehsantn @ninegua Any ideas what to try here?

lkuper commented 8 years ago

@JeffBezanson Following up on our discussion yesterday: If I just put using ParallelAccelerator in userimg.jl, and the lines

using ParallelAccelerator

@acc tmp_f(A,B) = begin runStencil((a, b) -> a[0,0] = b[0,0], A, B, 1, :oob_skip); A.*B.+2 end

tmp_f([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])

in the ParallelAccelerator.jl file itself right after the ParallelAccelerator module definition, then Julia compile time seems to take about 4 seconds longer when the userimg.jl file is present than when it is not. Over a couple of runs:

Without: real 2m11.072s user 2m9.978s sys 0m2.210s

With: real 2m15.581s user 2m14.409s sys 0m2.346s

Without: real 2m11.336s user 2m10.254s sys 0m2.238s

With: real 2m16.904s user 2m15.831s sys 0m2.170s

However, this doesn't actually seem to help:

julia> using ParallelAccelerator

julia> @acc tmp_f(A,B) = begin runStencil((a, b) -> a[0,0] = b[0,0], A, B, 1, :oob_skip); A.*B.+2 end
tmp_f (generic function with 1 method)

julia> @time tmp_f([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
 17.113510 seconds (25.17 M allocations: 1.146 GB, 1.22% gc time)
4-element Array{Float64,1}:
  3.0
  6.0
 11.0
 18.0

julia> @time tmp_f([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
  0.001106 seconds (60 allocations: 2.422 KB)
4-element Array{Float64,1}:
  3.0
  6.0
 11.0
 18.0

and I wouldn't expect it to, since 4 seconds is probably not enough time to compile ParallelAccelerator (we'd expect it to take more like 20 seconds).

If I change the contents of userimg.jl to

include("/home/lkuper/.julia/v0.4/ParallelAccelerator/src/ParallelAccelerator.jl")

then that does seem to make a bigger difference to compile time:

real 2m19.778s user 2m18.697s sys 0m2.218s

and so I was hopeful, but nope, calling tmp_f is still slow on the first run.

What is interesting is that if I build a Julia that doesn't have the package pre-included, then running using ParallelAccelerator at the REPL is slow, as we'd expect because of the tmp_f stuff now in the actual file. If I have a Julia that does pre-include the package, then using ParallelAccelerator is instantaneous but actually calling an accelerated function is slow the first time. If those are my two options, then I guess I want the former, because I'd rather have running using ParallelAccelerator take 20 seconds than have the first call to an accelerated function be slow and have users suspect that ParallelAccelerator is making their code slower. So my plan for now is to just leave things as they are and stop encouraging people to use the embed functionality.

The order in which things are being computed is relevant here. Is it the fact that @acc is a macro that causes this to not work?

lkuper commented 8 years ago

https://github.com/IntelLabs/ParallelAccelerator.jl/commit/3c1f8a862f90099f56241e50925c01bfaf1ce56d moves the code that defines and runs tmp_f into the ParallelAccelerator.jl package file itself. This means that the delay is actually at package load time (at the time that using ParallelAccelerator is run). Using embed (which now just inserts using ParallelAccelerator into userimg.jl) will make using ParallelAccelerator instantaneous but just puts off the long delay until the first time the function is called. That, to me, is actually inferior to just having the delay be at package load time. So I updated the docs to reflect that we don't recommend that most users use embed. I think we can close this issue now.

lkuper commented 8 years ago

The previous approach caused problems in distributed mode, so #69 tries going back to the original accelerate-based approach with some tweaks. Going to close this for now, since #69 seems to fix it, modulo the issue discussed in that PR -- which hopefully won't be a problem for most people.