brian-j-smith / Mamba.jl

Markov chain Monte Carlo (MCMC) for Bayesian analysis in julia

parallel MCMC on 0.5 #109

Open tpapp opened 7 years ago

tpapp commented 7 years ago

Currently pmap seems to be disabled for v"0.5.0". I am wondering if there is a workaround, e.g. running the chains separately and then somehow merging. I usually run 3–5 chains, so the speed gains would be significant.
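The merging idea could be sketched roughly as below, assuming Mamba's mcmc accepts a chains keyword and that Chains objects can be concatenated along the chain dimension with cat (both worth verifying against the current API; model, inputs, and inits are placeholders):

  # Run each chain sequentially as its own single-chain simulation,
  # then merge the resulting Chains objects along dimension 3 (chains).
  sims = [mcmc(model, inputs, inits[k:k], 10000, burnin=2500, thin=2, chains=1)
          for k in 1:3]
  sim = cat(3, sims...)  # one combined 3-chain object (unverified API)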

bdeonovic commented 7 years ago

Issue is with Julia's pmap, we are waiting on some updates to it: https://github.com/JuliaLang/julia/issues/19000

amitmurthy commented 7 years ago

What exactly is the issue? While the keyword arguments to pmap have changed, the behavior has not.

bdeonovic commented 7 years ago

I think @brian-j-smith will be better able to answer this question, but if I recall correctly, the issue was that not all anonymous functions were being sent out to the workers.

amitmurthy commented 7 years ago

AFAIK the issue is with capturing globals in the anonymous function, which existed prior to 0.5 too. A workaround for that issue is to wrap the closure creation in a let block:

julia> a=1
1

julia> remotecall_fetch(()->a, 2)
ERROR: On worker 2:
UndefVarError: a not defined
 in #1 at ./REPL[3]:1
.....

julia> let a=a; remotecall_fetch(()->a, 2); end
1
brian-j-smith commented 7 years ago

The pmap errors we are getting now in 0.5 do not occur in 0.4. Below is the 0.5 error, for what it's worth. When I get some free time, I'll try to come up with a stand-alone test case that reproduces the error.

ERROR: LoadError: On worker 2:
UndefVarError: ##298#299 not defined
 in deserialize_datatype at .\serialize.jl:823
 in handle_deserialize at .\serialize.jl:571
 in deserialize at .\serialize.jl:881
 in deserialize_datatype at .\serialize.jl:835
 in handle_deserialize at .\serialize.jl:571
 in deserialize at .\serialize.jl:894
 in deserialize_datatype at .\serialize.jl:835
 in handle_deserialize at .\serialize.jl:571
 in deserialize at .\serialize.jl:881
 in deserialize_datatype at .\serialize.jl:835
 in handle_deserialize at .\serialize.jl:571
 in deserialize_array at .\serialize.jl:709
 in handle_deserialize at .\serialize.jl:569
 in deserialize at .\serialize.jl:541
 in ntuple at .\tuple.jl:65
 in handle_deserialize at .\serialize.jl:562
 in deserialize_msg at .\multi.jl:120
 in message_handler_loop at .\multi.jl:1317
 in process_tcp_streams at .\multi.jl:1276
 in #620 at .\event.jl:68
 in #remotecall_fetch#608(::Array{Any,1}, ::Function, ::Function, ::Base.Worker, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\multi.jl:1070
 in remotecall_fetch(::Function, ::Base.Worker, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\multi.jl:1062
 in #remotecall_fetch#611(::Array{Any,1}, ::Function, ::Function, ::Int64, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\multi.jl:1080
 in remotecall_fetch(::Function, ::Int64, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\multi.jl:1080
 in #remotecall_pool#691(::Array{Any,1}, ::Function, ::Function, ::Function, ::WorkerPool, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:93
 in remotecall_pool(::Function, ::Function, ::WorkerPool, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:91
 in #remotecall_fetch#694(::Array{Any,1}, ::Function, ::Function, ::WorkerPool, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:124
 in remotecall_fetch(::Function, ::WorkerPool, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:124
 in (::Base.###699#700#702{WorkerPool,Mamba.#mcmc_worker!})(::Array{Any,1}, ::Function, ::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:151
 in (::Base.##699#701)(::Array{Any,1}, ::Vararg{Array{Any,1},N}) at .\workerpool.jl:151
 in macro expansion at .\asyncmap.jl:63 [inlined]
 in (::Base.##757#759{Base.AsyncCollector,Base.AsyncCollectorState})() at .\task.jl:360
amitmurthy commented 7 years ago

This is the reduced case:

julia> c = x->x
(anonymous function)

julia> d = x->c(x)
(anonymous function)

On 0.4:

julia> pmap(d, 1:2)
2-element Array{Any,1}:
 1
 2

On 0.5:

julia> pmap(d, 1:2)
ERROR: On worker 2:
UndefVarError: c not defined
 in #3 at ./REPL[2]:1
amitmurthy commented 7 years ago

Ref: https://github.com/JuliaLang/julia/issues/19456

brian-j-smith commented 7 years ago

Great! Thanks for coming up with a reduced case. I'll keep an eye on your other thread and possible developments in julia to address this. A work-around on our package side may be challenging since some anonymous functions are user-supplied.

bdeonovic commented 7 years ago

@amitmurthy still having pmap issues in v0.6

julia> include("../deonovib\\.julia/v0.6\\Mamba\\doc/examples/rats.jl")
MCMC Simulation of 10000 Iterations x 2 Chains...

ERROR: LoadError: On worker 2:
UndefVarError: ##332#333 not defined
deserialize_datatype at .\serialize.jl:969
handle_deserialize at .\serialize.jl:674
deserialize at .\serialize.jl:634
handle_deserialize at .\serialize.jl:681
deserialize at .\serialize.jl:1075
handle_deserialize at .\serialize.jl:687
deserialize at .\serialize.jl:1088
handle_deserialize at .\serialize.jl:687
deserialize at .\serialize.jl:1075
handle_deserialize at .\serialize.jl:687
deserialize_array at .\serialize.jl:871
handle_deserialize at .\serialize.jl:672
deserialize at .\serialize.jl:634
ntuple at .\tuple.jl:108
handle_deserialize at .\serialize.jl:664
deserialize_msg at .\distributed\messages.jl:98
message_handler_loop at .\distributed\process_messages.jl:161
process_tcp_streams at .\distributed\process_messages.jl:118
#99 at .\event.jl:73
Stacktrace:
 [1] #570 at .\asyncmap.jl:178 [inlined]
 [2] foreach(::Base.##570#572, ::Array{Any,1}) at .\abstractarray.jl:1731
 [3] maptwice(::Function, ::Channel{Any}, ::Array{Any,1}, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\asyncmap.jl:178
 [4] wrap_n_exec_twice(::Channel{Any}, ::Array{Any,1}, ::Base.Distributed.##204#207{WorkerPool}, ::Function, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\asyncmap.jl:154
 [5] #async_usemap#555(::Function, ::Void, ::Function, ::Base.Distributed.##188#190, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\asyncmap.jl:103
 [6] (::Base.#kw##async_usemap)(::Array{Any,1}, ::Base.#async_usemap, ::Function, ::Array{Array{Any,1},1}, ::Vararg{Array{Array{Any,1},1},N} where N) at .\<missing>:0
 [7] (::Base.#kw##asyncmap)(::Array{Any,1}, ::Base.#asyncmap, ::Function, ::Array{Array{Any,1},1}) at .\<missing>:0
 [8] #pmap#203(::Bool, ::Int64, ::Void, ::Array{Any,1}, ::Void, ::Function, ::WorkerPool, ::Function, ::Array{Array{Any,1},1}) at .\distributed\pmap.jl:126
 [9] pmap(::WorkerPool, ::Function, ::Array{Array{Any,1},1}) at .\distributed\pmap.jl:101
 [10] #pmap#213(::Array{Any,1}, ::Function, ::Function, ::Array{Array{Any,1},1}) at .\distributed\pmap.jl:156
 [11] pmap2(::Function, ::Array{Array{Any,1},1}) at C:\Users\deonovib\.julia\v0.6\Mamba\src\utils.jl:97
 [12] mcmc_master!(::Mamba.Model, ::UnitRange{Int64}, ::Int64, ::Int64, ::UnitRange{Int64}, ::Bool) at C:\Users\deonovib\.julia\v0.6\Mamba\src\model\mcmc.jl:52
 [13] #mcmc#29(::Int64, ::Int64, ::Int64, ::Bool, ::Function, ::Mamba.Model, ::Dict{Symbol,Any}, ::Array{Dict{Symbol,Any},1}, ::Int64) at C:\Users\deonovib\.julia\v0.6\Mamba\src\model\mcmc.jl:32
 [14] (::Mamba.#kw##mcmc)(::Array{Any,1}, ::Mamba.#mcmc, ::Mamba.Model, ::Dict{Symbol,Any}, ::Array{Dict{Symbol,Any},1}, ::Int64) at .\<missing>:0
 [15] include_from_node1(::String) at .\loading.jl:569
 [16] include(::String) at .\sysimg.jl:14
while loading C:\Users\deonovib\.julia\v0.6\Mamba\doc\examples\rats.jl, in expression starting on line 121
amitmurthy commented 7 years ago

Does your code serialize the same anonymous function across multiple workers? i.e., it is initially defined on wrkr_1 which sends it to wrkr_2 which sends it to wrkr_3 and so on (see code example https://github.com/JuliaLang/julia/pull/22675#issuecomment-313000815). If so, this should be fixed by https://github.com/JuliaLang/julia/pull/22675 which is backport pending onto 0.6.
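The forwarding scenario described above could be sketched as follows (Julia 0.6 syntax; a hypothetical minimal example, not Mamba code):

  # f is defined on the master; sending the do-block closure to worker 2
  # serializes f there, and worker 2 then re-serializes it on to worker 3.
  f = () -> 42
  remotecall_fetch(2) do
      remotecall_fetch(g -> g(), 3, f)
  end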

bdeonovic commented 7 years ago

I don't think that is the issue. The relevant code seems like a straightforward use of pmap:

  lsts = [
    Any[m, states[k], window, burnin, thin, ChainProgress(frame, k, N)]
    for k in chains
  ]
  results = pmap(mcmc_worker!, lsts)

where m is an instance of a type that contains anonymous functions. I think m is the real issue. It is:

  type Model
    nodes::Dict{Symbol, Any}
    samplers::Vector{Sampler}
    states::Vector{ModelState}
    iter::Int
    burnin::Int
    hasinputs::Bool
    hasinits::Bool
  end

and Sampler is

  type Sampler{T}
    params::Vector{Symbol}
    eval::Function
    tune::T
    targets::Vector{Symbol}
  end

where a Sampler might look something like:

function RWM{T<:Real}(params::ElementOrVector{Symbol},
                      scale::ElementOrVector{T}; args...)
  samplerfx = function(model::Model, block::Integer)
    block = SamplingBlock(model, block, true)
    v = SamplerVariate(block, scale; args...)
    sample!(v, x -> logpdf!(block, x))
    relist(block, v)
  end
  Sampler(params, samplerfx, RWMTune())
end

So I think we are having trouble passing m to workers because it has quite a heavy nesting of types, named functions, and anonymous functions.

amitmurthy commented 7 years ago

Is there a sample code that I can test with locally? I assume all dependent packages are publicly available?

bdeonovic commented 7 years ago

Currently we have it hacked so that only sequential processing is possible. Simply Pkg.add("Mamba") and then modify the file .julia/v0.6/Mamba/src/utils.jl by commenting out the version check:

...
  if (nprocs() > 1) #& (VERSION < v"0.5-")
...

Run any of the examples that utilize multiple chains and you will get the error.

e.g.

 include(".julia/v0.6/Mamba/doc/examples/rats.jl")

or after doing the include you can also replicate error with

remotecall_fetch(fieldnames, 2, model)

which is a little more direct and shows that the issue is sending model out to the different workers.
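If useful, the Distributed layer can also be taken out of the picture entirely by round-tripping the model through Base serialization in a single process (Julia 0.6; a debugging sketch, not Mamba code):

  # A local round-trip succeeds because the eval'd anonymous function
  # exists in this process's Mamba module; the same bytes fail to
  # deserialize on a worker, where that function instance was never defined.
  io = IOBuffer()
  serialize(io, model)
  seekstart(io)
  deserialize(io)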

amitmurthy commented 7 years ago

On OSX I had trouble installing Mamba (Julia 0.6): Gadfly-related errors, which persist even with a checkout of Gadfly master. Will try again after a few days.

bdeonovic commented 7 years ago

What was the error? Seems strange, since Travis shows it installing and all tests passing on OSX: https://travis-ci.org/brian-j-smith/Mamba.jl

amitmurthy commented 7 years ago

Fails with

WARNING: could not import Multimedia.@try_display into Gadfly
ERROR: LoadError: UndefVarError: @try_display not defined
Stacktrace:
 [1] include_from_node1(::String) at ./loading.jl:569
 [2] include(::String) at ./sysimg.jl:14
 [3] anonymous at ./<missing>:2
while loading /Users/amitm/.julia/v0.6/Gadfly/src/Gadfly.jl, in expression starting on line 1052
ERROR: LoadError: Failed to precompile Gadfly to /Users/amitm/.julia/lib/v0.6/Gadfly.ji.
Stacktrace:
 [1] compilecache(::String) at ./loading.jl:703
 [2] _require(::Symbol) at ./loading.jl:490
amitmurthy commented 7 years ago

I'll delete the v0.6 packages directory and try again.

amitmurthy commented 7 years ago

The immediate culprit appears to be model.nodes[:s2_alpha].eval with

julia> typeof(model.nodes[:s2_alpha].eval)
Mamba.##338#339

It is not being serialized as it is defined under Mamba. Where and how is it being initialized?

amitmurthy commented 7 years ago
julia> addprocs(2)
julia> @everywhere module Foo
           type Bar
               x
               Bar() = new(()->1) 
           end
       end

WARNING: deprecated syntax "type" at REPL[2]:2.
Use "mutable struct" instead.

julia> bar = Foo.Bar()
Foo.Bar(Foo.#1)

julia> remotecall_fetch(typeof, 2, bar.x)
Foo.##1#2

However, if the anonymous function is explicitly created under module Foo via eval:

julia> bar.x = eval(Foo, :(()->1))
(::#3) (generic function with 1 method)

julia> remotecall_fetch(typeof, 2, bar.x)
ERROR: On worker 2:
UndefVarError: ##3#4 not defined
deserialize_datatype at ./serialize.jl:986
handle_deserialize at ./serialize.jl:687
deserialize at ./serialize.jl:647
amitmurthy commented 7 years ago
julia> addprocs(2);

julia> @everywhere module Foo
           mutable struct Bar
               x
               Bar() = new(()->1) 
           end

           make_bar() = (b=Bar(); b.x=()->1; b)
       end;

julia> bar = Foo.Bar();

julia> typeof(bar.x)
Foo.##1#2

julia> remotecall_fetch(typeof, 2, bar.x); # OK

julia> bar = Foo.make_bar();

julia> typeof(bar.x)
Foo.##3#4

julia> remotecall_fetch(typeof, 2, bar.x); # OK

julia> bar.x = eval(Foo, :(()->1));

julia> typeof(bar.x)
Foo.##5#6

julia> remotecall_fetch(typeof, 2, bar.x); # Not OK
ERROR: On worker 2:
UndefVarError: ##5#6 not defined
deserialize_datatype at ./serialize.jl:986
handle_deserialize at ./serialize.jl:687

Only when the anonymous function is created via a runtime eval into Foo does there seem to be a problem.

bdeonovic commented 7 years ago

Thanks for the extensive analysis, @amitmurthy. What is the next step? Is this something that needs to be addressed in our package, or is it something on the julialang side that could be addressed?

amitmurthy commented 7 years ago

Both I think.

You can check if you can avoid an eval under module Mamba. I'll open an issue on JuliaLang for addressing this.

amitmurthy commented 7 years ago

You can follow the discussion on the linked issue above.

Meanwhile, I could run rats.jl in parallel on 0.6 by eval'ing into Main.

diff --git a/src/utils.jl b/src/utils.jl
index 438dbd3..aee2a00 100644
--- a/src/utils.jl
+++ b/src/utils.jl
@@ -7,7 +7,7 @@ end
 function modelfxsrc(literalargs::Vector{Tuple{Symbol, DataType}}, f::Function)
   args = Expr(:tuple, map(arg -> Expr(:(::), arg[1], arg[2]), literalargs)...)
   expr, src = modelexprsrc(f, literalargs)
-  fx = eval(Expr(:function, args, expr))
+  fx = eval(Main, Expr(:function, args, expr))
   (fx, src)
 end

@@ -92,7 +92,7 @@ end
 ## called and will apply its error processing.

 function pmap2(f::Function, lsts::AbstractArray)
-  if (nprocs() > 1) & (VERSION < v"0.5-")
+  if (nprocs() > 1) #& (VERSION < v"0.5-")
     @everywhere importall Mamba
     pmap(f, lsts)
   else
diff --git a/src/variate.jl b/src/variate.jl
index 8875ad0..1c6b91c 100644
--- a/src/variate.jl
+++ b/src/variate.jl
@@ -112,7 +112,7 @@ const BinaryScalarMethods = [
 ]

 for op in BinaryScalarMethods
-  @eval ($op)(x::ScalarVariate, y::ScalarVariate) = ($op)(x.value, y.value)
+  @eval Main ($op)(x::Mamba.ScalarVariate, y::Mamba.ScalarVariate) = ($op)(x.value, y.value)
 end

 const RoundScalarMethods = [
@@ -123,8 +123,8 @@ const RoundScalarMethods = [
 ]

 for op in RoundScalarMethods
-  @eval ($op)(x::ScalarVariate) = ($op)(x.value)
-  @eval ($op){T}(::Type{T}, x::ScalarVariate) = ($op)(T, x.value)
+  @eval Main ($op)(x::Mamba.ScalarVariate) = ($op)(x.value)
+  @eval Main ($op){T}(::Type{T}, x::Mamba.ScalarVariate) = ($op)(T, x.value)
 end

 const UnaryScalarMethods = [
@@ -142,5 +142,5 @@ const UnaryScalarMethods = [
 ]

 for op in UnaryScalarMethods
-  @eval ($op)(x::ScalarVariate) = ($op)(x.value)
+  @eval Main ($op)(x::Mamba.ScalarVariate) = ($op)(x.value)
 end