SciML / DifferentialEquations.jl

Multi-language suite for high-performance solvers of differential equations and scientific machine learning (SciML) components. Ordinary differential equations (ODEs), stochastic differential equations (SDEs), delay differential equations (DDEs), differential-algebraic equations (DAEs), and more in Julia.
https://docs.sciml.ai/DiffEqDocs/stable/

parallel execution on local machine (threads vs processes) #156

Closed AshtonSBradley closed 7 years ago

AshtonSBradley commented 7 years ago

I want to run this code on a local machine (dual core) and make sure I am getting proper parallel speedup before deploying on a small cluster. I am seeing some interesting timing behaviour and want to know the best strategy for parallelism here. The @everywhere calls are my attempt to make it parallel. Some timing comments are at the end:

addprocs(1)
@everywhere using DifferentialEquations

f = @ode_def_bare Drift begin
  dα₁ = -(γ + 2*im*χ*α₁*ᾱ₁)α₁ + im*J*α₂
  dᾱ₁ = -(γ - 2*im*χ*ᾱ₁*α₁)ᾱ₁ - im*J*ᾱ₂
  dα₂ =       ϵ - 2*im*χ*α₂^2*ᾱ₂ + im*J*(α₁ + α₃)
  dᾱ₂ = conj(ϵ) + 2*im*χ*ᾱ₂^2*α₂ - im*J*(ᾱ₁ + ᾱ₃)
  dα₃ = -(γ + 2*im*χ*α₃*ᾱ₃)α₃ + im*J*α₂
  dᾱ₃ = -(γ - 2*im*χ*ᾱ₃*α₃)ᾱ₃ - im*J*ᾱ₂    
    end γ=>1. χ=>.001 ϵ=>10. J=1. 

@everywhere g = @ode_def_bare Diffusion begin
  dα₁ = sqrt(-2*im*χ*α₁^2)
  dᾱ₁ = sqrt(-2*im*χ*ᾱ₁^2)
  dα₂ = sqrt(-2*im*χ*α₂^2)
  dᾱ₂ = sqrt(-2*im*χ*ᾱ₂^2)
  dα₃ = sqrt(-2*im*χ*α₃^2)
  dᾱ₃ = sqrt(-2*im*χ*ᾱ₃^2)  
    end χ=>.001

@everywhere @inline function my_noise_function!(rand_vec,integrator)
  DiffEqBase.white_noise_func_wrapper!(rand_vec,integrator)
    rand_vec = randn(length(rand_vec))
end
@everywhere my_noise = NoiseProcess{:White,true,typeof(my_noise_function!)}(my_noise_function!)

@everywhere function boser(Ntraj::Int64)
ti = 0.
tf = 20.
Nt = 100
tspan = (ti,tf)
t = collect(linspace(ti,tf,Nt))
N0 = 40
a0 = sqrt(N0/2)*(randn(6)+im*randn(6))
a0[2] = conj(a0[1]);a0[4]=conj(a0[3]);a0[6]=conj(a0[5]);
prob = SDEProblem(f,g,a0,tspan,noise=my_noise)
#alg = EM()
alg = EulerHeun()
dt = 1/100
monte_prob = MonteCarloProblem(prob) #,prob_func=prob_func)
return solve(monte_prob,alg,dt=dt,saveat=t,save_everystep=false,dense=false,num_monte=Ntraj)
end

boser(1)
@everywhere Ntraj = 1000

sol = boser(Ntraj)
#runs a single trajectory in about 0.014381062 secs without parallelism
#runs in about 10 seconds without parallelism
#runs in about 20 seconds with parallelism on local machine (using @everywhere)

(error in what I posted initially: @everywhere was not used in the final sol = ... call; timings unchanged)

ChrisRackauckas commented 7 years ago

The way that you have this code, you are running the full simulation twice, and it will pmap it to 4 processes and overload two. Instead you should just do sol = boser(Ntraj). The solve command for Monte Carlo problems has parallelism inside of it, so you don't want to @everywhere the call to it. Instead, you want the master process to initiate it and then let the workers do it.

Also, the number of processes added is the number of worker processes. So for a dual core machine, you want at least 2 worker processes. But the idea that the number of processes should match the number of cores is a myth propagated by people who don't really understand modern computing. In reality, for most algorithms you probably want more, because there are lags during which extra processes can compute while other processes are managing memory buses. If you check, you'll usually find you want more. This blog post I wrote is about almost the same problem:

http://www.stochasticlifestyle.com/236-2/

So the actual code should be something like:

addprocs(2) # Probably more is better, need to benchmark
@everywhere using DifferentialEquations

@everywhere f = @ode_def_bare Drift begin
  dα₁ = -(γ + 2*im*χ*α₁*ᾱ₁)α₁ + im*J*α₂
  dᾱ₁ = -(γ - 2*im*χ*ᾱ₁*α₁)ᾱ₁ - im*J*ᾱ₂
  dα₂ =       ϵ - 2*im*χ*α₂^2*ᾱ₂ + im*J*(α₁ + α₃)
  dᾱ₂ = conj(ϵ) + 2*im*χ*ᾱ₂^2*α₂ - im*J*(ᾱ₁ + ᾱ₃)
  dα₃ = -(γ + 2*im*χ*α₃*ᾱ₃)α₃ + im*J*α₂
  dᾱ₃ = -(γ - 2*im*χ*ᾱ₃*α₃)ᾱ₃ - im*J*ᾱ₂    
    end γ=>1. χ=>.001 ϵ=>10. J=1. 

@everywhere g = @ode_def_bare Diffusion begin
  dα₁ = sqrt(-2*im*χ*α₁^2)
  dᾱ₁ = sqrt(-2*im*χ*ᾱ₁^2)
  dα₂ = sqrt(-2*im*χ*α₂^2)
  dᾱ₂ = sqrt(-2*im*χ*ᾱ₂^2)
  dα₃ = sqrt(-2*im*χ*α₃^2)
  dᾱ₃ = sqrt(-2*im*χ*ᾱ₃^2)  
    end χ=>.001

@everywhere @inline function my_noise_function!(rand_vec,integrator)
  for i in eachindex(rand_vec)
    rand_vec[i] = randn()
  end
end
@everywhere my_noise = NoiseProcess{:White,true,typeof(my_noise_function!)}(my_noise_function!)

@everywhere function boser(Ntraj::Int64)
ti = 0.
tf = 20.
Nt = 100
tspan = (ti,tf)
t = collect(linspace(ti,tf,Nt))
N0 = 40
a0 = sqrt(N0/2)*(randn(6)+im*randn(6))
a0[2] = conj(a0[1]);a0[4]=conj(a0[3]);a0[6]=conj(a0[5]);
prob = SDEProblem(f,g,a0,tspan,noise=my_noise)
#alg = EM()
alg = EulerHeun()
dt = 1/100
monte_prob = MonteCarloProblem(prob) #,prob_func=prob_func)
return solve(monte_prob,alg,dt=dt,saveat=t,save_everystep=false,dense=false,num_monte=Ntraj)
end

@time @everywhere boser(1) # precompile everywhere
@time boser(1)

Ntraj = 1000
@time sol = boser(Ntraj)
AshtonSBradley commented 7 years ago

sorry, I was collecting the code from a comparison that involved parallel and non-parallel runs and added an extra @everywhere in the final call. So I wasn't using @everywhere in the call to solve, and I'm still getting the ~20s timing. Interesting about the number of processes though; I'll try more processes and see what happens. Sounds like you have written the code to scale very cleanly, it's very cool.

ChrisRackauckas commented 7 years ago

I'm not getting 100% utilization from parallelism here. I think that the processes are finishing too quickly for pmap. It may be better to @parallel this.
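
Roughly, the tradeoff (an illustrative toy, not the Monte Carlo internals): pmap hands out one job at a time, which is great for long, uneven jobs but costly for thousands of very fast ones, whereas @parallel statically chunks the loop across the workers:

@everywhere cheap(i) = sum(rand(10))  # stand-in for a very fast trajectory

@time pmap(cheap, 1:10_000)           # one task handed out per job

@time @parallel (+) for i in 1:10_000 # statically chunked across the workers
    cheap(i)
end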

ChrisRackauckas commented 7 years ago
@everywhere @inline function my_noise_function!(rand_vec,integrator)
  DiffEqBase.white_noise_func_wrapper!(rand_vec,integrator)
    rand_vec = randn(length(rand_vec))
end

That function doesn't make any sense.

AshtonSBradley commented 7 years ago

how should it read? I took the example at #154 and rewrote it to make all noises real. Is there a simpler way? Do you mean

@everywhere @inline

is bad syntax?

ChrisRackauckas commented 7 years ago

rewrote it to make all noises real

@everywhere @inline function my_noise_function!(rand_vec,integrator)
randn!(rand_vec)
end
AshtonSBradley commented 7 years ago

ok, I do that, and now

On worker 2:
MethodError: no method matching randn(::MersenneTwister, ::Type{Complex{Float64}})
Closest candidates are:
  randn{T}(::AbstractRNG, ::Type{T}, ::Tuple{Vararg{Int64,N}}) at random.jl:1216
  randn{T}(::AbstractRNG, ::Type{T}, ::Integer, ::Integer...) at random.jl:1218
  randn(::AbstractRNG) at random.jl:1129

which seems like the original problem with @everywhere using DifferentialEquations, but I am still doing that

ChrisRackauckas commented 7 years ago

Oh my bad. I keep forgetting that randn! cannot handle complex numbers...

@everywhere @inline function my_noise_function!(rand_vec,integrator)
  for i in eachindex(rand_vec)
    rand_vec[i] = randn()
  end
end
AshtonSBradley commented 7 years ago

Ok, so what is the right way to @parallel this? Should I do @parallel sol = solve(...)?

ChrisRackauckas commented 7 years ago

Ok, so what is the right way to @parallel this?

That needs to be implemented in the Monte Carlo solve function. I just implemented parallel_type=:pmap, with other options :parfor and :threads, and am testing it all right now. It'll take a bit for this to be released though. I would only use these options if the problem you are solving is extremely quick.

ChrisRackauckas commented 7 years ago

I didn't realize how much you were using saveat. The other problem is that this might just be IO bound.

AshtonSBradley commented 7 years ago

Ok interesting. Btw, in EulerHeun is there a default number of iterates used for the implicit part?

ChrisRackauckas commented 7 years ago

Btw, in EulerHeun is there a default number of iterates used for the implicit part?

The EulerHeun method is just an Euler-Maruyama predictor which is then corrected with a trapezoid rule. I haven't ever seen it written in a way such that there is a variable number of predictor steps, but I guess I could add a parameter in the algorithm for that.
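
For reference, a minimal sketch of what a single EulerHeun step does (scalar Stratonovich SDE du = f(u)dt + g(u)∘dW; illustrative names, not the StochasticDiffEq.jl internals):

function euler_heun_step(f, g, u, dt)
    dW = sqrt(dt)*randn()
    ũ = u + f(u)*dt + g(u)*dW                           # Euler-Maruyama predictor
    return u + (f(u) + f(ũ))/2*dt + (g(u) + g(ũ))/2*dW  # trapezoidal corrector
end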

ChrisRackauckas commented 7 years ago

This is definitely getting IO bound. The parallelism advantage is much higher when nothing is being saved, so this is something that would scale to multiple nodes better. You may want to reconsider how much you're saving, or we should look into a pre-allocation method for Monte Carlo.

AshtonSBradley commented 7 years ago

I could do a reduction to only save ensemble averages at each of the tsave points (I really do need of order 100 saves). Can I do this using output_func(sol)? (And is such a reduction compatible with parallelism?)

ChrisRackauckas commented 7 years ago

No, it won't be able to do that. We can't do reductions like that, but that would be a good thing to have. That won't help here though since each run will still be allocating, thus still IO bound.

AshtonSBradley commented 7 years ago

Just to follow up on EulerHeun: I mistakenly assumed it was something like the method commonly used in cold atoms/quantum optics given below (the semi-implicit Euler (Milstein) method, Gardiner, Handbook of Stochastic Methods). As it says at the end of the page, "about three iterates are usually sufficient". Would this be equivalent to EulerHeun if one added a few iterations of the implicit solution? Any thoughts about this method vs EulerHeun?

[Screenshot: Gardiner's semi-implicit (Milstein) scheme, eq. 15.4.16]
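
In case the screenshot doesn't come through, here is a sketch of my reading of that scheme (scalar case, du = a(y)dt + b(y)dW in Stratonovich form; the names are mine):

function semi_implicit_step(a, b, y, dt; iters = 3)
    dW = sqrt(dt)*randn()
    ȳ = y
    for k in 1:iters                     # "about three iterates are usually sufficient"
        ȳ = y + (a(ȳ)*dt + b(ȳ)*dW)/2    # fixed-point iteration for the midpoint value
    end
    return 2ȳ - y                        # i.e. y + a(ȳ)*dt + b(ȳ)*dW at the fixed point
end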

ChrisRackauckas commented 7 years ago

No, that's different. EulerHeun is an explicit method. (There's a typo in 15.4.16: it's ȳ_n on the LHS, right?) The method in the screenshot is a semi-implicit Milstein method, and is probably better when the drift equation is stiff (to some degree).

Yes, we need some semi-implicit methods for SDEs. I have some ongoing research into higher order ones for Ito SDEs with diagonal noise, but haven't done much on the implementation side yet. I was making sure the ODE stuff had it all down first, to then use the same setup for SDEs.

ChrisRackauckas commented 7 years ago

Yeah, I realized you can't pre-allocate your way out of the IO bound on this type of Monte Carlo problem, since no matter what, every trajectory needs to store those 100 values. It probably needs more benchmarking, but I think multiple processes allocating like that are going to hit a wall not at the calculation but at the allocations. I think you can still get speedups, but I don't think you'll get close to optimal speedup in a case like this.

AshtonSBradley commented 7 years ago

ok good, that is clear (on both points - you're correct about the typo). Some degree of stiffness is fairly common in these kinds of problems. This Milstein method for Stratonovich would be great to have... it has good stability properties.

For this kind of problem the run time is not too long anyway (a couple of hours for 10^6 trajectories), although more speed is always good. Other kinds of problems that are common (and could be IO bound) involve propagating an array that takes all available memory (say a three-dimensional field SDE) and then needs ~100 saves. In that case, though, I think the work done to integrate becomes the bottleneck, even though the IO load may be high. The write-to-file feature starts to become essential for this kind of problem, and it affects the bulk of the applications I have in mind (#151).

ChrisRackauckas commented 7 years ago

This Milstein method for Stratonovich would be great to have... it has good stability properties.

First the explicit one: https://github.com/JuliaDiffEq/StochasticDiffEq.jl/issues/19

For this kind of problem the run time is not too long anyway (couple of hours for 10^6 trajectories), although more speed is always good. Other kinds of problems that are common (and could be IO bound) would involve propagating an array that takes all available memory (say a three-dimensional field SDE), and then needs ~100 saves. In that case though, I think the work done to integrate becomes the bottleneck, even though the IO load may be high. The write to file feature starts to become essential for this kind of problem.

But then you're wanting to parallelize across multiple nodes because of the memory requirements, right? That's actually okay. The IO problem arises because even though you can have multiple processes running at the same time, they (mostly) cannot allocate memory at the same time, so if you're saving a bunch of stuff in parallel you end up bound by the speed of saving. But if you're going to multiple nodes instead of more processes on the same computer, then you will get the expected speed increase. And large problems are usually compute bound, so I don't think we'll have this issue there.

Another issue might be the RNG. I haven't checked whether the RNG needs to be parallelized here. But RandomNumbers.jl has a bunch of different RNGs that you can set up with the NoiseProcess interface to see what happens.
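
For example, something along these lines might be worth trying (a sketch from memory; I'm assuming Xoroshiro128Plus lives in RandomNumbers.Xorshifts, so check the package docs for the exact names):

@everywhere using RandomNumbers.Xorshifts
@everywhere const myrng = Xoroshiro128Plus()   # one RNG per process (assumed constructor)

@everywhere @inline function my_noise_function!(rand_vec,integrator)
  for i in eachindex(rand_vec)
    rand_vec[i] = randn(myrng)                 # draw from the per-process RNG
  end
end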

If you could run ProfileView.jl and other profiling tools (I don't necessarily know what would be best here; probably just ask on the Discourse) to better identify where the bottleneck is, that would help.

AshtonSBradley commented 7 years ago

Here is the ProfileView result

profile_results.svg.zip

ChrisRackauckas commented 7 years ago

Hmm, not much we can do about it according to the profile. The vast majority of the time is compute time. We can probably find out how to ramp the processes up to 100%, but that's about it. I think large Monte Carlo runs of simple problems might just need to use threading, and on HPCs might need:

https://github.com/JuliaLang/julia/issues/17887

to be efficient.

ChrisRackauckas commented 7 years ago
@everywhere using DifferentialEquations

@everywhere f = @ode_def_bare Drift begin
  dα₁ = -(γ + 2*im*χ*α₁*ᾱ₁)α₁ + im*J*α₂
  dᾱ₁ = -(γ - 2*im*χ*ᾱ₁*α₁)ᾱ₁ - im*J*ᾱ₂
  dα₂ =       ϵ - 2*im*χ*α₂^2*ᾱ₂ + im*J*(α₁ + α₃)
  dᾱ₂ = conj(ϵ) + 2*im*χ*ᾱ₂^2*α₂ - im*J*(ᾱ₁ + ᾱ₃)
  dα₃ = -(γ + 2*im*χ*α₃*ᾱ₃)α₃ + im*J*α₂
  dᾱ₃ = -(γ - 2*im*χ*ᾱ₃*α₃)ᾱ₃ - im*J*ᾱ₂    
    end γ=>1. χ=>.001 ϵ=>10. J=1. 

@everywhere g = @ode_def_bare Diffusion begin
  dα₁ = sqrt(-2*im*χ*α₁^2)
  dᾱ₁ = sqrt(-2*im*χ*ᾱ₁^2)
  dα₂ = sqrt(-2*im*χ*α₂^2)
  dᾱ₂ = sqrt(-2*im*χ*ᾱ₂^2)
  dα₃ = sqrt(-2*im*χ*α₃^2)
  dᾱ₃ = sqrt(-2*im*χ*ᾱ₃^2)  
    end χ=>.001

@everywhere @inline function my_noise_function!(rand_vec,integrator)
  for i in eachindex(rand_vec)
    rand_vec[i] = randn()
  end
end
@everywhere my_noise = NoiseProcess{:White,true,typeof(my_noise_function!)}(my_noise_function!)

@everywhere function boser(Ntraj;parallel_type=:none)
  ti = 0.
  tf = 20.
  Nt = 100
  tspan = (ti,tf)
  t = collect(linspace(ti,tf,Nt))
  N0 = 40
  a0 = sqrt(N0/2)*(randn(6)+im*randn(6))
  a0[2] = conj(a0[1]);a0[4]=conj(a0[3]);a0[6]=conj(a0[5]);
  prob = SDEProblem(f,g,a0,tspan,noise=my_noise)
  #alg = EM()
  alg = EulerHeun()
  dt = 1/100
  monte_prob = MonteCarloProblem(prob) #,prob_func=prob_func)
  solve(monte_prob,alg,dt=dt,saveat=t,save_everystep=false,dense=false,num_monte=Ntraj,parallel_type=parallel_type)
end

@time @everywhere boser(1) # precompile everywhere
@time boser(1)

Ntraj = 1000
@time sol = boser(Ntraj,parallel_type=:none)
# For threads, need to start julia with `export JULIA_NUM_THREADS=##` from the terminal
# Check thread count with Threads.nthreads()
@time sol = boser(Ntraj,parallel_type=:threads)
@time sol = boser(Ntraj,parallel_type=:split_threads)
@time sol = boser(Ntraj,parallel_type=:pmap)
@time sol = boser(Ntraj,parallel_type=:parfor)
ChrisRackauckas commented 7 years ago

With threading I go from 4.77 seconds down to 1.5-2.04 seconds on a 4 core computer, and similar on a beefy 16 core computer. So this shows that it was indeed the parallelism overhead and the IO bound that were causing the slowdown: the trajectories were just too fast for multiprocessing. It even gets too fast for a large number of threads, and the IO bound kicks in when you try to do this with 16 at a time.

Because of this use case, I made :split_threads, which splits the call into nprocs() evenly-sized groups and runs the threaded call on each group. This is good for putting one process on each node and scaling up the threaded version for these faster problems. When run locally it just gives the threading speedup.
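
Roughly the idea, using the boser function from above (an illustrative sketch, not the actual DiffEqMonteCarlo internals):

Ntraj = 1000
sizes = fill(div(Ntraj, nprocs()), nprocs())
sizes[end] += Ntraj - sum(sizes)                                     # remainder goes to the last group
partial_sols = pmap(n -> boser(n, parallel_type = :threads), sizes)  # one threaded group per process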

:parfor should do better than :pmap if the problem is really fast. :pmap is the go-to for long problems.

I'll document this and call it a done deal.

AshtonSBradley commented 7 years ago

I am on

julia> Pkg.status("DifferentialEquations")
 - DifferentialEquations         1.10.1+            master

On a quad-core machine, if I do

export JULIA_NUM_THREADS=8

and launch Julia, I get

julia> Threads.nthreads()
8

Then I run your code (copy and paste, running in a notebook). I get these timings [nprocs()=1]

6.808233 seconds (16.09 M allocations: 650.933 MB, 3.18% gc time)
0.004729 seconds (19.73 k allocations: 1009.289 KB)
5.225747 seconds (19.64 M allocations: 965.785 MB, 11.81% gc time)
5.705755 seconds (19.62 M allocations: 965.041 MB, 14.67% gc time)
5.079793 seconds (19.62 M allocations: 965.041 MB, 13.20% gc time)
5.203354 seconds (19.62 M allocations: 965.041 MB, 14.90% gc time)
5.266911 seconds (19.62 M allocations: 965.041 MB, 15.85% gc time)

If I then clear and restart, with addprocs(3) as the first line, I get these timings [nprocs()=4]

8.567897 seconds (9.56 M allocations: 390.850 MB, 2.27% gc time)
0.503988 seconds (16.96 k allocations: 902.648 KB)
5.138581 seconds (16.57 M allocations: 809.914 MB, 10.04% gc time)
5.155273 seconds (16.55 M allocations: 809.213 MB, 15.47% gc time)
5.291744 seconds (16.55 M allocations: 809.330 MB, 16.02% gc time)
5.313023 seconds (16.55 M allocations: 809.211 MB, 16.93% gc time)
5.272770 seconds (16.55 M allocations: 809.210 MB, 17.22% gc time)

Am I executing this correctly?

ChrisRackauckas commented 7 years ago

Oh, the threading hasn't been released yet. If you want to give it a try, Pkg.checkout("DiffEqMonteCarlo") for now. Remember to Pkg.free("DiffEqMonteCarlo") after you're done though.

AshtonSBradley commented 7 years ago

ok, that throws the error

MethodError: no method matching identity(::DiffEqBase.RODESolution{Array{Array{Complex{Float64},1},1},Void,Void,Array{Float64,1},StochasticDiffEq.LinearInterpolationData{Array{Array{Complex{Float64},1},1},Array{Float64,1}},Array{Array{Complex{Float64},1},1},DiffEqBase.SDEProblem{Array{Complex{Float64},1},Float64,true,true,:White,Drift,Diffusion,#my_noise_function!,DiffEqBase.CallbackSet{Tuple{},Tuple{}},Void},StochasticDiffEq.EulerHeun{StochasticDiffEq.RSWM{:RSwM1,Float64}}}, ::Int64)
Closest candidates are:
  identity(::Any) at operators.jl:113

regardless of threads or procs.

ChrisRackauckas commented 7 years ago

output_func was also changed: it now needs two arguments, since it takes in the trajectory index i. For now, to make it work without checking out more libraries, I think you can just do output_func = (sol,i) -> sol. The default isn't updated in your version of DiffEqBase.jl yet.

AshtonSBradley commented 7 years ago

ok, that worked with threading and without addprocs(). I replaced

#monte_prob = MonteCarloProblem(prob) #this, with:
monte_prob = MonteCarloProblem(prob::SDEProblem;
                  output_func = (sol,i) -> sol,
                  prob_func= (prob,i)->prob)

However, when I addprocs(3) and run it all again, it survives parallel_type=:none and parallel_type=:threads (with a decent speedup, the same as without addprocs()), but then parallel_type=:split_threads throws

ERROR (unhandled task failure): On worker 4:
UndefRefError: access to undefined reference
 in copy! at ./abstractarray.jl:559

followed by

On worker 4:
UndefRefError: access to undefined reference
 in copy! at ./abstractarray.jl:559
 in convert at ./array.jl:235
 in thread_monte at /Users/abradley/.julia/v0.5/DiffEqMonteCarlo/src/DiffEqMonteCarlo.jl:62

do I take it that split_threads doesn't make sense on the local machine?

ChrisRackauckas commented 7 years ago

do I take it that split_threads doesn't make sense on the local machine?

It doesn't make sense on a local machine. In fact, I would assume it would end up slower on a local machine, though I am not sure what that error is. You do need to make sure you precompile once before attempting to run the parallel versions (multithreading doesn't care). I think that might be the cause of the issue here? If you run it again, does the same thing happen?

AshtonSBradley commented 7 years ago

ok good, exactly as you predicted: it runs without error, and slows down from 2 seconds (:threads) to 26 seconds (:split_threads)

ChrisRackauckas commented 7 years ago

Yeah, that will have 4 processes spin up 4 threads each, and since threading already uses the cores efficiently, the extra processes just clash with each other and slow things down. But one process per node with :split_threads should do well.

AshtonSBradley commented 7 years ago

Mostly this seems to be working well. Strange behaviour: sometimes I get different numbers of trajectories returned (here I ask for 5000, and then try to calculate an average):

#trajectories are first index
#time is second index
#modes are final index

α₁(j) = sol.solution_data[j][:,1]
ᾱ₁(j) = sol.solution_data[j][:,2]
α₂(j) = sol.solution_data[j][:,3]
ᾱ₂(j) = sol.solution_data[j][:,4]

n1 = zeros(abs(α₁(1)))
n2 = zeros(n1)
for j in 1:Ntraj
    n1 += α₁(j).*ᾱ₁(j)
    n2 += α₂(j).*ᾱ₂(j)
end
n1 /= Ntraj;
n2 /= Ntraj;

throwing the error:

BoundsError: attempt to access 4999-element Array{DiffEqBase.RODESolution{Array{Array{Complex{Float64},1},1},Void,Void,Array{Float64,1},StochasticDiffEq.LinearInterpolationData{Array{Array{Complex{Float64},1},1},Array{Float64,1}},Array{Array{Complex{Float64},1},1},DiffEqBase.SDEProblem{Array{Complex{Float64},1},Float64,true,true,:White,Drift,Diffusion,#my_noise_function!,DiffEqBase.CallbackSet{Tuple{},Tuple{}},Void},StochasticDiffEq.SRIW1{StochasticDiffEq.RSWM{:RSwM3,Float64}}},1} at index [5000]

 in getindex(::Array{DiffEqBase.RODESolution{Array{Array{Complex{Float64},1},1},Void,Void,Array{Float64,1},StochasticDiffEq.LinearInterpolationData{Array{Array{Complex{Float64},1},1},Array{Float64,1}},Array{Array{Complex{Float64},1},1},DiffEqBase.SDEProblem{Array{Complex{Float64},1},Float64,true,true,:White,Drift,Diffusion,#my_noise_function!,DiffEqBase.CallbackSet{Tuple{},Tuple{}},Void},StochasticDiffEq.SRIW1{StochasticDiffEq.RSWM{:RSwM3,Float64}}},1}, ::Int64) at ./array.jl:386
 in α₁; at ./In[21]:6 [inlined]
 in macro expansion; at ./In[21]:14 [inlined]
 in anonymous at ./<missing>:?

It doesn't seem to matter which integrator I use (EM() and SRIW1() both do this). There might be a nicer syntax for computing these ensemble averages, but it seems like I am not getting the right number of trajectories. Does this happen if one hits an instability due to particular noises?

Then some other times I get this other error:

UndefRefError: access to undefined reference

 in copy!(::Base.LinearFast, ::Array{DiffEqBase.RODESolution{Array{Array{Complex{Float64},1},1},Void,Void,Array{Float64,1},StochasticDiffEq.LinearInterpolationData{Array{Array{Complex{Float64},1},1},Array{Float64,1}},Array{Array{Complex{Float64},1},1},DiffEqBase.SDEProblem{Array{Complex{Float64},1},Float64,true,true,:White,Drift,Diffusion,#my_noise_function!,DiffEqBase.CallbackSet{Tuple{},Tuple{}},Void},StochasticDiffEq.SRIW1{StochasticDiffEq.RSWM{:RSwM3,Float64}}},1}, ::Base.LinearFast, ::Array{Any,1}) at ./abstractarray.jl:559
ChrisRackauckas commented 7 years ago

Strange behaviour: sometimes I get different numbers of trajectories returned

I am not sure about that. That's a bug. Report that here?

https://github.com/JuliaDiffEq/DiffEqMonteCarlo.jl/issues

For what types of parallelism does that happen?

Does this happen if one hits an instability due to particular noises?

That would just result in a shorter sol trajectory, and you can check the sol.retcode for that.

There might be a nicer syntax for computing these ensemble averages

There's a mean extension in RecursiveArrayTools.jl. I think it may calculate the means for these easily. If not, it would be good to add some analysis functions like this in DiffEqMonteCarlo.
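
In the meantime, a stop-gap (a sketch only, reusing your α accessors): average over however many trajectories actually came back instead of assuming there are exactly Ntraj of them, and check each trajectory's retcode if you suspect unstable runs:

Nret = length(sol.solution_data)               # may be 4999 rather than 5000 because of the bug
n1 = sum(α₁(j).*ᾱ₁(j) for j in 1:Nret) ./ Nret
n2 = sum(α₂(j).*ᾱ₂(j) for j in 1:Nret) ./ Nret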

AshtonSBradley commented 7 years ago

it's happening for :threads

ChrisRackauckas commented 7 years ago

This is mixing topics now. Please give some feature requests to DiffEqMonteCarlo for analysis tools you'd like to see. It would be good to have some.