JuliaDynamics / DiscreteEvents.jl

Discrete event generation and simulation in Julia
MIT License

Improve parallel clocks #31

Closed: pbayer closed this issue 1 year ago

pbayer commented 4 years ago

Facilitate the use of parallel clocks:

This will greatly improve the speed of single- and multi-threaded simulations. It eliminates the need to use onthread to speed up task-based single-threaded simulations.
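For context, the intended usage pattern looks roughly like this (a minimal sketch using the PClock/Prc/process!/run! API that appears later in this thread; doit is a hypothetical process function):

using DiscreteEvents

# hypothetical process function: it only waits on its clock
doit(clk::Clock) = delay!(clk, 1.0)

clk = PClock()                      # parallel clock (Julia must be started with several threads)
process!(clk, Prc(doit), cid = 2)   # register the process on the active clock of thread 2
run!(clk, 10.0)                     # run all clocks for 10 time units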

pbayer commented 3 years ago

Unfortunately I cannot see the following comment by @SamRodkey here on GitHub;

its link should be https://github.com/pbayer/DiscreteEvents.jl/issues/31#issuecomment-814354724

Side note to the above: I tried avoiding parallel clocks altogether and converting my existing sim to an activity-based model, per the discussion in the benchmarks section, but that somehow made my code run even slower than the process-based approach. I was surprised by this result. I also tried using DiscreteEvents.onthread(2) to de-conflict with the Julia scheduler, but that didn't really give me much of a performance benefit either.

It seems like using processes has very poor performance beyond 10 or so processes all putting/taking from Channels. I'm not sure whether there are ways to improve the scaling of this? The process-based modeling approach made the most intuitive sense to me and also made my code come out super readable for others.

pbayer commented 3 years ago

@SamRodkey posted another comment not visible here:

Not sure if this warrants a separate issue, but I'm a bit stumped as to why I keep getting task failures when using parallel clocks.

I first ran into this problem when digging into why a simulation I was developing, one with a bunch of parallel decoupled processes, was so slow. I was running 20+ parallel processes that do not depend on each other and only communicate through channels, and run time was 100x slower than with 2-4 parallel processes. I decided to turn on parallelism to see if that would help improve run time.

It did, but many of my processes would randomly fail on calls to rand() even when I wasn't using random numbers.

I've been able to reproduce this sort of outcome on my machine with a far simpler example I'll paste in here. When I run the following on my machine:


using DiscreteEvents, Printf

# here is the process
function TaskDoer(clock::Clock, input_queue::Channel{Float64})
    task_duration = take!(input_queue)  # process will block on take
    delay!(clock, task_duration)
end

"Helper function to display parallel clock processes/state"
function display_pclock(clock)
    @printf("\n")
    for thread in 1:Threads.nthreads()
        if thread == 1
            clk = clock
        else
            clk = pclock(clock, thread).clock
        end

        @printf("Clock %i\n", thread)

        for (n, p) in clk.processes
            @printf("\tProcess %i : %s : %s\n", n, p.f, p.task)
        end
    end
end

function run_sim()
    clock = PClock()
    pseed!(100)

    # create non-random durations for each task
    durations = 0.3 .+ 0.1 * sin.(range(0., stop=2*pi, length=8000))

    # create input queue
    input_queue = Channel{Float64}(10000)

    # spawn a bunch of processes
    for i = 1:32
        process!(clock, Prc(TaskDoer, input_queue), cid = (mod(i, Threads.nthreads()) + 1))
    end

    # put durations into the input channel
    for d in durations
        put!(input_queue, d)
    end

    @time run!(clock, 1.)     # run for a second to get compile/JIT warmup
    @time run!(clock, 5000.)  # run for full desired amount

    display_pclock(clock)

    return clock
end

clock = run_sim();
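As a side note, the number of parallel clocks corresponds to the number of threads the Julia session was started with; the Clock 1 (+7) summary at the end of the output below suggests eight threads. A quick check (hypothetical REPL session):

julia> Threads.nthreads()
8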

> I get the following output, or something very similar with a few of the failed tasks occurring in different places.
>

0.510163 seconds (450.81 k allocations: 26.262 MiB, 40.30% compilation time)
11.688433 seconds (35.38 M allocations: 771.571 MiB, 2.15% gc time)

Clock 1
    Process 4 : TaskDoer : Task (runnable) @0x000000001ab35dc0
    Process 2 : TaskDoer : Task (runnable) @0x000000001a773210
    Process 3 : TaskDoer : Task (runnable) @0x000000001a8b7080
    Process 1 : TaskDoer : Task (runnable) @0x000000001a10f3a0
Clock 2
    Process 4 : TaskDoer : Task (failed) @0x00000000820652d0
    Process 2 : TaskDoer : Task (runnable) @0x0000000082064e20
    Process 3 : TaskDoer : Task (runnable) @0x0000000082065140
    Process 1 : TaskDoer : Task (runnable) @0x000000008126e720
Clock 3
    Process 4 : TaskDoer : Task (runnable) @0x000000001a033210
    Process 2 : TaskDoer : Task (runnable) @0x000000001a032ef0
    Process 3 : TaskDoer : Task (runnable) @0x000000001a033080
    Process 1 : TaskDoer : Task (runnable) @0x00000000805eb6c0
Clock 4
    Process 4 : TaskDoer : Task (runnable) @0x0000000080c847e0
    Process 2 : TaskDoer : Task (runnable) @0x0000000080c844c0
    Process 3 : TaskDoer : Task (failed) @0x0000000080c84650
    Process 1 : TaskDoer : Task (runnable) @0x0000000080c87080
Clock 5
    Process 4 : TaskDoer : Task (runnable) @0x000000001a0347e0
    Process 2 : TaskDoer : Task (runnable) @0x000000001a0344c0
    Process 3 : TaskDoer : Task (runnable) @0x000000001a034650
    Process 1 : TaskDoer : Task (runnable) @0x000000001a034330
Clock 6
    Process 4 : TaskDoer : Task (runnable) @0x000000001a0387e0
    Process 2 : TaskDoer : Task (runnable) @0x000000001a0384c0
    Process 3 : TaskDoer : Task (runnable) @0x000000001a038650
    Process 1 : TaskDoer : Task (runnable) @0x000000001a038330
Clock 7
    Process 4 : TaskDoer : Task (runnable) @0x000000001a03c7e0
    Process 2 : TaskDoer : Task (runnable) @0x000000001a03c4c0
    Process 3 : TaskDoer : Task (runnable) @0x000000001a03c650
    Process 1 : TaskDoer : Task (runnable) @0x000000001a03c330
Clock 8
    Process 4 : TaskDoer : Task (failed) @0x000000001a0407e0
    Process 2 : TaskDoer : Task (runnable) @0x000000001a0404c0
    Process 3 : TaskDoer : Task (runnable) @0x000000001a040650
    Process 1 : TaskDoer : Task (runnable) @0x000000001a040330
Clock 1 (+7): state=:idle, t=5001.0, Δt=0.01, prc:4
  scheduled ev:0, cev:0, sampl:0

> When I go into the REPL and look at these failed tasks, I get call stacks that point to calls to rand inside of delay!. Is this expected? Why do we call rand at all? I looked into this a little bit and was very confused.
>
> Here is a representative call stack:

julia> pclock(clock, 8).clock.processes[4].task
Task (failed) @0x000000001a0407e0
AssertionError: length(ints) == 501
Stacktrace:
  [1] mt_setfull!(r::Random.MersenneTwister, #unused#::Type{UInt64})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\RNGs.jl:260
  [2] reserve1
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\RNGs.jl:291 [inlined]
  [3] mt_pop!
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\RNGs.jl:296 [inlined]
  [4] rand
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\RNGs.jl:464 [inlined]
  [5] rand
      @ C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:256 [inlined]
  [6] rand(rng::Random.MersenneTwister, sp::Random.SamplerRangeNDL{UInt64, Int64})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\generation.jl:332
  [7] rand(rng::Random.MersenneTwister, X::UnitRange{Int64})
      @ Random C:\buildbot\worker\package_win64\build\usr\share\julia\stdlib\v1.6\Random\src\Random.jl:253
  [8] _spawnid(c::Clock{Vector{DiscreteEvents.ClockChannel}})
      @ DiscreteEvents ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:212
  [9] _spawnid(ac::DiscreteEvents.ActiveClock{DiscreteEvents.ClockEvent})
      @ DiscreteEvents ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:219
 [10] _spawnid(c::Clock{Ref{DiscreteEvents.ActiveClock{DiscreteEvents.ClockEvent}}})
      @ DiscreteEvents ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:216
 [11] _cid(c::Clock{Ref{DiscreteEvents.ActiveClock{DiscreteEvents.ClockEvent}}}, cid::Int64, spawn::Bool)
      @ DiscreteEvents ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:224
 [12] event!(clk::Clock{Ref{DiscreteEvents.ActiveClock{DiscreteEvents.ClockEvent}}}, ex::DiscreteEvents.var"#64#65"{Condition}, t::Float64; cid::Int64, spawn::Bool)
      @ DiscreteEvents ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:72
 [13] event!
      @ ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:70 [inlined]
 [14] #event!#16
      @ ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:110 [inlined]
 [15] event!
      @ ~.julia\packages\DiscreteEvents\DM41T\src\schedule.jl:108 [inlined]
 [16] delay!
      @ ~.julia\packages\DiscreteEvents\DM41T\src\process.jl:116 [inlined]
 [17] TaskDoer(clock::Clock{Ref{DiscreteEvents.ActiveClock{DiscreteEvents.ClockEvent}}}, input_queue::Channel{Float64})
      @ Main ~\basic_parallel_test.jl:11
 [18] _loop(p::Prc, cycles::Float64)
      @ DiscreteEvents ~.julia\packages\DiscreteEvents\DM41T\src\process.jl:36
 [19] (::DiscreteEvents.var"#58#60"{Prc, Float64})()
      @ DiscreteEvents .\task.jl:118


> I think I'm misunderstanding how the sim works under the hood.
>
> Perhaps this is user error and the documentation could describe how to better initialize the random number generator when running with PClock? I tried using pseed! but that didn't seem to do the trick.
>
> Any advice @pbayer?

I am posting it here from my emails in order to work on it.

pbayer commented 3 years ago

@SamRodkey thank you for this report.

First Diagnosis

I can reproduce those failures. I am not yet sure why the stack then fails inside random number generation, but the call itself is doing the wrong thing: after a call to delay! it tries to create an event! on a random clock (hence the rand() call in _spawnid).

Instead, a process running under a parallel clock must schedule its event on that same clock, not on a randomly chosen one. To do that it must call event! with spawn = false. This is missing in the delay! function around line 116 of process.jl, as reported in the stack trace above.
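A rough sketch of what that means (not the actual source of process.jl, which wraps more machinery; delay_sketch! is a hypothetical name):

# conceptual sketch: delay! must keep its wake-up event on the calling clock
function delay_sketch!(clk::Clock, Δt)
    c = Condition()
    event!(clk, () -> notify(c), tau(clk) + Δt; spawn = false)  # spawn = false: same clock
    wait(c)
end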

pbayer commented 3 years ago

@SamRodkey : In your first comment you reported:

It seems like using processes has very poor performance beyond 10 or so processes all putting/taking from Channels.

I know of "very poor performance" with lots and lots of processes (in the thousands or so), but I have to investigate a performance degradation beyond "10 or so processes". That should not be the case.

The process-based modeling approach made the most intuitive sense to me and also made my code come out super readable for others.

I agree that it is often the more appropriate approach, and therefore there should be no performance degradation for a reasonable number of processes.

A first impression: in your TaskDoer function you do a take! from a channel. This actually slows a simulation down because it performs I/O and hands control over to the Julia scheduler. But I do not know whether that is the cause of the performance degradation you observed.
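To illustrate what happens on an empty channel (plain Julia, independent of DiscreteEvents):

ch = Channel{Float64}(10)
t = @async take!(ch)   # the task parks on the empty channel; control returns to the scheduler
istaskdone(t)          # false until something arrives
put!(ch, 1.0)
fetch(t)               # 1.0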

pbayer commented 3 years ago

OK, regarding the above task failures: the problem was the use of ifelse instead of the ternary operator _ ? _ : _ in _cid. Since ifelse evaluates both of its branch arguments eagerly, _spawnid got evaluated even when it was not needed (i.e. with spawn=false). This will be fixed with v0.3.4, see below.
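For illustration, a minimal standalone example of that difference (expensive() is just a stand-in for _spawnid):

expensive() = (println("evaluated anyway"); 42)

flag = false
a = ifelse(flag, expensive(), 0)   # prints: ifelse evaluates both arguments eagerly
b = flag ? expensive() : 0         # prints nothing: the ternary short-circuits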

To do

There remains the problem that a call to rand() from a parallel thread can cause a task failure, even though it will no longer occur in the above context. I must investigate that further.
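For what it is worth, the usual Julia pattern to avoid sharing one MersenneTwister across threads is to give each thread its own generator (a generic sketch, not the package's internal handling):

using Random

const RNGS = [MersenneTwister(100 + i) for i in 1:Threads.nthreads()]

thread_rand(args...) = rand(RNGS[Threads.threadid()], args...)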

pbayer commented 3 years ago

Register v0.3.4

@JuliaRegistrator register

Release notes:

This is a bugfix release: it fixes the parallel-clock task failures caused by the use of ifelse in _cid (see above).

JuliaRegistrator commented 3 years ago

Registration pull request created: JuliaRegistries/General/34111

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the github interface, or via:

git tag -a v0.3.4 -m "<description of version>" fba6278277cdfa8698544ff1f8f36c7176d02eb1
git push origin v0.3.4

SamRodkey commented 3 years ago

Sorry for the super delayed reply @pbayer.

I originally ended up deleting my comment because, after working on it for a little while, I was not sure whether it was user error, me doing something dumb that isn't how the code is intended to be used. Your comments above indicate that it at least wasn't expected behavior.

One first impression: In your TaskDoer function you do a take! from a channel. This actually slows a simulation by doing an IO and switching control to the Julia scheduler. But I do not know if that may be the cause for your observation of a performance degradation.

I am pretty aggressively using Channels to communicate certain data between parallel processes. I figured that was the simplest approach when I have multiple agents all taking from the same queue, but maybe it isn't optimal for performance? I definitely do see my sim grinding a lot and running slower than I expected when I have 16-32 agents all taking from a common task queue. I will investigate further this week.

pbayer commented 1 year ago

I am closing this issue since, in my view, parallel clocks have to be rewritten to make them stable; see issue #47.

So, sorry, back to the drawing board 😥