StatisticalRethinkingJulia / MCMCBenchmarks.jl

Comparing performance and results of mcmc options using Julia

Memory differences for parallel and single chains #19

Closed · itsdfish closed 5 years ago

itsdfish commented 5 years ago

Hi Rob-

A few days ago you noted that rhat calculations probably need multiple chains. I think this might be true. One question this raises is why rhat did not produce an error for a single chain. Although I don't completely understand rhat, it seems like there might be a split-half calculation, in which a single chain is divided in two and the halves are compared, which would allow rhat to be computed on one chain. If that is the case, I'm not sure whether rhat is valid with a single chain.
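
For concreteness, here is a minimal sketch of what such a split-half calculation might look like (this assumes the standard split-R̂ formulation and a hypothetical helper name; I have not checked how MCMCDiagnostics actually computes it):

using Statistics

#hypothetical helper: split a single chain in half and treat the
#halves as two chains in the usual rhat computation
function split_rhat(draws::AbstractVector)
    n = div(length(draws),2)
    halves = [draws[1:n],draws[n+1:2n]]
    B = n*var(mean.(halves))       #between-half variance
    W = mean(var.(halves))         #within-half variance
    varplus = (n-1)/n*W + B/n      #pooled variance estimate
    return sqrt(varplus/W)
end

#a well-mixed stationary chain should give a value near 1
split_rhat(randn(2000))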

In order to have more confidence in rhat, I changed the code to run multiple chains in parallel. After running the benchmark with this new setup, I noticed some unusual differences in the memory estimates. Garbage collection is often low or zero for parallel chains. One explanation I considered was that running chains sequentially on a single processor might trigger garbage collection more often than chains running independently on separate processors. However, the pattern remained after comparing pmap and map with a single chain.

I also noticed that running chains via map led to more allocations and a larger amount of allocated memory compared to pmap. I expected the opposite to be the case. Here are some examples:

using MCMCBenchmarks,Distributed
using BenchmarkTools #for @benchmark (may already be re-exported by MCMCBenchmarks)
#spin up one worker process per chain
Nchains=4
setprocs(Nchains)

ProjDir = @__DIR__
cd(ProjDir)

path = pathof(MCMCBenchmarks)
@everywhere begin
  using MCMCBenchmarks
  #Model and configuration patterns for each sampler are located in a
  #separate model file.
  include(joinpath($path, "../../Models/Gaussian/Gaussian_Models.jl"))
end

#run this on primary processor to create tmp folder
include(joinpath(path,
  "../../Models/Gaussian/Gaussian_Models.jl"))

#DynamicHMC NUTS sampler configured for 2000 samples
s = DHMCNUTS(sampleDHMC,2000)
#generate Nd = 1000 observations
data = GaussianGen(;Nd=1000)

#four chains in parallel, one per worker
@benchmark reduce(chainscat,pmap(x->$s.model($data...,$s.Nsamples),1:4))

#four chains sequentially on the primary process
@benchmark reduce(chainscat,map(x->$s.model($data...,$s.Nsamples),1:4))

Results:

pmap

BenchmarkTools.Trial: 
  memory estimate:  1.58 MiB
  allocs estimate:  2700
  --------------
  minimum time:     813.789 ms (0.00% GC)
  median time:      838.423 ms (0.00% GC)
  mean time:        837.920 ms (0.00% GC)
  maximum time:     857.896 ms (0.00% GC)
  --------------
  samples:          6
  evals/sample:     1

map

BenchmarkTools.Trial: 
  memory estimate:  207.43 MiB
  allocs estimate:  6320990
  --------------
  minimum time:     3.046 s (1.36% GC)
  median time:      3.087 s (1.28% GC)
  mean time:        3.087 s (1.28% GC)
  maximum time:     3.129 s (1.21% GC)
  --------------
  samples:          2
  evals/sample:     1

Here is what happens when I run a single chain using pmap and map:

Results:

pmap

BenchmarkTools.Trial: 
  memory estimate:  82.59 KiB
  allocs estimate:  542
  --------------
  minimum time:     703.298 ms (0.00% GC)
  median time:      780.517 ms (0.00% GC)
  mean time:        770.912 ms (0.00% GC)
  maximum time:     825.555 ms (0.00% GC)
  --------------
  samples:          7
  evals/sample:     1

map

BenchmarkTools.Trial: 
  memory estimate:  48.08 MiB
  allocs estimate:  1477529
  --------------
  minimum time:     715.030 ms (1.11% GC)
  median time:      781.783 ms (1.11% GC)
  mean time:        776.526 ms (1.25% GC)
  maximum time:     809.273 ms (1.07% GC)
  --------------
  samples:          7
  evals/sample:     1

It is important to note that @timed and @timev produce similar results.

Let's see what happens with a simpler example:

using Distributed,BenchmarkTools
addprocs(4)

#define f on all processes: raise a random n x n matrix to the 10th power
@everywhere f(n) = rand(n,n)^10

n = 1000

#four calls distributed across the workers
@benchmark pmap(x->f($n),1:4)

#four calls on the primary process
@benchmark map(x->f($n),1:4)

Results:

pmap

BenchmarkTools.Trial: 
  memory estimate:  30.54 MiB
  allocs estimate:  536
  --------------
  minimum time:     213.934 ms (0.98% GC)
  median time:      262.867 ms (0.66% GC)
  mean time:        270.226 ms (2.42% GC)
  maximum time:     394.445 ms (16.92% GC)
  --------------
  samples:          19
  evals/sample:     1

map

BenchmarkTools.Trial: 
  memory estimate:  152.59 MiB
  allocs estimate:  201
  --------------
  minimum time:     529.664 ms (1.86% GC)
  median time:      605.178 ms (16.75% GC)
  mean time:        600.095 ms (12.36% GC)
  maximum time:     671.439 ms (16.04% GC)
  --------------
  samples:          9
  evals/sample:     1

This pattern is still somewhat unexpected to me. While pmap produced more allocations, the amount of memory it allocated was less than that of map. It almost seems like the memory estimates exclude the memory used on the worker processes and instead only measure how much is allocated on the primary process when the results are collected. I'm perplexed. Do you understand what is happening here?
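
One way to probe this hypothesis (a sketch only, I have not run it) would be to move the measurement inside the pmap closure with @timed, so each worker reports its own allocations:

using Distributed
addprocs(4)

@everywhere f(n) = rand(n,n)^10

n = 1000
#@timed runs on each worker, so bytes reflects that worker's allocations;
#destructuring works because @timed returns (value, time, bytes, ...)
bytes_per_worker = pmap(1:4) do _
    val,t,bytes = @timed f(n)
    bytes
end
println("per worker: ",bytes_per_worker)
println("total: ",sum(bytes_per_worker))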

goedman commented 5 years ago

Stan is using split_Rhat these days for single chains. I think that is simply the ratio of the within variance to the between variance of the first half of the draws vs. the second half. One proposal is to do that even with multiple chains. Not sure how MCMCDiagnostics does this, but it shouldn't be too hard to figure that out from the code.

It is interesting that in your first example pmap indeed achieves an improvement of somewhere around 70% per physical core, e.g. it takes a little longer than 1/4 of the sequential time on a 4-core machine. I've often seen that with CmdStan and Mamba with multiple chains. The improvement seems lower in your last example, but that could be caused by the underlying libraries.

I'm almost convinced that BenchmarkTools does not collect allocations etc. across cores. If that is true, is it reporting only the first process? I'll have a look at the open and closed BenchmarkTools issues; this must have come up before.

itsdfish commented 5 years ago

Yeah. I was converging on the same conclusion. If that is the case, we have three options:

  1. revert to the single chain setup we were using before, so long as rhat values are reasonable (they do not look much different for single and multiple chains)
  2. measure memory of the primary core for multiple chains, as the code appears to do now
  3. try to measure each core and aggregate the measures.

itsdfish commented 5 years ago

Option three would entail a wrapper function to collect the benchmark on each worker:

function runSampler(s::AHMCNUTS,data;Nchains,kwargs...)
    #benchmark one chain on each worker
    performance = pmap(x->wrapper(s,data),1:Nchains)
    #some sort of unpacking of performance
    return output
end

function wrapper(s,data)
    @timed sample(s.model(data...),s.config)
end

For CmdStan, we would have to use the multiple file solution we were using before. The downside of option three is that it would not include the overhead costs of parallelism, but that might be ok.

Option 1 is more straightforward. The primary downside is that it cannot identify cases where each chain converges on a different mean value (but is otherwise stationary).

I'm not sure about option 2. In the best case, it would preserve relative differences between samplers (e.g. differences would be a simple linear transformation). In the worst case, it would give us useless or misleading information. It's not clear to me which state we would be in.

goedman commented 5 years ago

Maybe we should go for option 1 and combine chains from different samplers. It is possible they converge on different means but with similar variances, so an rhat value of 1.01 or greater should provide a hint.

itsdfish commented 5 years ago

Reverting back to option 1 seems straightforward. I can do that over the next couple of days.

Just to clarify, what do you mean by combining chains from different samplers? Do you mean that, when evaluating rhat for a given dataset, chain1 might be Turing, chain2 CmdStan, and chain3 DynamicHMC?

itsdfish commented 5 years ago

I think I understand what you are suggesting: revert to the single chain setup and compute rhat for each sampler to test for stationarity (e.g. the mean and variance are the same within a chain), and compute a separate combined rhat to test for convergence across samplers. I think that would indirectly address the limitation.
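
As a sketch of that combined check (the chain variables here are hypothetical placeholders; MCMCDiagnostics' potential_scale_reduction takes one vector of draws per chain):

using MCMCDiagnostics

#placeholder draws standing in for one chain per sampler; in practice
#these would come from Turing, CmdStan, and DynamicHMC for one parameter
chain_turing = randn(2000)
chain_cmdstan = randn(2000)
chain_dhmc = randn(2000)

#treat one chain from each sampler as the chains entering a single rhat
rhat = potential_scale_reduction(chain_turing,chain_cmdstan,chain_dhmc)

#a value of 1.01 or greater hints that the samplers disagree
rhat >= 1.01 && @warn "samplers may not have converged to the same posterior"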

goedman commented 5 years ago

Yes, exactly. I did that as a test for the DynamicHMC problem case (m10.4) and rhat ends up at 1.2. A few other, non-problematic, models I tried worked fine.

itsdfish commented 5 years ago

Great, thanks for clarifying. I will revert to the single chain setup and add a cross-sampler rhat value that we can use to identify non-convergent samplers. Good idea. This seems to be the best approach given the trade-off space we are in.