Closed: itsdfish closed this issue 5 years ago
Stan is using split_Rhat these days, even for single chains. I believe it treats the first and second halves of the draws as separate chains and compares between-half to within-half variance. One proposal is to do that even with multiple chains. I'm not sure how MCMCDiagnostics handles this, but it shouldn't be too hard to figure that out from the code.
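For reference, a minimal sketch of that split computation in Julia, using the textbook between/within-variance formula (an illustration, not Stan's exact implementation — newer split-Rhat proposals also rank-normalize the draws first; `split_rhat` is a hypothetical helper name):

```julia
using Statistics

# Split a single chain in half, treat the two halves as separate chains,
# and apply the usual between/within-variance R-hat formula.
function split_rhat(draws::Vector{Float64})
    n = div(length(draws), 2)
    halves = [draws[1:n], draws[n+1:2n]]
    W = mean(var.(halves))        # within-half variance
    B = n * var(mean.(halves))    # between-half variance
    sqrt(((n - 1) / n * W + B / n) / W)
end
```

A stationary chain gives a value near 1; a chain whose mean drifts between the two halves pushes the value well above the usual 1.01 threshold.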
It is interesting that in your first example pmap indeed achieves an improvement of around 70% per physical core, e.g. it takes a little longer than 1/4 of the single-core time on a 4-core machine. I've often seen that with CmdStan and Mamba with multiple chains. The improvement seems lower in your last example, but that could be caused by the underlying libraries.
I'm almost convinced that the benchmark does not collect allocations etc. across cores. If that is true, is it only the first processor it is reporting? I'll have a look at the open and closed BenchmarkTools issues; this must have come up before.
Yeah. I was converging on the same conclusion. If that is the case, we have three options:
Option three would entail a wrapper function to collect the benchmark.
function runSampler(s::AHMCNUTS, data; Nchains, kwargs...)
    # pass data through explicitly so the closure doesn't capture an undefined global
    performance = pmap(x -> wrapper(s, data), 1:Nchains)
    #some sort of unpacking of performance
    return output
end

function wrapper(s, data)
    @timed sample(s.model(data...), s.config)
end
For CmdStan, we would have to use the multiple file solution we were using before. The downside of option three is that it would not include the overhead costs of parallelism, but that might be ok.
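For the unpacking step, one possibility is to lean on the fact that @timed returns a tuple whose elements serialize back from the workers intact. A sketch (here `heavy` is a stand-in for the actual sampling call, and the indices assume the @timed tuple layout `(value, time, bytes, gctime, gc stats)`):

```julia
using Distributed  # pmap runs locally on the master if no workers were added

# Stand-in for the sampling call; any allocating function works for the demo.
heavy(x) = sum(rand(10^6))

# Each element of results is the @timed tuple from one chain,
# so per-chain timings and allocations can be unpacked on the master.
results = pmap(x -> @timed(heavy(x)), 1:4)
times = [r[2] for r in results]   # seconds per chain
bytes = [r[3] for r in results]   # bytes allocated where the chain actually ran
```

This sidesteps the cross-core measurement problem because each measurement is taken on the process that did the work, at the cost of not capturing the parallelism overhead itself.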
Option 1 is more straightforward. The primary downside is that it cannot identify cases where each chain converges on a different mean value (while otherwise being stationary).
I'm not sure about option 2. In the best case, it would preserve relative differences between samplers (e.g. differences would be a simple linear transformation). In the worst case, it would give us useless or misleading information. It's not clear to me which state we would be in.
Maybe we should go for option 1 and also combine chains from different samplers. It is possible they converge on different means but similar variances; in that case an rhat value above 1.01 should provide a hint.
Reverting back to option 1 seems straightforward. I can do that over the next couple of days.
Just to clarify, what do you mean by combining chains from different samplers? Do you mean that, when evaluating rhat for a given dataset, chain1 might be from Turing, chain2 from CmdStan, and chain3 from DynamicHMC?
I think I understand what you are suggesting: revert to the single-chain setup and compute rhat for each sampler to test for stationarity (i.e. the mean and variance are stable within a chain), and compute a separate combined rhat to test for convergence across samplers. I think that might indirectly address the limitation.
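A sketch of what that cross-sampler check could look like (the function name `cross_sampler_rhat` is hypothetical; the formula is the same between/within-variance computation as ordinary multi-chain rhat):

```julia
using Statistics

# Treat one chain per sampler (e.g. Turing, CmdStan, DynamicHMC) as the
# chains in the standard R-hat formula; a value above ~1.01 hints that
# one sampler converged somewhere different from the others.
function cross_sampler_rhat(chains::Vector{Vector{Float64}})
    n = minimum(length.(chains))
    trimmed = [c[1:n] for c in chains]   # equalize chain lengths
    W = mean(var.(trimmed))              # within-chain variance
    B = n * var(mean.(trimmed))          # between-chain variance
    sqrt(((n - 1) / n * W + B / n) / W)
end
```

If all samplers target the same posterior this stays near 1; a sampler stuck at a different mean (like the m10.4 case mentioned below) inflates the between-chain term and pushes the value past the threshold.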
Yes, exactly. I did that as a test for the DynamicHMC problem case (m10.4) and rhat ends up at 1.2. A few other - non problematic - models I tried worked fine.
Great. Thanks for clarifying. I will revert to the single chain set up and add a cross-sampler rhat value that we can use to identify non-convergent samplers. Good idea. This seems to be the best approach given the trade-off space that we are in.
Hi Rob-
A few days ago you noted that rhat calculations probably need multiple chains. I think this might be true. One question this raises is why rhat did not produce an error for a single chain. Although I don't completely understand rhat, it seems like there might be a split half calculation, which would allow rhat to be computed on a single chain. If true, I'm not sure whether rhat is valid with a single chain.
In order to have more confidence in rhat, I changed the code to run multiple chains in parallel. After running the benchmark with this new setup, I noticed some unusual differences in the memory estimates. It seems like garbage collection is often low or zero for parallel chains. One explanation I considered was that running chains sequentially on a single processor might trigger garbage collection more often than chains running independently on separate processors. However, the pattern remained after comparing pmap and map with a single chain.
I also noticed that running chains via map led to more allocations and a larger amount of memory allocated compared to pmap. I expected the opposite to be the case. Here are some examples:
Results: pmap vs. map (benchmark output not preserved)
Here is what happens when I run a single chain using pmap and map:
Results: pmap vs. map (benchmark output not preserved)
It is important to note that @timed and @timev produce similar results. Let's see what happens with a simpler example:
Results: pmap vs. map (benchmark output not preserved)
This pattern is still somewhat unexpected to me. While pmap produced more allocations, the amount of memory allocated was less than that of map. It almost seems like the memory estimates exclude memory used on the secondary processor and instead only measure how much is allocated on the primary processor when the results are collected. I'm perplexed. Do you understand what is happening here?
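That reading seems consistent with a small experiment along these lines (a sketch, assuming one worker can be added; `heavy` is a stand-in allocating function, not code from this project):

```julia
using Distributed
addprocs(1)
@everywhere heavy() = sum(rand(10^6))  # allocates ~8 MB on whichever process runs it

# @timed wrapped around pmap only counts allocations on the master process;
# the ~8 MB per call on the worker never reaches the master's counters.
outer = @timed pmap(x -> heavy(), 1:2)

# @timed inside pmap counts allocations on the process that ran the chain.
inner = pmap(x -> @timed(heavy()), 1:2)
per_chain_bytes = [r[3] for r in inner]
```

If the outer measurement mostly reflects result collection and serialization on the master while the inner measurements show the full per-chain allocations, that would explain why pmap appears to allocate less than map.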