Open storopoli opened 1 year ago
(Draft)
niter=150 and nburn=50
Results (1 vCPU - 1 chain only):
Results (4 vCPU - 4 parallel chains):
Results (8 vCPU - 4 parallel chains):
niter=250 and nburn=250.
Results (4 vCPU - 4 parallel chains):
Results (8 vCPU - 4 parallel chains):
Cc @cbdavis33 and @PavanVaddady and @mohamed82008
Why are all the 4 vCPU runs slower than the single vCPU run? What runs in parallel? Chains or the data?
Chains. And they ran faster (let me remove the x4, which was an annotation).
Andreas was referring to Pumas with 4 vCPUs and 4 chains taking 10205.56 s, while 1 vCPU with 1 chain took 977.72 s, which is less than 25% of the first number. Maybe that's what he meant, maybe not, but I find that odd too.
Can we get the 1 vCPU result for niter=250 and nburn=250? I think the nburn = 50 case is not adapting the mass matrix and there is going to be too much variance in the performance of NUTS.
For debugging purposes, it would be useful to have a serial NONMEM run with a diagonal mass matrix, to see whether that explains why NONMEM is faster than Stan.
@andreasnoack, there's a problem. We have no idea what the NONMEM configs in the .mod file are doing. They don't seem to be obeyed. I don't know whether a D mass matrix, if we add one, will be honored by NONMEM. For example, we added NBURN=500 and NITER=1000, but NONMEM is doing its own thing with the warmup NBURN and does not fix it to 500. Instead it uses a dynamic routine that tunes ahead of time the number of NBURN.
Okay. Are the numbers then comparable between NONMEM and Pumas/Stan? I.e. when you write niter=150 and nburn=50, do you think that NONMEM is actually running with those settings?
Also, it might be good to ask about the NONMEM issues on NMUsers. Before doing so, are there any warnings in the lst?
> Instead it uses a dynamic routine that tunes ahead of time the number of NBURN.
The NONMEM docs have the following in the NUTS_BASE section:
> If NUTS_BASE<=-1.0, then NUTS_BASE will be set to the largest block section of the mass matrix plus 10. ... The AUTO feature set NUTS_BASE to -3
So when https://github.com/PumasAI-Labs/Bayesian-Benchmarks/blob/30e335e9c8fe16b70b38c7674e9ac8d03a4ca088/01-iv_2cmt_linear/NONMEM/iv-2cmt-linear/chains/iv-2cmt-linear-1.lst#L380-L386 doesn't start at -500 it could be because
and
julia> 14 + 14*2 + 14*4 + 14*8 + 75 + 50
335
julia> 14 + 14*2 + 14*4 + 14*8 + 14*16 + 75 + 50
559
and NONMEM's NUTS then doesn't start a Stage II substage that cannot be completed within the NBURN budget. I.e. it's not a dynamic routine. I guess you could easily verify this by increasing NBURN to 559. If my conjecture is right, then the number of burn-in iterations in the trace should jump when NBURN goes from 558 to 559, provided I've understood the docs and counted the block sizes correctly.
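To make the conjecture concrete, here is a minimal sketch of the counting rule it implies. The function name `warmup_used` and the keyword names `base`, `init` and `term` are made up for illustration, and the values 14, 75 and 50 are simply the terms from the sums above. This is not NONMEM's actual implementation, only the rule "keep doubling the Stage II substage and stop before one that no longer fits in NBURN".

```julia
# Hypothetical helper (not a NONMEM or Pumas API): number of burn-in iterations
# actually used if a doubling substage is only started when it fits in the budget.
# Assumes nburn >= init + term.
function warmup_used(nburn; base=14, init=75, term=50)
    used = init + term      # the two fixed terms from the sums above (75 + 50)
    window = base           # length of the first doubling substage
    while used + window <= nburn
        used += window
        window *= 2         # each substage is twice as long as the previous one
    end
    return used
end

warmup_used(500)  # 335
warmup_used(558)  # 335
warmup_used(559)  # 559
```

Under these assumptions the used burn-in stays at 335 for any NBURN from 335 to 558 and jumps to 559 only once the fifth substage (14*16) fits, which is the behaviour the 558 vs 559 check would confirm or rule out.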
Anyway, the reason I'm pointing this out is mostly that I don't think this smells like a NONMEM bug, so if you set NUTS_MASS=D then I'd expect the mass matrix to actually be diagonal. I think two serial runs where the only difference is NUTS_MASS=B vs NUTS_MASS=D would be extremely valuable for understanding how much of the performance difference is algorithmic and how much is related to the implementations.
reduce_sum
)parafile
ensemblealg = EnsembleSerial()