Run Benchmarks Single-Threaded

storopoli commented 1 year ago

Stan: No Threading (no reduce_sum)
NONMEM: Running the chains without the parafile
Pumas: ensemblealg = EnsembleSerial()

storopoli commented 1 year ago

(Draft)

Using `niter=150` and `nburn=50`

Results (1 vCPU - 1 chain only):

NONMEM: 30.64s
Stan: 364.2s
Pumas: 977.72s

Results (4 vCPU - 4 parallel chains):

NONMEM: 68.52s
Stan: 1830.4s
Pumas: 10205.56s

Results (8 vCPU - 4 parallel chains):

NONMEM: 85.10s
Stan: 749.0s
Pumas: 1742.58s

Using `niter=250` and `nburn=250`.

Results (4 vCPU - 4 parallel chains):

NONMEM: 68.52s
Stan: 1130.0s
Pumas: 1709.11s

Results (8 vCPU - 4 parallel chains):

NONMEM: 89.64s
Stan: 539.7s
Pumas: 1130.72s

Cc @cbdavis33, @PavanVaddady, and @mohamed82008

andreasnoack commented 1 year ago

Why are all the 4 vCPU runs slower than the single vCPU run? What runs in parallel? Chains or the data?

storopoli commented 1 year ago

Chains. And they ran faster (let me remove the x4, which was an annotation).

mohamed82008 commented 1 year ago

Andreas was referring to Pumas with 4 vCPUs and 4 chains taking 10205.56 s, while 1 vCPU with 1 chain took 977.72 s, which is less than 25% of the first number.
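
To make the comparison concrete, here is a small sketch that divides each tool's 1 vCPU / 1 chain time by its 4 vCPU / 4 chain time, using the `niter=150`, `nburn=50` numbers posted above. The variable names are only illustrative. If the four chains each ran on their own vCPU without interference, one would expect a ratio near 1; a ratio near 0.25 would suggest the chains effectively ran one after another.

```julia
# Wall-clock times in seconds, copied from the niter=150 / nburn=50 results above.
serial_1vcpu   = (NONMEM = 30.64, Stan = 364.2, Pumas = 977.72)
parallel_4vcpu = (NONMEM = 68.52, Stan = 1830.4, Pumas = 10205.56)

# Ratio of the 1-chain time to the 4-chain time, per tool.
# `map` over two NamedTuples with the same field names returns a NamedTuple.
ratios = map(/, serial_1vcpu, parallel_4vcpu)

for (tool, r) in pairs(ratios)
    println(rpad(string(tool), 8), round(r; digits = 3))
end
```

For Pumas the ratio comes out just under 0.1, so the 4-chain run took more than ten times as long as the single-chain run, which is the oddity being pointed out.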

mohamed82008 commented 1 year ago

Maybe that's what he meant, or maybe not, but I find that odd too.

mohamed82008 commented 1 year ago

Can we get the 1 vCPU result for niter=250 and nburn=250? I think the nburn = 50 case is not adapting the mass matrix and there is going to be too much variance in the performance of NUTS.

andreasnoack commented 1 year ago

For debugging purposes, it would be useful to have a serial NONMEM run with a diagonal mass matrix to see if that explains why NONMEM is faster than Stan.

storopoli commented 1 year ago

@andreasnoack, there's a problem. We have no idea what the NONMEM settings in the .mod file are doing; NONMEM doesn't seem to obey them. I don't know whether a diagonal (D) mass matrix would be honored by NONMEM if we added it. For example, we set NBURN=500 and NITER=1000, but NONMEM does its own thing with the warmup and does not stick to 500. Instead it uses a dynamic routine that tunes ahead of time the number of NBURN.

andreasnoack commented 1 year ago

Okay. Are the numbers then comparable between NONMEM and Pumas/Stan? I.e., when you write niter=150 and nburn=50, do you think that NONMEM is actually running with those settings?

andreasnoack commented 1 year ago

Also, it might be good to ask about the NONMEM issues on NMUsers. Before doing so, are there any warnings in the .lst file?

andreasnoack commented 1 year ago

> Instead it uses a dynamic routine that tunes ahead of time the number of NBURN.

The NONMEM docs have the following in the NUTS_BASE section:

> If NUTS_BASE<=-1.0, then NUTS_BASE will be set to the largest block section of the mass matrix plus 10. ... The AUTO feature set NUTS_BASE to -3

So when https://github.com/PumasAI-Labs/Bayesian-Benchmarks/blob/30e335e9c8fe16b70b38c7674e9ac8d03a4ca088/01-iv_2cmt_linear/NONMEM/iv-2cmt-linear/chains/iv-2cmt-linear-1.lst#L380-L386 doesn't start at -500 it could be because

https://github.com/PumasAI-Labs/Bayesian-Benchmarks/blob/30e335e9c8fe16b70b38c7674e9ac8d03a4ca088/01-iv_2cmt_linear/NONMEM/iv-2cmt-linear/chains/iv-2cmt-linear-1.lst#L346-L348

and

```julia
julia> 14 + 14*2 + 14*4 + 14*8 + 75 + 50
335

julia> 14 + 14*2 + 14*4 + 14*8 + 14*16 + 75 + 50
559
```

and NONMEM's NUTS then doesn't start a Stage II substage that cannot be completed within the NBURN budget. I.e., it's not a dynamic routine. I guess you could easily verify this by incrementing NBURN to 559. If my conjecture is right, then the number of burn-in iterations in the trace should increase between NBURN=558 and NBURN=559, assuming I've understood the docs and counted the block sizes correctly.
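
The conjecture can be written down as a small function. This is only a sketch of the arithmetic above, not NONMEM's actual logic: `conjectured_burnin` is a hypothetical helper, `base = 14` is the NUTS_BASE value implied by the sums, and `fixed = 75 + 50` lumps together the two constant terms from the sums without claiming which stage each belongs to.

```julia
# Hypothetical helper, not NONMEM code: how many burn-in iterations a
# doubling-window schedule would use if a window is only started when it
# can be completed within the NBURN budget.
function conjectured_burnin(nburn; base = 14, fixed = 75 + 50)
    nburn >= fixed || throw(ArgumentError("nburn must be at least $fixed"))
    used = fixed      # the constant 75 + 50 iterations from the sums above
    window = base     # first doubling window: 14, then 28, 56, ...
    while used + window <= nburn
        used += window
        window *= 2
    end
    return used
end

for nburn in (500, 558, 559)
    println("NBURN=", nburn, " -> ", conjectured_burnin(nburn))
end
```

Under these assumptions NBURN=500 and NBURN=558 both give 335 burn-in iterations, while NBURN=559 gives 559, which is the jump the proposed check would look for in the trace.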

Anyway, the reason I'm pointing this out is mostly that I don't think this smells like a NONMEM bug, so if you set NUTS_MASS=D then I'd expect the mass matrix to actually be diagonal. I think two serial runs where the only difference is NUTS_MASS=B vs. NUTS_MASS=D would be extremely valuable for understanding how much of the performance difference is algorithmic and how much is related to the implementations.
