lukashergt closed this issue 2 years ago
Equation 20 in the polychord paper gives a more accurate version of the expected parallelisation:
`speedup = nlive * log(1 + nprocs/nlive) ≈ nprocs if nprocs < nlive`
However, you should bear in mind that there is also quite a large error bar on the total run time associated with the Poisson variability in the volume compression.
You can estimate this by noting that the fractional error in the KL divergence will be the same order of magnitude as the fractional error of time taken over repeat runs.
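As a quick sketch of the speedup formula quoted above (the `nlive`/`nprocs` pairs are the exploratory and full run settings that come up later in this thread):

```python
import numpy as np

def speedup(nprocs, nlive):
    # Eq. 20 of the PolyChord paper: the effective number of processes
    return nlive * np.log(1 + nprocs / nlive)

# nprocs well below nlive: nearly ideal parallelisation
print(speedup(32, 100))   # ~27.8 effective processes out of 32
# nprocs comparable to nlive: diminishing returns
print(speedup(320, 500))  # ~247 effective processes out of 320
```

These are the "~28" and "~247" figures discussed further down.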
I should have mentioned that I played around with that equation, but couldn't get it to give a better answer for my example. Maybe I don't properly understand what "speedup" means here... How exactly does this apply to my example?
Does speedup mean that if a first run takes `time_old` on 1 processor, then another run on `nprocess_new` processors will take `time_new = speedup(nprocess_new) * time_old`?

Doing `time_new = (speedup_old / speedup_new) * (nlive_new / nlive_old) * time_old` is awfully off, too, just like my crude estimate above... but maybe we just can't do better?
> However, you should bear in mind that there is also quite a large error bar on the total run time associated with the Poisson variability in the volume compression.
Comparing one `nprocess=32, nlive=100` run to another I get very similar timings (a difference of a few hours maybe, but far from a factor of 2 or worse).

I think what I initially didn't get is the *per live point* bit in the speedup.
Would you agree with the following?
~~`time_new = (speedup_new / speedup_old) * (nlive_old / nlive_new) * time_old = log(1 + nprocs_new / nlive_new) / log(1 + nprocs_old / nlive_old) * time_old`~~

That does seem to be a better estimator, if I actually got it correct now...

EDIT: I did not get it correct...
No, having taken a closer look at the numbers, something does seem to be awry here. You can think of 'speedup' as the effective number of processes that you're using: in the exploratory run this is ~28, and in the full run this is ~247. So you're still in the realm where you'd expect a 9x speedup from parallelisation and a 5x slowdown from the increased nlive, i.e. I would expect the raw walltime numbers to actually be reversed here. The full run seems to be going a factor of four slower than I would expect.
1. Is there a typo in your table?
2. Does PolyChord report 'efficient' parallelisation on all cores at the end of its run in both of these, and if not, what does it report?
3. How many iterations (i.e. print reports) do you get in between run re-starts in the stdout?
Your equation should have (speedup_old / speedup_new) at the start, which will show the 4x slowdown.
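Putting that correction together numerically (a sketch; the ~62 h exploratory walltime and the `nlive`/`nprocs` settings are taken from elsewhere in this thread, not from this post):

```python
import numpy as np

def new_time(t_old, nlive_old, ncpu_old, nlive_new, ncpu_new):
    # runtime scales as 1 / log(1 + nprocs / nlive): the speedup per live point
    return t_old * np.log(1 + ncpu_old / nlive_old) / np.log(1 + ncpu_new / nlive_new)

# exploratory run: nlive=100 on 32 processes, ~62 h walltime
# full run:        nlive=500 on 320 processes
print(new_time(62, 100, 32, 500, 320))  # ~35 h: the full run would be expected to be *faster*
```

This is consistent with the expectation above that the raw walltimes should be reversed.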
> `time_new = (speedup_new / speedup_old) * (nlive_old / nlive_new) * time_old = log(1 + nprocs_new / nlive_new) / log(1 + nprocs_old / nlive_old) * time_old`
Nope, that can't be right, either... If we kept `nprocs` fixed, time would decrease with increasing `nlive`...

Well, this is embarrassing... Sorry for this mess. I guess this goes to show that I am not sure how to apply the speedup to actual timings...
(for future readers -- the last two posts 'crossed', so should be in the reverse order chronologically)
> I would expect the raw numbers of walltime to actually be reversed here, so the full run seems to be going a factor of four slower than I would expect
Ok, then my intuition at least was correct, and it's my example that leads me astray.
> 1. Is there a typo in your table?
No.
> 2. Does PolyChord report 'efficient' parallelisation on all cores at the end of its run in both of these, and if not, what does it report?
32-process run: all scattered over orders of `E-04`, `E-05`, `E-06`, with the longest being `0.32025724E-04`.

320-process run: all below `1.0E-07` except the following 5:

```
Slave310: efficient MPI parallisation; wait_time/slice_time= 0.21638202E-02
Slave 46: efficient MPI parallisation; wait_time/slice_time= 0.38126300E-03
Slave199: efficient MPI parallisation; wait_time/slice_time= 0.67842783E-03
Slave214: efficient MPI parallisation; wait_time/slice_time= 0.15854352E-02
Slave 75: efficient MPI parallisation; wait_time/slice_time= 0.15781217E-02
```
> 3. How many iterations (i.e. print reports) do you get in between run re-starts in the stdout?
Multiple early on (8 in the first one), fewer as it progresses, down to 3 in the second last (running 12 hours) and 1 in the very last (running for 6 hours, of which a lot goes into postprocessing).
How often does PC save a checkpoint for resuming? Only at every print report, or more often than that? If so, could I make it save checkpoints more often by adjusting `compression_factor`?
Could the increased timing have to do with unequal computation times across parameter space in the case of calculations involving cosmologies with curvature? A single likelihood evaluation takes a lot longer the greater `Omega_K` is.
> Multiple early on (8 in the first one), fewer as it progresses, down to 3 in the second last (running 12 hours) and 1 in the very last (running for 6 hours, of which a lot goes into postprocessing).
Is this the same for both the exploratory and full run?
The parallelisation efficiency seems fine to me (as expected, as these are very slow likelihood calls)
> How often does PC save a checkpoint for resuming? Only at every print report or more often than that? If so, could I make it save checkpoints more often by adjusting compression_factor?
It saves a checkpoint every print report, which is defined by the compression factor. This defaults to 1/e, which means it updates roughly every nlive iterations, but you could safely change this to be much more often (say exp(-0.1), which would be every ~nlive/10 iterations) in your case, since the parallelisation efficiency is so high (setting this closer to one decreases the parallelisation efficiency).
Of the ideas we have explored, I think this is most likely to be the cause, although I am very surprised that this gives an order 4 slowdown.
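The checkpoint cadence described above can be sketched as follows (a rough model, not PolyChord's exact bookkeeping: a checkpoint is written each time the prior volume compresses by `compression_factor`, and each iteration compresses the volume by roughly `exp(-1/nlive)`):

```python
import numpy as np

def iters_between_checkpoints(nlive, compression_factor):
    # volume shrinks by ~exp(-1/nlive) per iteration, so it takes roughly
    # -nlive * ln(compression_factor) iterations to shrink by that factor
    return -nlive * np.log(compression_factor)

print(iters_between_checkpoints(500, np.exp(-1.0)))  # default 1/e: ~nlive iterations
print(iters_between_checkpoints(500, np.exp(-0.1)))  # ~nlive/10 iterations
```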
> Could the increased timing have to do with unequal computation times across parameter space in case of calculations involving cosmologies with curvature? A single likelihood evaluation takes a lot longer the greater Omega_K.
I would be surprised if this were the cause, as I don't think this behaviour would change scaling up or down. You can possibly answer this question by examining the number of slow likelihood calls in total, and using that to estimate the average time for each likelihood evaluation.
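One way to do that estimate (a sketch; the 500,000 likelihood calls is a made-up illustrative number, and `effective_processes` is the 'speedup' discussed above):

```python
def avg_like_time(walltime_s, effective_processes, n_calls):
    # total CPU time spent in the likelihood ~ walltime * effective processes
    return walltime_s * effective_processes / n_calls

# e.g. a 62 h run at ~28 effective processes with 500,000 likelihood calls:
print(avg_like_time(62 * 3600, 28, 500_000))  # ~12.5 s per call
```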
> Is this the same for both the exploratory and full run?
No, sorry I should have specified that. That was for the full run only. (Exploratory run didn't need restarting.)
This definitely increases the likelihood that the lost time is in re-starts.
Given that you got a 62h walltime, can I presume the exploratory run was not on HPC? If so, then there may be differences due to the different machines. It would be good to know what cobaya times the likelihoods as on each of the machines (I imagine the 32-core machine may have more up-to-date CPUs than CSD3, which is due an upgrade), which may explain another portion of the discrepancy (although not a factor of 4).
> This definitely increases the likelihood that the lost time is in re-starts.
Yes, there is definitely a hefty amount of time lost in re-starts. I will definitely adjust the `compression_factor`.
> Given that you got a 62h walltime, can I presume the exploratory run was not on HPC?
Correct, I already started an "exploratory" run on CSD3 for comparison, but I can already see that timings are going to be abysmal. So it seems that it's more than lost time in re-starts.
I think the critical thing to check therefore is the timing of the various likelihood components on both machines. If you run:

```
cobaya-run --test <yaml file>
```

on a yaml file generated by `cobaya-cosmo-generator`, then this will output its timing for each component of the likelihood in 'calls per second' (ideally you would also make sure that this was tested on the same parameter set on both machines).
I didn't know about the `--test` option. Neat. Is there a way of getting more evaluations, or do I need to switch to mcmc for that? With this few evaluations the timings are quite variable; on another call on CSD3 I got 25 s for the classy evaluation time...
| | Local | CSD3 |
| --- | --- | --- |
| classy | 35.0722 s | 37.6114 s |
| planck_2018_lowl.tt | 0.000352414 s | 0.000793983 s |
| planck_2018_lowl.ee | 0.000169597 s | 0.000296389 s |
| planck_2018_highl_plik.ttteee | 0.0258009 s | 0.0442965 s |
| planck_2018_lensing.clik | 0.00083975 s | 0.00106907 s |
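For what it's worth, the per-component slowdown factors implied by these timings (CSD3 relative to local) can be computed directly:

```python
# evaluation times (s) from the table above
local = {
    "classy": 35.0722,
    "planck_2018_lowl.tt": 0.000352414,
    "planck_2018_lowl.ee": 0.000169597,
    "planck_2018_highl_plik.ttteee": 0.0258009,
    "planck_2018_lensing.clik": 0.00083975,
}
csd3 = {
    "classy": 37.6114,
    "planck_2018_lowl.tt": 0.000793983,
    "planck_2018_lowl.ee": 0.000296389,
    "planck_2018_highl_plik.ttteee": 0.0442965,
    "planck_2018_lensing.clik": 0.00106907,
}
for name in local:
    print(f"{name}: {csd3[name] / local[name]:.2f}x slower on CSD3")
```

Since classy dominates the total cost and is only ~1.07x slower on CSD3, the machine difference would account for roughly 7%, nowhere near a factor of 4.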
Local:

```
[model] Measuring speeds... (this may take a few seconds)
[model] Setting measured speeds (per sec): {planck_2018_lowl.TT: 2840.0, planck_2018_lowl.EE: 5890.0, planck_2018_highl_plik.TTTEEE: 38.8, planck_2018_lensing.clik: 1190.0, classy: 0.0285}
[polychord] Parameter blocks and their oversampling factors:
[polychord] * 1 : ['logA', 'n_s', 'r', 'Omega_k', 'theta_s_1e2', 'omega_b', 'omega_cdm', 'tau_reio']
[polychord] * 17 : ['A_planck']
[polychord] * 17 : ['calib_100T', 'calib_217T', 'A_cib_217', 'xi_sz_cib', 'A_sz', 'ksz_norm', 'gal545_A_100', 'gal545_A_143', 'gal545_A_143_217', 'gal545_A_217', 'ps_A_100_100', 'ps_A_143_143', 'ps_A_143_217', 'ps_A_217_217', 'galf_TE_A_100', 'galf_TE_A_100_143', 'galf_TE_A_100_217', 'galf_TE_A_143', 'galf_TE_A_143_217', 'galf_TE_A_217']
[run] Test initialization successful! You can probably run now without `--test`.
[planck_2018_lowl.tt] Average evaluation time for planck_2018_lowl.TT: 0.000352414 s (2 evaluations)
[planck_2018_lowl.ee] Average evaluation time for planck_2018_lowl.EE: 0.000169597 s (1 evaluations)
[planck_2018_highl_plik.ttteee] Average evaluation time for planck_2018_highl_plik.TTTEEE: 0.0258009 s (1 evaluations)
[planck_2018_lensing.clik] Average evaluation time for planck_2018_lensing.clik: 0.00083975 s (1 evaluations)
[classy] Average evaluation time for classy: 35.0722 s (2 evaluations)
```
CSD3:

```
[model] Measuring speeds... (this may take a few seconds)
[model] Setting measured speeds (per sec): {planck_2018_lowl.TT: 1260.0, planck_2018_lowl.EE: 3370.0, planck_2018_highl_plik.TTTEEE: 22.6, planck_2018_lensing.clik: 935.0, classy: 0.0266}
[polychord] Parameter blocks and their oversampling factors:
[polychord] * 1 : ['logA', 'n_s', 'r', 'Omega_k', 'theta_s_1e2', 'omega_b', 'omega_cdm', 'tau_reio']
[polychord] * 14 : ['A_planck']
[polychord] * 14 : ['calib_100T', 'calib_217T', 'A_cib_217', 'xi_sz_cib', 'A_sz', 'ksz_norm', 'gal545_A_100', 'gal545_A_143', 'gal545_A_143_217', 'gal545_A_217', 'ps_A_100_100', 'ps_A_143_143', 'ps_A_143_217', 'ps_A_217_217', 'galf_TE_A_100', 'galf_TE_A_100_143', 'galf_TE_A_100_217', 'galf_TE_A_143', 'galf_TE_A_143_217', 'galf_TE_A_217']
[run] Test initialization successful! You can probably run now without `--test`.
[planck_2018_lowl.tt] Average evaluation time for planck_2018_lowl.TT: 0.000793983 s (5 evaluations)
[planck_2018_lowl.ee] Average evaluation time for planck_2018_lowl.EE: 0.000296389 s (1 evaluations)
[planck_2018_highl_plik.ttteee] Average evaluation time for planck_2018_highl_plik.TTTEEE: 0.0442965 s (1 evaluations)
[planck_2018_lensing.clik] Average evaluation time for planck_2018_lensing.clik: 0.00106907 s (1 evaluations)
[classy] Average evaluation time for classy: 37.6114 s (5 evaluations)
```
I've added this answer to the first post, so people don't need to read through the confusing discussion that follows.
Answer
A better prediction for the runtime can be given by the following expression:

```python
import numpy as np

def new_time(t_old, nlive_old, ncpu_old, nlive_new, ncpu_new):
    return t_old * np.log(1 + ncpu_old / nlive_old) / np.log(1 + ncpu_new / nlive_new)
```
Hi @williamjameshandley,
given a completed PC run, is there a good way of predicting the runtime of a run with the same cosmology but with a different number of processes and/or live points?
Question in mathematical form
In other words, given the parameters `{nprocess_old, nlive_old, time_old, nprocess_new, nlive_new}`, how can I estimate `time_new` for a next run, e.g. one with improved precision? I know runtime is roughly linear in both nlive and nprocess as long as `nprocess < nlive`, but a crude linear scaling is clearly not accurate enough. Any way of doing better?
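For concreteness, the crude linear estimate alluded to above is presumably of this form (my reconstruction, not the original formula; the example numbers are the exploratory/full run settings from this thread):

```python
def crude_time(t_old, nlive_old, ncpu_old, nlive_new, ncpu_new):
    # naive scaling: linear in nlive, inverse-linear in the number of processes
    return t_old * (nlive_new / nlive_old) * (ncpu_old / ncpu_new)

print(crude_time(62, 100, 32, 500, 320))  # ~31 h
```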
Example
I got these timings for an exploratory and a full run. How could I have better estimated the runtime of the full run given the exploratory run?
EDIT:
Answer
This should sum up the rather confusing discussion below.
A better prediction for the runtime can be given by the following expression (the `new_time` function defined in the answer further up): `time_new = time_old * log(1 + nprocs_old / nlive_old) / log(1 + nprocs_new / nlive_new)`
Note that this does not actually match the data I listed in the table above. However, I suspect the problem there might have been that the runs from that table had very variable computation times for a single likelihood call across parameter space, which ultimately renders any attempt at predicting the total runtime highly unreliable.