lukashergt closed this issue 2 years ago
Equation 20 in the polychord paper gives a more accurate version of the expected parallelisation:
`speedup = nlive * log(1 + nprocs/nlive) ≈ nprocs if nprocs < nlive`
However, you should bear in mind that there is also quite a large error bar on the total run time associated with the Poisson variability in the volume compression.
You can estimate this by noting that the fractional error in the KL divergence will be the same order of magnitude as the fractional error of time taken over repeat runs.
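As a quick sketch of the speedup formula quoted above (the `nlive`/`nprocs` pairs are the exploratory and full run settings that come up later in this thread):

```python
import numpy as np

def speedup(nprocs, nlive):
    # Eq. 20 of the PolyChord paper: the effective number of processes
    return nlive * np.log(1 + nprocs / nlive)

# nprocs well below nlive: nearly ideal parallelisation
print(speedup(32, 100))   # ~27.8 effective processes out of 32
# nprocs comparable to nlive: diminishing returns
print(speedup(320, 500))  # ~247 effective processes out of 320
```

These are the "~28" and "~247" figures discussed further down.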
I should have mentioned that I played around with that equation, but couldn't get it to give a better answer for my example. Maybe I don't properly understand what "speedup" means here... How exactly does this apply to my example?
Does speedup mean that if a first run takes `time_old` on 1 processor, then another run on `nprocess_new` processors will take `time_new = speedup(nprocess_new) * time_old`?

Doing `time_new = (speedup_old / speedup_new) * (nlive_new / nlive_old) * time_old` is awfully off, too, just like my crude estimate above... but maybe we just can't do better?
> However, you should bear in mind that there is also quite a large error bar on the total run time associated with the Poisson variability in the volume compression.
Comparing one `nprocess=32, nlive=100` run to another I get very similar timings (a difference of a few hours maybe, but far from a factor of 2 or worse).

I think what I initially didn't get is the *per live point* bit in the speedup.
Would you agree with the following?
~~`time_new = (speedup_new / speedup_old) * (nlive_old / nlive_new) * time_old = log(1 + nprocs_new / nlive_new) / log(1 + nprocs_old / nlive_old) * time_old`~~

That does seem to be a better estimator, if I actually got it correct now...

EDIT: I did not get it correct...
No, having taken a closer look at the numbers, something does seem to be awry here. You can think of 'speedup' as the effective number of processes that you're using: in the exploratory run this is ~28, and in the full run this is ~247. So you're still in the realm where you'd expect a 9x speedup from parallelisation and a 5x slowdown from the increased nlive, i.e. I would expect the raw walltime numbers to actually be reversed here. The full run seems to be going a factor of four slower than I would expect.
1. Is there a typo in your table?
2. Does PolyChord report 'efficient' parallelisation on all cores at the end of its run in both of these, and if not, what does it report?
3. How many iterations (i.e. print reports) do you get in between run re-starts in the stdout?
Your equation should have (speedup_old / speedup_new) at the start, which will show the 4x slowdown.
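Putting that correction together numerically (a sketch; the ~62 h exploratory walltime and the `nlive`/`nprocs` settings are taken from elsewhere in this thread, not from this post):

```python
import numpy as np

def new_time(t_old, nlive_old, ncpu_old, nlive_new, ncpu_new):
    # runtime scales as 1 / log(1 + nprocs / nlive): the speedup per live point
    return t_old * np.log(1 + ncpu_old / nlive_old) / np.log(1 + ncpu_new / nlive_new)

# exploratory run: nlive=100 on 32 processes, ~62 h walltime
# full run:        nlive=500 on 320 processes
print(new_time(62, 100, 32, 500, 320))  # ~35 h: the full run would be expected to be *faster*
```

This is consistent with the expectation above that the raw walltimes should be reversed.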
> `time_new = (speedup_new / speedup_old) * (nlive_old / nlive_new) * time_old = log(1 + nprocs_new / nlive_new) / log(1 + nprocs_old / nlive_old) * time_old`
Nope, that can't be right, either... If we kept `nprocs` fixed, time would decrease with increasing `nlive`...

Well, this is embarrassing... Sorry for this mess. I guess this goes to show that I am not sure how to apply the speedup to actual timings...
(for future readers -- the last two posts 'crossed', so should be in the reverse order chronologically)
> I would expect the raw numbers of walltime to actually be reversed here, so the full run seems to be going a factor of four slower than I would expect
Ok, then my intuition at least was correct, and it's my example that leads me astray.
> 1. Is there a typo in your table?
No.
> 2. Does PolyChord report 'efficient' parallelisation on all cores at the end of its run in both of these, and if not, what does it report?
32-process run: all scattered over orders of `E-04`, `E-05`, `E-06`, with the longest being `0.32025724E-04`.

320-process run: all below `1.0E-07` except the following 5:

```
Slave310: efficient MPI parallisation; wait_time/slice_time= 0.21638202E-02
Slave 46: efficient MPI parallisation; wait_time/slice_time= 0.38126300E-03
Slave199: efficient MPI parallisation; wait_time/slice_time= 0.67842783E-03
Slave214: efficient MPI parallisation; wait_time/slice_time= 0.15854352E-02
Slave 75: efficient MPI parallisation; wait_time/slice_time= 0.15781217E-02
```
> 3. How many iterations (i.e. print reports) do you get in between run re-starts in the stdout?
Multiple early on (8 in the first one), fewer as it progresses, down to 3 in the second last (running 12 hours) and 1 in the very last (running for 6 hours, of which a lot goes into postprocessing).
How often does PC save a checkpoint for resuming? Only at every print report, or more often than that? If so, could I make it save checkpoints more often by adjusting `compression_factor`?
Could the increased timing have to do with unequal computation times across parameter space in the case of calculations involving cosmologies with curvature? A single likelihood evaluation takes a lot longer the greater `Omega_K` is.
> Multiple early on (8 in the first one), fewer as it progresses, down to 3 in the second last (running 12 hours) and 1 in the very last (running for 6 hours, of which a lot goes into postprocessing).
Is this the same for both the exploratory and full run?
The parallelisation efficiency seems fine to me (as expected, as these are very slow likelihood calls)
> How often does PC save a checkpoint for resuming? Only at every print report or more often than that? If so, could I make it save checkpoints more often by adjusting compression_factor?
It saves a checkpoint every print report, which is defined by the compression factor. This defaults to 1/e, which means it updates roughly every nlive iterations, but you could safely change this to be much more often (say exp(-0.1), which would be every ~nlive/10 iterations) in your case, since the parallelisation efficiency is so high (setting this closer to one decreases the parallelisation efficiency).
Of the ideas we have explored, I think this is most likely to be the cause, although I am very surprised that this gives an order 4 slowdown.
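The checkpoint cadence described above can be sketched as follows (a rough model, not PolyChord's exact bookkeeping: a checkpoint is written each time the prior volume compresses by `compression_factor`, and each iteration compresses the volume by roughly `exp(-1/nlive)`):

```python
import numpy as np

def iters_between_checkpoints(nlive, compression_factor):
    # volume shrinks by ~exp(-1/nlive) per iteration, so it takes roughly
    # -nlive * ln(compression_factor) iterations to shrink by that factor
    return -nlive * np.log(compression_factor)

print(iters_between_checkpoints(500, np.exp(-1.0)))  # default 1/e: ~nlive iterations
print(iters_between_checkpoints(500, np.exp(-0.1)))  # ~nlive/10 iterations
```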
> Could the increased timing have to do with unequal computation times across parameter space in case of calculations involving cosmologies with curvature? A single likelihood evaluation takes a lot longer the greater Omega_K.
I would be surprised if this were the cause, as I don't think this behaviour would change scaling up or down. You can possibly answer this question by examining the number of slow likelihood calls in total, and using that to estimate the average time for each likelihood evaluation.
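One way to do that estimate (a sketch; the 500,000 likelihood calls is a made-up illustrative number, and `effective_processes` is the 'speedup' discussed above):

```python
def avg_like_time(walltime_s, effective_processes, n_calls):
    # total CPU time spent in the likelihood ~ walltime * effective processes
    return walltime_s * effective_processes / n_calls

# e.g. a 62 h run at ~28 effective processes with 500,000 likelihood calls:
print(avg_like_time(62 * 3600, 28, 500_000))  # ~12.5 s per call
```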
> Is this the same for both the exploratory and full run?
No, sorry I should have specified that. That was for the full run only. (Exploratory run didn't need restarting.)
This definitely increases the likelihood that the lost time is in re-starts.
Given that you got a 62h walltime, can I presume the exploratory run was not on HPC? If so, then there may be differences due to the different machines. It would be good to know what cobaya times the likelihoods as on each of the machines (I imagine the 32-core machine may have more up-to-date CPUs than CSD3, which is due an upgrade), which may explain another portion of the discrepancy (although not a factor of 4).
> This definitely increases the likelihood that the lost time is in re-starts.
Yes, there is definitely a hefty amount of time lost in re-starts. I will definitely adjust the `compression_factor`.
> Given that you got a 62h walltime, can I presume the exploratory run was not on HPC?
Correct, I already started an "exploratory" run on CSD3 for comparison, but I can already see that timings are going to be abysmal. So it seems that it's more than lost time in re-starts.
I think the critical thing to check therefore is the timing of the various likelihood components on both machines. If you run:

```
cobaya-run --test <yaml file>
```

on a yaml file generated by `cobaya-cosmo-generator`, then this will output its timing for each component of the likelihood in 'calls per second' (ideally you would also make sure that this was tested on the same parameter set on both machines).
I didn't know about the `--test` option. Neat. Is there a way of getting more evaluations, or do I need to switch to mcmc for that? With this few evaluations the timings are quite variable; on another call on CSD3 I got 25 s for the classy evaluation time...
| | Local | CSD3 |
| --- | --- | --- |
| classy | 35.0722 s | 37.6114 s |
| planck_2018_lowl.tt | 0.000352414 s | 0.000793983 s |
| planck_2018_lowl.ee | 0.000169597 s | 0.000296389 s |
| planck_2018_highl_plik.ttteee | 0.0258009 s | 0.0442965 s |
| planck_2018_lensing.clik | 0.00083975 s | 0.00106907 s |
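For what it's worth, the per-component slowdown factors implied by these timings (CSD3 relative to local) can be computed directly:

```python
# evaluation times (s) from the table above
local = {
    "classy": 35.0722,
    "planck_2018_lowl.tt": 0.000352414,
    "planck_2018_lowl.ee": 0.000169597,
    "planck_2018_highl_plik.ttteee": 0.0258009,
    "planck_2018_lensing.clik": 0.00083975,
}
csd3 = {
    "classy": 37.6114,
    "planck_2018_lowl.tt": 0.000793983,
    "planck_2018_lowl.ee": 0.000296389,
    "planck_2018_highl_plik.ttteee": 0.0442965,
    "planck_2018_lensing.clik": 0.00106907,
}
for name in local:
    print(f"{name}: {csd3[name] / local[name]:.2f}x slower on CSD3")
```

Since classy dominates the total cost and is only ~1.07x slower on CSD3, the machine difference would account for roughly 7%, nowhere near a factor of 4.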
Local:

```
[model] Measuring speeds... (this may take a few seconds)
[model] Setting measured speeds (per sec): {planck_2018_lowl.TT: 2840.0, planck_2018_lowl.EE: 5890.0, planck_2018_highl_plik.TTTEEE: 38.8, planck_2018_lensing.clik: 1190.0, classy: 0.0285}
[polychord] Parameter blocks and their oversampling factors:
[polychord] * 1 : ['logA', 'n_s', 'r', 'Omega_k', 'theta_s_1e2', 'omega_b', 'omega_cdm', 'tau_reio']
[polychord] * 17 : ['A_planck']
[polychord] * 17 : ['calib_100T', 'calib_217T', 'A_cib_217', 'xi_sz_cib', 'A_sz', 'ksz_norm', 'gal545_A_100', 'gal545_A_143', 'gal545_A_143_217', 'gal545_A_217', 'ps_A_100_100', 'ps_A_143_143', 'ps_A_143_217', 'ps_A_217_217', 'galf_TE_A_100', 'galf_TE_A_100_143', 'galf_TE_A_100_217', 'galf_TE_A_143', 'galf_TE_A_143_217', 'galf_TE_A_217']
[run] Test initialization successful! You can probably run now without `--test`.
[planck_2018_lowl.tt] Average evaluation time for planck_2018_lowl.TT: 0.000352414 s (2 evaluations)
[planck_2018_lowl.ee] Average evaluation time for planck_2018_lowl.EE: 0.000169597 s (1 evaluations)
[planck_2018_highl_plik.ttteee] Average evaluation time for planck_2018_highl_plik.TTTEEE: 0.0258009 s (1 evaluations)
[planck_2018_lensing.clik] Average evaluation time for planck_2018_lensing.clik: 0.00083975 s (1 evaluations)
[classy] Average evaluation time for classy: 35.0722 s (2 evaluations)
```
CSD3:

```
[model] Measuring speeds... (this may take a few seconds)
[model] Setting measured speeds (per sec): {planck_2018_lowl.TT: 1260.0, planck_2018_lowl.EE: 3370.0, planck_2018_highl_plik.TTTEEE: 22.6, planck_2018_lensing.clik: 935.0, classy: 0.0266}
[polychord] Parameter blocks and their oversampling factors:
[polychord] * 1 : ['logA', 'n_s', 'r', 'Omega_k', 'theta_s_1e2', 'omega_b', 'omega_cdm', 'tau_reio']
[polychord] * 14 : ['A_planck']
[polychord] * 14 : ['calib_100T', 'calib_217T', 'A_cib_217', 'xi_sz_cib', 'A_sz', 'ksz_norm', 'gal545_A_100', 'gal545_A_143', 'gal545_A_143_217', 'gal545_A_217', 'ps_A_100_100', 'ps_A_143_143', 'ps_A_143_217', 'ps_A_217_217', 'galf_TE_A_100', 'galf_TE_A_100_143', 'galf_TE_A_100_217', 'galf_TE_A_143', 'galf_TE_A_143_217', 'galf_TE_A_217']
[run] Test initialization successful! You can probably run now without `--test`.
[planck_2018_lowl.tt] Average evaluation time for planck_2018_lowl.TT: 0.000793983 s (5 evaluations)
[planck_2018_lowl.ee] Average evaluation time for planck_2018_lowl.EE: 0.000296389 s (1 evaluations)
[planck_2018_highl_plik.ttteee] Average evaluation time for planck_2018_highl_plik.TTTEEE: 0.0442965 s (1 evaluations)
[planck_2018_lensing.clik] Average evaluation time for planck_2018_lensing.clik: 0.00106907 s (1 evaluations)
[classy] Average evaluation time for classy: 37.6114 s (5 evaluations)
```
I've added this answer to the first post, so people don't need to read through the confusing discussion that follows.
Answer
A better prediction for the runtime can be given by the following expression:

```python
import numpy as np

def new_time(t_old, nlive_old, ncpu_old, nlive_new, ncpu_new):
    return t_old * np.log(1 + ncpu_old / nlive_old) / np.log(1 + ncpu_new / nlive_new)
```
Hi @williamjameshandley,
given a completed PC run, is there a good way of predicting the runtime of a run with the same cosmology but with a different number of processes and/or live points?
Question in mathematical form
In other words, given the parameters `{nprocess_old, nlive_old, time_old, nprocess_new, nlive_new}`, how can I estimate `time_new` for a next run, e.g. one with improved precision? I know runtime is roughly linear in both nlive and nprocess as long as `nprocess < nlive`, but a crude linear scaling is clearly not accurate enough. Any way of doing better?
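For concreteness, the crude linear estimate alluded to above is presumably of this form (my reconstruction, not the original formula; the example numbers are the exploratory/full run settings from this thread):

```python
def crude_time(t_old, nlive_old, ncpu_old, nlive_new, ncpu_new):
    # naive scaling: linear in nlive, inverse-linear in the number of processes
    return t_old * (nlive_new / nlive_old) * (ncpu_old / ncpu_new)

print(crude_time(62, 100, 32, 500, 320))  # ~31 h
```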
Example
I got these timings for an exploratory and a full run. How could I have better estimated the runtime of the full run given the exploratory run?
EDIT:
Answer
This should sum up the rather confusing discussion below.
A better prediction for the runtime can be given by the following expression (the `new_time` function defined in the answer further up): `time_new = time_old * log(1 + nprocs_old / nlive_old) / log(1 + nprocs_new / nlive_new)`
Note that this does not actually match the data I listed in the table above. However, I suspect the problem there might have been that the runs from that table had very variable computation times for a single likelihood call across parameter space, which ultimately renders any attempt at predicting the total runtime highly unreliable.