LSSTDESC / firecrown

DESC Cosmology Likelihood Framework
BSD 3-Clause "New" or "Revised" License

CCL benchmarking #66

Closed: damonge closed this issue 4 years ago

damonge commented 5 years ago

I wasn't sure whether this should be an issue on CCL or here, but here's a quick timing exercise I've carried out on the latest version of CCL (well, actually the version in this PR: https://github.com/LSSTDESC/CCL/pull/614).

The script below generates the theoretical predictions for an LSST-like 3x2pt data vector, using Python's time module to time the different parts of the calculation. For this exercise I've taken 10 lens bins and 10 source bins, assumed that we want all possible auto- and cross-correlations, and that we compute each of them at 100 ell values between 10 and 3000 (the calculation doesn't depend much on what those values are anyway). In this case the results are (numbers are seconds):

Timing summary:
 - Cosmology and Pk: 1.669
 - GC tracers:       0.098
 - WL tracers:       0.366
 - Cl gg:            0.930
 - Cl gl:            1.345
 - Cl ll:            0.707
--------------------------
 Total:              5.115

So it takes around the same time to initialize CLASS's power spectrum as it does to compute the C_ells (actually, a bit longer for the latter). Note that this is for a standard LCDM cosmology with none of the extras (massive neutrinos etc.) that would make CLASS slower.

If we instead assume that we only need to evaluate the power spectra at 15 values of ell, then CLASS becomes the bottleneck (followed by the computation of the lensing kernels):

Timing summary:
 - Cosmology and Pk: 1.706
 - GC tracers:       0.096
 - WL tracers:       0.354
 - Cl gg:            0.139
 - Cl gl:            0.212
 - Cl ll:            0.101
--------------------------
 Total:              2.607

So the question is: is this fast enough? E.g. an MCMC on a single core taking half a million samples would take around 1 month to run in the first case. With only the CLASS bottleneck left (i.e. if we reduced all other calculations to zero time), we could bring that down to ~10 days. Is it worth going through the trouble of achieving this?
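
(For reference, these estimates follow directly from the per-evaluation times above; a quick back-of-the-envelope check, plain arithmetic only:)

n_samples = 5e5      # assumed chain length
t_full = 5.115       # seconds per evaluation, 100-ell case above
t_class_only = 1.7   # seconds if everything except CLASS were free
print("full pipeline: %.1f days" % (n_samples * t_full / 86400.))       # ~29.6
print("CLASS only:    %.1f days" % (n_samples * t_class_only / 86400.)) # ~9.8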

If the answer to the above is "yes", a quick exploration tells me that the time taken to compute the C_ells can't be reduced significantly simply by relaxing the GSL integrator accuracy (I've only managed to reduce the C_ell time by a factor of ~1.5 by raising the tolerance by one order of magnitude), so we should actually look into other integration schemes.
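
(For concreteness, this is the kind of change I mean. Treat it as a sketch only: the knob name and how it is exposed depend on the CCL version. Recent pyccl releases expose the C-level ccl_gsl_params struct as pyccl.gsl_params, and I'm assuming its INTEGRATION_LIMBER_EPSREL field is the relevant Limber-integral tolerance.)

import pyccl as ccl

# Relax the Limber integration tolerance by one order of magnitude
# before building the Cosmology object (attribute name assumed from the
# C-level ccl_gsl_params struct; check your pyccl version).
ccl.gsl_params.INTEGRATION_LIMBER_EPSREL *= 10
cosmo = ccl.Cosmology(Omega_c=0.26066676, Omega_b=0.048974682,
                      h=0.6766, sigma8=0.8102, n_s=0.9665)
# ... then recompute the C_ells as in the script below and compare timings.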

Anyway, sorry for the long issue. Here's the script I've used for this:

import numpy as np
import pyccl as ccl
from scipy.special import erf
import time

# Prepare redshift distribution and bias
nbins = 10
z_edges = np.linspace(0, 2.5, nbins + 1)
z_sigmas = 0.05 * (1 + 0.5 * (z_edges[1:] + z_edges[:-1]))
numz = 512
zz = np.linspace(0, 3, numz)
nz_all = zz**2 * np.exp(-(zz / 0.8)**1.5)
nzs = 0.5 * (erf((z_edges[1:, None] - zz[None, :]) / np.sqrt(2 * z_sigmas[:, None]**2)) -
             erf((z_edges[:-1, None] - zz[None, :]) / np.sqrt(2 * z_sigmas[:, None]**2)))
nzs *= nz_all[None, :]
bz = 0.95 * (1 + zz)

start_all = time.time()
# Set up cosmology object and compute power spectrum
start = time.time()
cosmo = ccl.Cosmology(Omega_c=0.26066676,
                      Omega_b=0.048974682,
                      h=0.6766,
                      sigma8=0.8102,
                      n_s=0.9665)
s8 = ccl.sigma8(cosmo)
time_cosmo = time.time()-start

# Create tracers
start = time.time()
gc_t = [ccl.NumberCountsTracer(cosmo, False, (zz, nz), (zz, bz)) for nz in nzs]
time_gc_tracers = time.time()-start
start = time.time()
wl_t = [ccl.WeakLensingTracer(cosmo, (zz, nz)) for nz in nzs]
time_wl_tracers = time.time()-start

# Compute power spectra
n_ell = 15
l_arr = np.logspace(np.log10(10), np.log10(3000), n_ell)
start = time.time()
cls = np.zeros([2*nbins, 2*nbins, n_ell])
for i1, t1 in enumerate(gc_t):
    for i2, t2 in enumerate(gc_t[i1:]):
        cls[i1, i1+i2, :] = ccl.angular_cl(cosmo, t1, t2, l_arr)
time_clgg = time.time()-start
start = time.time()
for i1, t1 in enumerate(gc_t):
    for i2, t2 in enumerate(wl_t):
        cls[i1, nbins + i2] = ccl.angular_cl(cosmo, t1, t2, l_arr)
time_clgl = time.time()-start
start = time.time()
for i1, t1 in enumerate(wl_t):
    for i2, t2 in enumerate(wl_t[i1:]):
        cls[nbins + i1, nbins + i1 + i2] = ccl.angular_cl(cosmo, t1, t2, l_arr)
time_clll = time.time()-start

time_all = time.time()-start_all

print("Timing summary:")
print(" - Cosmology and Pk: %.3lf" % time_cosmo)
print(" - GC tracers:       %.3lf" % time_gc_tracers)
print(" - WL tracers:       %.3lf" % time_wl_tracers)
print(" - Cl gg:            %.3lf" % time_clgg)
print(" - Cl gl:            %.3lf" % time_clgl)
print(" - Cl ll:            %.3lf" % time_clll)
print("--------------------------")
print(" Total:              %.3lf" % time_all)
beckermr commented 5 years ago

So it is very likely that we will want to do a real-space analysis as well. In that case, the time is probably going to be dominated by the Limber integrator. Could you profile that too?

Also, to answer your question directly: we don't know how fast is fast enough. Generally, MCMC chains don't have a ton of parallelism (except emcee), so single-threaded performance and adjusting the integration settings will be concerns as we move forward.
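
(For the emcee case, the parallelism comes from farming the walker evaluations out to a pool. A minimal sketch with a stand-in likelihood, not firecrown's actual one:)

import numpy as np
import emcee
from multiprocessing import Pool

def log_prob(theta):
    # Stand-in for an expensive 3x2pt likelihood evaluation
    # (this would be the CCL-based calculation above).
    return -0.5 * np.sum(theta**2)

if __name__ == "__main__":
    ndim, nwalkers = 5, 32
    p0 = np.random.randn(nwalkers, ndim)
    with Pool() as pool:
        sampler = emcee.EnsembleSampler(nwalkers, ndim, log_prob, pool=pool)
        sampler.run_mcmc(p0, 100)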

damonge commented 5 years ago

The Hankel transforms are very fast with FFTLog:

Timing summary:
 - Cosmology and Pk: 1.569
 - GC tracers:       0.105
 - WL tracers:       0.393
 - Cl gg:            1.016
 - Cl gl:            1.507
 - Cl ll:            0.724
 - Xi gg:            0.073
 - Xi gl:            0.122
 - Xi ll+:           0.074
 - Xi ll-:           0.067
--------------------------
 Total:              5.650

Updated script:

import numpy as np
import pyccl as ccl
from scipy.special import erf
import time

# Prepare redshift distribution and bias
nbins = 10
z_edges = np.linspace(0, 2.5, nbins + 1)
z_sigmas = 0.05 * (1 + 0.5 * (z_edges[1:] + z_edges[:-1]))
numz = 512
zz = np.linspace(0, 3, numz)
nz_all = zz**2 * np.exp(-(zz / 0.8)**1.5)
nzs = 0.5 * (erf((z_edges[1:, None] - zz[None, :]) / np.sqrt(2 * z_sigmas[:, None]**2)) -
             erf((z_edges[:-1, None] - zz[None, :]) / np.sqrt(2 * z_sigmas[:, None]**2)))
nzs *= nz_all[None, :]
bz = 0.95 * (1 + zz)

start_all = time.time()
# Set up cosmology object and compute power spectrum
start = time.time()
cosmo = ccl.Cosmology(Omega_c=0.26066676,
                      Omega_b=0.048974682,
                      h=0.6766,
                      sigma8=0.8102,
                      n_s=0.9665)
s8 = ccl.sigma8(cosmo)
time_cosmo = time.time()-start

# Create tracers
start = time.time()
gc_t = [ccl.NumberCountsTracer(cosmo, False, (zz, nz), (zz, bz)) for nz in nzs]
time_gc_tracers = time.time()-start
start = time.time()
wl_t = [ccl.WeakLensingTracer(cosmo, (zz, nz)) for nz in nzs]
time_wl_tracers = time.time()-start

# Compute power spectra
n_ell = 100
l_arr = np.logspace(np.log10(10), np.log10(30000), n_ell)
cls = np.zeros([2*nbins, 2*nbins, n_ell])
start = time.time()
for i1, t1 in enumerate(gc_t):
    for i2, t2 in enumerate(gc_t[i1:]):
        cls[i1, i1+i2, :] = ccl.angular_cl(cosmo, t1, t2, l_arr)
time_clgg = time.time()-start
start = time.time()
for i1, t1 in enumerate(gc_t):
    for i2, t2 in enumerate(wl_t):
        cls[i1, nbins + i2] = ccl.angular_cl(cosmo, t1, t2, l_arr)
time_clgl = time.time()-start
start = time.time()
for i1, t1 in enumerate(wl_t):
    for i2, t2 in enumerate(wl_t[i1:]):
        cls[nbins + i1, nbins + i1 + i2] = ccl.angular_cl(cosmo, t1, t2, l_arr)
time_clll = time.time()-start

# Compute correlation functions
n_theta = 15
th_arr = np.logspace(0., 2., n_theta)/60.
# xi_gg
start = time.time()
for i1, t1 in enumerate(gc_t):
    for i2, t2 in enumerate(gc_t[i1:]):
        wth = ccl.correlation(cosmo, l_arr, cls[i1, i1 + i2], th_arr,
                              corr_type='gg', method='fftlog')
time_xigg = time.time()-start
# xi_+
start = time.time()
for i1, t1 in enumerate(wl_t):
    for i2, t2 in enumerate(wl_t[i1:]):
        wth = ccl.correlation(cosmo, l_arr, cls[nbins + i1, nbins + i1 + i2], th_arr,
                              corr_type='l+', method='fftlog')
time_xilp = time.time()-start
# xi_-
start = time.time()
for i1, t1 in enumerate(wl_t):
    for i2, t2 in enumerate(wl_t[i1:]):
        wth = ccl.correlation(cosmo, l_arr, cls[nbins + i1, nbins + i1 + i2], th_arr,
                              corr_type='l-', method='fftlog')
time_xilm = time.time()-start
# xi_T
start = time.time()
for i1, t1 in enumerate(gc_t):
    for i2, t2 in enumerate(wl_t):
        wth = ccl.correlation(cosmo, l_arr, cls[i1, nbins + i2], th_arr,
                              corr_type='gl', method='fftlog')
time_xigl = time.time()-start

time_all = time.time()-start_all

print("Timing summary:")
print(" - Cosmology and Pk: %.3lf" % time_cosmo)
print(" - GC tracers:       %.3lf" % time_gc_tracers)
print(" - WL tracers:       %.3lf" % time_wl_tracers)
print(" - Cl gg:            %.3lf" % time_clgg)
print(" - Cl gl:            %.3lf" % time_clgl)
print(" - Cl ll:            %.3lf" % time_clll)
print(" - Xi gg:            %.3lf" % time_xigg)
print(" - Xi gl:            %.3lf" % time_xigl)
print(" - Xi ll+:           %.3lf" % time_xilp)
print(" - Xi ll-:           %.3lf" % time_xilm)
print("--------------------------")
print(" Total:              %.3lf" % time_all)
damonge commented 5 years ago

> single-threaded performance and adjusting the integration settings will be concerns as we move forward.

Right, I just want to check that this is not necessarily a priority for CCL right now.

beckermr commented 5 years ago

This is not clear to me. The DES likelihood I was trying to run took 10 seconds per evaluation. I suspect I was using more ell values for the C_ells and need to decrease that. However, CLASS plus the other initializations is only ~2 seconds of the 5.6 s total here. So we could easily speed the code up by ~2x with OpenMP, potentially more if people are evaluating the C_ells at more than 100 ell values.

damonge commented 5 years ago

Yes, I think this is likely to be the case, and I agree about parallelization (although you may be better off just running 16 separate chains than parallelizing one chain over 16 cores).

beckermr commented 4 years ago

Closing, as I don't think there is a real to-do here. The code runs as fast as it runs.

damonge commented 4 years ago

OK, but if we decide that that isn't fast enough, we still have a couple of tricks up our sleeve to make it faster (beyond parallelization).