Evaluate scalability of DESCQA tests

LSSTDESC / SRV-planning

Repository to plan and coordinate some of the Science Release and Validation Working Group tasks


Evaluate scalability of DESCQA tests #1

Open fjaviersanchez opened 2 years ago

fjaviersanchez commented 2 years ago

Something we should address, if we decide to use DESCQA as our V&V platform, is how scalable the current tests are and which of them need to be rewritten. A first example would be the two-point tests (which I think can now be parallelized using the latest TreeCorr version with MPI?).

Related to this, before putting a lot of time and effort into improving these tests, we need to check which tests are meaningful to run, and on which samples. I can imagine that two-point tests may require blinding and sample selection (and some auxiliary data products) before being run.

nsevilla commented 2 years ago

I think the two-point correlation function in particular is something we are not going to run as part of the test suite, at the very least because of blinding considerations.

nsevilla commented 2 years ago

See this comment too.

nsevilla commented 2 years ago

Approximate runtimes for plotting a single histogram in a notebook on a NERSC node, without an salloc allocation:

using GCRCatalogs get_quantities (4 float64 columns) -> ~15 minutes
using pyarrow read_table in a loop over tracts -> ~7 minutes
spinning up some dask workers, as explained here -> ~1 minute (not counting the indexing step, which takes another minute)
using CosmoHub Hive over a Hadoop DB with the DC2 object catalog -> ~1 minute the first time, successive plots <30 s
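
The faster options in the list above all follow the same pattern: histogram each chunk of the catalog independently, then sum the per-chunk counts. Below is a minimal, standard-library-only sketch of that pattern. It does not use GCRCatalogs, pyarrow, or dask; `load_tract_ra` is a hypothetical stand-in that returns synthetic RA values, and the tract ids and object counts are made up for illustration. Threads are used only to keep the sketch self-contained; they help when the per-tract read is I/O-bound, while real runs would more likely use process-based or dask workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor

N_BINS = 360  # one-degree bins in right ascension, covering [0, 360)


def load_tract_ra(tract_id, n_objects=10_000):
    """Hypothetical stand-in for a per-tract read (e.g. one file per tract).

    Returns synthetic RA values in degrees, seeded so results are repeatable.
    """
    rng = random.Random(tract_id)
    return [rng.uniform(0.0, 360.0) for _ in range(n_objects)]


def histogram_tract(tract_id):
    """Map step: read one tract and return its per-bin counts."""
    counts = [0] * N_BINS
    for ra in load_tract_ra(tract_id):
        # Clamp to the last bin in case a value lands exactly on 360.0.
        counts[min(int(ra), N_BINS - 1)] += 1
    return counts


def histogram_all(tract_ids, max_workers=4):
    """Reduce step: sum the per-tract histograms bin by bin."""
    total = [0] * N_BINS
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for counts in pool.map(histogram_tract, tract_ids):
            for i, c in enumerate(counts):
                total[i] += c
    return total


tract_ids = list(range(8))  # illustrative ids, not real DC2 tract numbers
total_counts = histogram_all(tract_ids)
print(sum(total_counts))  # total number of objects histogrammed
```

Because the per-tract histograms are simply added, the result does not depend on the order in which workers finish, and only one tract's worth of data per worker needs to be in memory at a time.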

nsevilla commented 2 years ago

Some other resources, based on Spark:

https://arxiv.org/pdf/1905.09034.pdf: AXS: A framework for fast astronomical data processing based on Apache Spark
https://arxiv.org/pdf/1807.03078.pdf: Analysing billion-objects catalogue interactively: Apache Spark for physicists

nsevilla commented 2 years ago

Proposal from Dan Taranu:

import lsst.daf.butler as dafButler
import numpy as np
import matplotlib.pyplot as plt

butler = dafButler.Butler('/repo/dc2', collections=['2.2i/truth_summary'])

refs = list(butler.registry.queryDatasets(datasetType='truth_summary'))

# One-degree RA bins with edges at 0, 1, ..., 360
bins = np.arange(361)
counts = np.zeros(len(bins) - 1)

# Read only the 'ra' column of each dataset and accumulate the histogram
parameters = {'columns': ('ra',)}
for ref in refs:
    ra = butler.get(ref, parameters=parameters)['ra'].values
    counts_bin, _ = np.histogram(ra, bins=bins)
    counts += counts_bin

plt.step(bins[:-1], counts)

plt.show()

nsevilla commented 2 years ago

See @patricialarsen's GitHub descqa fork for an implementation of MPI parallelization at the descqa run_master level (run_master_slurm). Parallelization discussions:

nsevilla commented 1 year ago

The recurring tests we currently use to test DESCQA (srv_ngals, srv_readiness, srv_gaap, srv_external) run in less than 2 minutes over the whole of DP0.2, using Patricia's implementation via run_master_parallel.sh on 32 Perlmutter nodes. See this TER
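
For readers unfamiliar with splitting work across workers, here is a minimal sketch of one simple scheme: round-robin assignment of independent tests to ranks. This is not the implementation in the descqa fork or in run_master_parallel.sh, which may divide the work quite differently (for example, over the data rather than over tests). `assign_tests` is a hypothetical helper, and the rank count is illustrative; only the test names come from the comment above.

```python
def assign_tests(tests, n_ranks):
    """Round-robin assignment: rank r gets tests r, r + n_ranks, and so on."""
    if n_ranks < 1:
        raise ValueError("n_ranks must be at least 1")
    return [tests[rank::n_ranks] for rank in range(n_ranks)]


# Test names from the comment above; the number of ranks is illustrative.
tests = ["srv_ngals", "srv_readiness", "srv_gaap", "srv_external"]
assignment = assign_tests(tests, n_ranks=3)
for rank, rank_tests in enumerate(assignment):
    print(rank, rank_tests)
```

In an MPI setting each rank would compute the same assignment and run only its own slice, so no communication is needed to decide who runs what. With only a handful of tests, though, this scheme cannot use more ranks than there are tests, so scaling to many nodes also requires parallelism inside each test.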