fjaviersanchez opened 2 years ago
I think in particular the two point correlation function is something we are not going to run as part of the test suite, at the very least due to blinding considerations.
See this comment too.
Single-histogram plotting; approximate runtimes on an un-salloc'ed notebook node at NERSC:
Some other resources, based on Spark:

- https://arxiv.org/pdf/1905.09034.pdf: AXS: A framework for fast astronomical data processing based on Apache Spark
- https://arxiv.org/pdf/1807.03078.pdf: Analysing billion-objects catalogue interactively: Apache Spark for physicists
Proposal from Dan Taranu:

```python
import lsst.daf.butler as dafButler
import numpy as np
import matplotlib.pyplot as plt

butler = dafButler.Butler('/repo/dc2', collections=['2.2i/truth_summary'])
refs = list(butler.registry.queryDatasets(datasetType='truth_summary'))

bins = np.arange(361)  # 1-degree RA bins over [0, 360]
counts = np.zeros(len(bins) - 1)

# Read only the 'ra' column of each truth_summary table and accumulate
parameters = {'columns': ('ra',)}
for ref in refs:
    ra = butler.get(ref, parameters=parameters)['ra'].values
    counts_bin, _ = np.histogram(ra, bins=bins)
    counts += counts_bin

plt.step(bins[:-1], counts)
plt.show()
```
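The per-dataset loop above relies on histograms over disjoint chunks of data simply adding up. A minimal self-contained check of that pattern, with synthetic RA values standing in for the Butler-loaded columns (the synthetic data and chunking are assumptions for illustration only):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic stand-in for the per-dataset RA arrays returned by butler.get()
chunks = [rng.uniform(0.0, 360.0, size=1000) for _ in range(5)]

bins = np.arange(361)  # 1-degree RA bins over [0, 360]
counts = np.zeros(len(bins) - 1)
for ra in chunks:
    counts_bin, _ = np.histogram(ra, bins=bins)
    counts += counts_bin  # histograms over disjoint chunks add up

# The accumulated counts match one histogram over all the data at once
all_ra = np.concatenate(chunks)
assert np.array_equal(counts, np.histogram(all_ra, bins=bins)[0])
assert counts.sum() == all_ra.size
```

This additivity is also what makes the per-dataset loop embarrassingly parallel: partial histograms from separate workers can be summed at the end.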
See @patricialarsen's GitHub descqa fork for an implementation of MPI parallelization at the DESCQA `run_master` level (`run_master_slurm`). Parallelization discussions:
Something we should try to address, if we decide to use DESCQA as our V&V platform, is how scalable the current tests are and which tests need to be rewritten. A first example would be the two-point tests (which I think can now be parallelized using the latest TreeCorr version with MPI?).
Related to this, before putting a lot of time and effort into improving these tests, we need to check which tests are meaningful to run and on which samples. I can imagine that two-point tests may require blinding and sample selection before being run (plus some auxiliary data products).
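The `run_master`-level MPI parallelization mentioned above boils down to splitting the catalog files across ranks and reducing the partial results. A library-free sketch of the splitting step, where `split_round_robin`, the file names, and the rank count are all hypothetical (the `rank`/`size` names mirror mpi4py's `comm.Get_rank()`/`comm.Get_size()`, but no MPI is needed for this illustration):

```python
def split_round_robin(items, rank, size):
    """Return the subset of `items` assigned to this rank, round-robin."""
    return items[rank::size]

files = [f'truth_summary_{i}.parq' for i in range(10)]  # hypothetical names
size = 4  # pretend there are 4 MPI ranks

per_rank = [split_round_robin(files, r, size) for r in range(size)]

# Each file lands on exactly one rank, so a final reduce step
# (e.g. summing per-rank histograms) covers the whole catalog once.
flat = sorted(f for part in per_rank for f in part)
assert flat == sorted(files)
```

For the two-point tests themselves, the per-rank work is pair counting rather than histogramming, which is why the note above about TreeCorr's MPI support matters: the expensive `O(N^2)` part has to be distributed, not just the file I/O.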