Closed joezuntz closed 5 years ago
@joezuntz Currently implementing this. A few questions:
@joezuntz More questions:
@damonge thanks for this!
Starting with first one:
Yes, this should be an alternative to TXTwoPointReal. Most of the work on that is in a branch, TXTwoPointReal.
Good question! Since the maps are interesting for other purposes I'd suggest splitting them out into a separate Stage.
Yep. great.
Next set:
At the moment the DiagnosticMaps class will validate non-healpix maps but doesn't have the two convenience methods to turn them into more useful data (i.e. it will generate a complete healpix array but not turn the pixel/value numbers into a 2D array yet). Adding to that would be v useful - thanks!
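The kind of convenience methods meant here could look like this numpy sketch (the function names and signatures are illustrative, not the actual DiagnosticMaps API):

```python
import numpy as np

def pixels_to_healpix(pix, values, npix, fill_value=np.nan):
    """Scatter sparse (pixel index, value) pairs into a complete
    healpix-ordered array, with unseen pixels set to fill_value."""
    full = np.full(npix, fill_value)
    full[pix] = values
    return full

def pixels_to_2d(pix, values, nx, ny, fill_value=np.nan):
    """The same idea for a flat-sky map: unravel 1D pixel indices
    into a 2D (ny, nx) array."""
    flat = np.full(nx * ny, fill_value)
    flat[pix] = values
    return flat.reshape((ny, nx))
```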
tomography_catalog should always be HDF. I've assumed that the shear_catalog is FITS for now, since that's what we have at the moment (we can update this later). I've put examples of all of these here: /global/projecta/projectdirs/lsst/groups/WL/users/zuntz/data/examples-1/ Broadly, the tomo catalog has an array /tomography/bin containing the bin number for each galaxy.
Thanks a lot, that makes sense!
OK, about my question 1.2 (i.e. whether maps should be calculated in a previous stage): I'm actually gonna assume that this is not the case, and will generate those maps at this stage. I think this makes sense if the real-space counterpart is doing the analogous work of splitting maps into tomographic bins. I will however assume that the mask and some systematics maps are part of the input, and those will set the pixelization scheme to be used for the density/gamma maps.
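As an illustration of the map-making step described above, here is a minimal numpy sketch of building a tomographic overdensity map on the mask's pixelization (all names here are made up for the example):

```python
import numpy as np

def overdensity_map(pix, mask, npix):
    """Build delta = n/nbar - 1 from per-object pixel indices,
    using only pixels where the mask is nonzero."""
    counts = np.bincount(pix, minlength=npix).astype(float)
    good = mask > 0
    nbar = counts[good].mean()   # mean count over unmasked pixels
    delta = np.zeros(npix)
    delta[good] = counts[good] / nbar - 1.0
    return delta
```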
One other question: I am not sure what the difference between tomography_catalog and shear_catalog is. What additional information is there in tomography_catalog that isn't in shear_catalog?
Related question: let's say we want to use different samples for shear and clustering (as will probably be the case). Should we assume different input catalogs for each of them? If so, are both of them shear_catalogs (even the lenses), each with an associated tomography_catalog? It might be that the answer to this question is "we don't want to assume different samples for the time being".
That's fine re: the maps - we can look at splitting things up later.
What additional information is there in tomography_catalog that isn't in shear_catalog?
The tomo catalog just contains:

- the tomographic bin number for each galaxy
- shear calibration values for each bin
It doesn't repeat the other info in the shear cat. Since we may want to experiment with different binnings I wanted to keep these separate.
Related question
Yep, we'll definitely want to use different catalogs for the two sources eventually. But let's assume they are the same catalogs for now.
OK, sounds good.
The most critical part to begin with is generalizing DiagnosticMaps. I am kind of hesitant regarding where to do this, since part of the functionality lives in descformats and part of it lives here (e.g. the pixelization functions and the statistics stuff you wrote).
I'm tempted to go for the easiest solution: implement everything in TXPipe and let you decide afterwards which parts should migrate to descformats. Does that sound OK?
implement everything in TXPipe and let you decide afterwards which parts should migrate to descformats. Does that sound OK?
Yes, that's a fine way forward. It's easier to move code than to write it.
OK, I'll get to work now.
Final question I just thought about:
shear calibration values for each bin
Sorry for my ignorance about lensing: is this the same value for every object in a given bin? If so, we may be incurring a lot of redundancy: you'd only need the bin number for each object and then a small dictionary mapping bin numbers to shear calibration values. Maybe this is irrelevant.
There are two terms in the calibration values. The first is the selection response, R_s, which quantifies the bias associated with cutting to this specific tomographic bin and applying the SNR/size cuts; it has shape (n_bin x 2 x 2), since we have it for e1, e2, and the cross terms.
The second is R_gamma, which is the noise and model bias on the shear per object. It's (n_obj x 2 x 2), and it can't be reduced to a single per-bin number because you need to include it per-object in the 2pt calculations.
For the case of going via a map, as you're doing, that's fairly straightforward: for each pixel you compute the weighted average from eqn. 4 of https://arxiv.org/pdf/1708.01533.pdf. It's a bit more of a pain for correlation function measurements made directly from catalogs.
We should probably worry about the higher noise on each pixel's R value because there's only a limited number of objects in it. Need to think this through if we want to go down to e.g. arcmin pixels.
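A rough numpy sketch of that per-pixel calibration, applying the averaged response as &lt;R&gt;⁻¹ &lt;e&gt; in each pixel (all names here are illustrative, not TXPipe code):

```python
import numpy as np

def calibrated_shear_map(pix, e, R, npix):
    """Per-pixel metacal calibration: accumulate sum(e) and sum(R)
    in each pixel, then solve <R> g = <e> for the calibrated shear g.
    (Solving with sums is equivalent, since the counts cancel.)
    pix: (n_obj,) pixel indices; e: (n_obj, 2); R: (n_obj, 2, 2)."""
    count = np.bincount(pix, minlength=npix)
    sum_e = np.zeros((npix, 2))
    sum_R = np.zeros((npix, 2, 2))
    for i in range(2):
        sum_e[:, i] = np.bincount(pix, weights=e[:, i], minlength=npix)
        for j in range(2):
            sum_R[:, i, j] = np.bincount(pix, weights=R[:, i, j],
                                         minlength=npix)
    g = np.full((npix, 2), np.nan)
    good = count > 0
    # solve the 2x2 system in each occupied pixel
    g[good] = np.linalg.solve(sum_R[good], sum_e[good][..., None])[..., 0]
    return g
```

Pixels with few objects will of course give a noisy &lt;R&gt;, which is the concern raised above.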
OK, here's a rant: it's kind of annoying to have tomography_catalog and shear_catalog separately. What is the reason for this? Can we at least assume that the ordering in both catalogs is the same? If not, iterating through them simultaneously is quite impossible (unless I'm being stupid, which is quite possible)
There's a reason for this: we will want to keep a fixed shear catalog for many different analyses, but the tomography catalog will change as we select different objects and different redshift binnings. It changes much more quickly, and we don't want to copy the entire shear catalog around unless we have to, as it's pretty big.
They are constructed so that the ordering is the same, so you can just iterate through them together.
BTW - something that might be useful for constructing a map: there's a class ParallelStatsCalculator in txpipe/utils/stats.py that builds up the mean and variance of a map without needing to load the whole thing at once.
OK, great.
Yes, already using that! ;-)
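For anyone following along, the kind of streaming per-pixel statistics ParallelStatsCalculator provides can be sketched with a Welford-style accumulator (this is an illustrative stand-in, not the actual class or its method names):

```python
import numpy as np

class StreamingMapStats:
    """Accumulate per-pixel mean and variance of a map one chunk at a
    time (Welford's algorithm), so the full catalog never has to be
    in memory at once."""
    def __init__(self, size):
        self.count = np.zeros(size)
        self.mean = np.zeros(size)
        self.m2 = np.zeros(size)

    def add_data(self, pix, values):
        for p, v in zip(pix, values):
            self.count[p] += 1
            delta = v - self.mean[p]
            self.mean[p] += delta / self.count[p]
            self.m2[p] += delta * (v - self.mean[p])

    def collect(self):
        variance = np.full_like(self.mean, np.nan)
        seen = self.count > 1
        variance[seen] = self.m2[seen] / self.count[seen]
        return self.count, self.mean, variance
```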
OK @joezuntz, a new day, a whole new set of questions. I have written a mapper that makes use of ParallelStatsCalculator, and I want to make sure that what I'm doing makes sense. This mapper needs to generate complete maps that are loaded in memory, since NaMaster is not MPI-parallelized (it is OpenMP-parallelized, though) and can't deal with maps that are distributed across nodes.
Two questions about it:

1. If I pass `sparse=False` when initializing a `ParallelStatsCalculator`, will the `calculate` method return arrays of size `size` (where `size` is the quantity passed at initialization too)?
2. With `sparse=False`, will MPI still work? I.e. will the histogramming carried out by `calculate` be done in parallel and then the results merged into a single histogram?

I could probably figure all of this out by reverse-engineering the code, but I thought I'd ask first rather than spend 2h doing that.
Yep, that's right.
Yep, MPI will work with `sparse=False`.
Need to think about how MPI vs OpenMP will work - may be fiddly to use both.
Loads of progress on this by @ellongley at the sprint week and before - this is now running nicely. Still need to add the calibration.
Closing to open more specific sub-issues
A Fourier-space version of the two-point measurement.
Needs to cope with metacalibration and also to run at scale of course.
Current plan is to wrap NaMaster.