LSSTDESC / TXPipe

Pipeline elements for 3x2pt analysis (shear-shear, shear-density, density-density) for DC2
BSD 3-Clause "New" or "Revised" License

TXTwoPointFourier #9

Closed: joezuntz closed this issue 5 years ago

joezuntz commented 6 years ago

A Fourier-space version of the two-point measurement.

Needs to cope with metacalibration and also to run at scale, of course.

Current plan is to wrap NaMaster.
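
For orientation, wrapping NaMaster for a spin-0 field might look roughly like the sketch below, using the public pymaster interface (exact constructor signatures vary across pymaster versions). The file names, nside, and binning are placeholders, not the eventual TXPipe design:

```python
# Minimal sketch: a mode-coupling-corrected pseudo-C_ell from a masked
# density map, via NaMaster's python wrapper (pymaster). All inputs
# here are placeholders.
import healpy as hp
import pymaster as nmt

nside = 512
mask = hp.read_map("mask.fits")        # hypothetical survey mask
delta = hp.read_map("delta_g.fits")    # hypothetical overdensity map

field = nmt.NmtField(mask, [delta])    # spin-0; shear would be [gamma1, gamma2]
bins = nmt.NmtBin.from_nside_linear(nside, nlb=50)  # bandpowers of 50 multipoles

cl = nmt.compute_full_master(field, field, bins)
ell = bins.get_effective_ells()
```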

damonge commented 6 years ago

@joezuntz Currently implementing this. A few questions:

  1. Do we want this to be an alternative method to TXTwoPointReal? Currently there's only one twopoint.py file, which @ellongley used to implement the correlation function stuff. I'm wondering whether both methods will end up living in the same file or whether we should create different files that twopoint.py then calls depending on the input.
  2. Should the inputs of TXTwoPointFourier be the same as those of TXTwoPointReal? The Fourier-space method will at some point require maps of delta_g and gamma_{1,2}, which means we need to transform catalogs into maps. My question is whether we want this to be a separate stage in the pipeline or simply a method that lives in TXTwoPointFourier.
  3. TXTwoPointFourier will depend on parts of the pipeline that have not been written yet (e.g. the diagnostic maps or the mask). I will assume a minimal version of these inputs, and TXTwoPointFourier will probably need to be modified later on once the earlier stages have been populated. I just want to check that this is OK with you.

damonge commented 6 years ago

@joezuntz More questions:

  1. There doesn't seem to be a stable "DiagnosticMaps" class. It seems TXDiagnosticMaps is intended to output a DiagnosticMaps object from descformats, but that class doesn't have the same kind of functionality that pixel_schemes has (e.g. it only supports healpix). I'm happy to contribute to descformats' DiagnosticMaps to make it more general, but I want to check that this is the intended way forward.
  2. I am not sure what the format of shear_catalog and tomography_catalog is supposed to be. @ellongley's implementation of TXTwoPointReal seems to assume it's stored as an HDF5 file, while your implementation in sysmaps.py assumes it'll be FITS. Which is it?
joezuntz commented 6 years ago

@damonge thanks for this!

Starting with the first set:

  1. Yes, this should be an alternative to TXTwoPointReal. Most of the work on that is in a branch, TXTwoPointReal.

  2. Good question! Since the maps are interesting for other purposes I'd suggest splitting them out into a separate Stage.

  3. Yep, great.

Next set:

  1. At the moment the DiagnosticMaps class will validate non-healpix maps, but it doesn't yet have the convenience methods to turn them into more useful data (i.e. it will generate a complete healpix array but won't yet turn the pixel/value numbers into a 2D array). Adding that would be very useful - thanks!

  2. tomography_catalog should always be HDF. I've assumed that the shear_catalog is FITS for now, since that's what we have at the moment (we can update this later). I've put examples of all of these here: /global/projecta/projectdirs/lsst/groups/WL/users/zuntz/data/examples-1/. Broadly, the tomo catalog has an array /tomography/bin with the bin number for each galaxy (see the reading sketch below).
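
For concreteness, reading the bin assignments out of that layout could look something like this. A minimal sketch assuming the /tomography/bin dataset described above; the file name and the convention that unselected objects carry bin -1 are assumptions:

```python
import h5py
import numpy as np

# Sketch: read per-galaxy tomographic bin numbers from the HDF5
# tomography catalog. The file name and the -1 "unselected"
# convention are assumptions.
with h5py.File("tomography_catalog.hdf5", "r") as f:
    tomo_bin = f["/tomography/bin"][:]   # one integer per galaxy

# e.g. pick out the objects assigned to bin 0
in_bin0 = np.where(tomo_bin == 0)[0]
```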

damonge commented 6 years ago

Thanks a lot, that makes sense!

OK, about my question 1.2 (i.e. whether the maps should be calculated in a previous stage): I'm actually going to assume that this is not the case, and will generate those maps at this stage. I think this makes sense if the real-space counterpart is doing the analogous work of splitting the catalogs into tomographic bins. I will, however, assume that the mask and some systematics maps are part of the input; those will set the pixelization scheme to be used for the density/gamma maps.
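
As a rough illustration of the map-making I have in mind here, a sketch of building a delta_g overdensity map on a pixelization taken from an input healpix mask (the function and its arguments are illustrative, not the final interface):

```python
import numpy as np
import healpy as hp

def make_density_map(ra, dec, mask, nside):
    """Sketch: bin galaxies into healpix pixels and form an overdensity map.
    ra/dec in degrees; mask is a healpix map with the same nside."""
    pix = hp.ang2pix(nside, ra, dec, lonlat=True)
    counts = np.bincount(pix, minlength=hp.nside2npix(nside)).astype(float)
    good = mask > 0
    nbar = counts[good].mean()               # mean count in unmasked pixels
    delta = np.zeros_like(counts)
    delta[good] = counts[good] / nbar - 1.0  # overdensity where unmasked
    return delta
```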

One other question: I am not sure what the difference between tomography_catalog and shear_catalog is. What additional information is there in tomography_catalog that isn't in shear_catalog?

Related question: let's say we want to use different samples for shear and clustering (as will probably be the case). Should we assume different input catalogs for each of them? If so, do both samples (even the lenses) come as shear_catalogs with associated tomography_catalogs? It may be that the answer to this question is "we don't want to assume different samples for the time being".

joezuntz commented 6 years ago

That's fine re: the maps - we can look at splitting things up later.

What additional information is there in tomography_catalog that isn't in shear_catalog?

The tomo catalog just contains:

  * the tomographic bin number for each object
  * the shear calibration values for each bin

It doesn't repeat the other info in the shear cat. Since we may want to experiment with different binnings, I wanted to keep these separate.

Related question

Yep, we'll definitely want to use different catalogs for the two samples eventually. But let's assume they are the same catalogs for now.

damonge commented 6 years ago

OK, sounds good.

The most critical part to begin with is generalizing DiagnosticMaps. I am kind of hesitant regarding where to do this, since part of the functionality lives in descformats and part of it lives here (e.g. the pixelization functions and the statistics stuff you wrote).

I'm tempted to go for the easiest solution: implement everything in TXPipe and let you decide afterwards which parts should migrate to descformats. Does that sound OK?

joezuntz commented 6 years ago

implement everything in TXPipe and let you decide afterwards which parts should migrate to descformats. Does that sound OK?

Yes, that's a fine way forward. It's easier to move code than to write it.

damonge commented 6 years ago

OK, I'll get to work now.

Final question I just thought about:

shear calibration values for each bin

Sorry for my ignorance about lensing: is this a single value for each bin? If so, we may be incurring a lot of redundancy. You'd only need the bin number for each object and then a small dictionary mapping bin numbers to shear calibration values. Maybe this is irrelevant.

joezuntz commented 6 years ago

There are two terms in the calibration values. The first, for selection, is R_s, which quantifies the bias associated with cutting to this specific tomographic bin and with the SNR/size cuts; it is (n_bin x 2 x 2), since we have it for e1, e2, and the cross terms.

The second is R_gamma, the per-object noise and model bias on the shear. It's (n_obj x 2 x 2), and it can't be reduced to a single per-bin number because you need to include it per object in the 2pt calculations.

For the case of going via a map, as you're doing, that's fairly straightforward: for each pixel you compute the weighted average from eqn. 4 of https://arxiv.org/pdf/1708.01533.pdf. It's a bit more of a pain for correlation-function measurements made directly from catalogs.
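
To make that concrete, here's a rough sketch of the per-pixel calibration, loosely following the weighted-average approach above. The array names and shapes (e as n_obj x 2, R_gamma as n_obj x 2 x 2, weights w, a single 2 x 2 R_s for the bin) are assumptions for illustration:

```python
import numpy as np

def calibrated_shear_map(pix, e, R_gamma, w, R_s, npix):
    """Sketch: per-pixel calibrated shear <gamma>_p = <R>_p^-1 <e>_p,
    with <R>_p the weighted mean response in pixel p plus R_s.
    All array names and shapes here are illustrative assumptions."""
    e_sum = np.zeros((npix, 2))
    R_sum = np.zeros((npix, 2, 2))
    w_sum = np.zeros(npix)
    # Accumulate weighted sums per pixel (unbuffered, so repeated pixels add)
    np.add.at(w_sum, pix, w)
    np.add.at(e_sum, pix, w[:, None] * e)
    np.add.at(R_sum, pix, w[:, None, None] * R_gamma)
    gamma = np.zeros((npix, 2))
    good = w_sum > 0
    R_mean = R_sum[good] / w_sum[good, None, None] + R_s
    e_mean = e_sum[good] / w_sum[good, None]
    # Solve R <gamma> = <e> pixel by pixel (batched 2x2 solves)
    gamma[good] = np.linalg.solve(R_mean, e_mean[..., None])[..., 0]
    return gamma
```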

We should probably worry about the higher noise on each pixel's R value, since there are only a limited number of objects in each pixel. We'll need to think this through if we want to go down to e.g. arcmin pixels.

damonge commented 6 years ago

OK, here's a rant: it's kind of annoying to have tomography_catalog and shear_catalog separately. What is the reason for this? Can we at least assume that the ordering in both catalogs is the same? If not, iterating through them simultaneously is pretty much impossible (unless I'm being stupid, which is quite possible).

joezuntz commented 6 years ago

There's a reason for this: we will want to keep a fixed shear catalog for many different analyses, but the tomography catalog will change as we select different objects and different redshift binnings. It changes much more quickly, and we don't want to copy the entire shear catalog around too much unless we have to, as it's pretty big.

They are constructed so that the ordering is the same, so you can just iterate through them together.
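
Since the row orderings match, a chunked joint iteration over the FITS shear catalog and the HDF5 tomography catalog could be as simple as the sketch below (the 'g1' column name is a placeholder; only /tomography/bin is the layout described earlier):

```python
import fitsio
import h5py

def iterate_together(shear_file, tomo_file, chunk=100_000):
    # Sketch: walk two row-aligned catalogs in matching chunks.
    # The 'g1' column name is a placeholder.
    shear = fitsio.FITS(shear_file)[1]
    with h5py.File(tomo_file, "r") as t:
        tomo_bin = t["/tomography/bin"]
        n = tomo_bin.shape[0]
        for start in range(0, n, chunk):
            end = min(start + chunk, n)
            yield shear["g1"][start:end], tomo_bin[start:end]
```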

BTW - something that might be useful for constructing a map: there's a class ParallelStatsCalculator in txpipe/utils/stats.py that builds up the mean and variance of a map without needing to load the whole thing at once.

damonge commented 6 years ago

OK, great.

Yes, already using that! ;-)

damonge commented 6 years ago

OK @joezuntz, a new day, a whole new set of questions. I have written a mapper that makes use of ParallelStatsCalculator, and I want to make sure that what I'm doing makes sense. This mapper needs to generate complete maps that are loaded in memory, since NaMaster is not MPI-parallelized (it is OpenMP-parallelized, though) and can't deal with maps that are distributed across nodes.

  1. Can I assume that, if I pass sparse=False when initializing a ParallelStatsCalculator, the calculate method will return arrays of length size (where size is the value passed at initialization)?
  2. If sparse=False, will MPI still work? I.e., will the histogramming carried out by calculate be done in parallel and then the results merged into a single histogram?

I could probably figure all of this out by reverse-engineering the code, but I thought I'd ask first rather than spend two hours doing that.

joezuntz commented 6 years ago

  1. Yep, that's right.

  2. Yep, MPI will work with sparse=False.
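
Given the behaviour confirmed above, the core of the mapper loop might look roughly like this. A sketch only: the exact ParallelStatsCalculator call signature should be checked against txpipe/utils/stats.py, and the generator here is a stand-in for the real catalog iteration:

```python
from txpipe.utils.stats import ParallelStatsCalculator

# Sketch based on the behaviour described in this thread: with
# sparse=False, calculate returns dense arrays of length `size`,
# and per-process accumulations are merged under MPI. The call
# signature here is an assumption to be checked against the source.
npix = 12 * 512**2
stats = ParallelStatsCalculator(npix, sparse=False)

def pixel_values():
    # Stand-in generator yielding (pixel_index, values_in_pixel) pairs;
    # the real mapper would produce these by iterating the catalogs.
    yield 0, [0.1, 0.2]
    yield 1, [0.3]

# comm would be an mpi4py communicator when running under MPI
count, mean, variance = stats.calculate(pixel_values(), comm=None)
```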

joezuntz commented 6 years ago

Need to think about how MPI vs OpenMP will work - it may be fiddly to use both.

joezuntz commented 5 years ago

Loads of progress on this by @ellongley at the sprint week and before - this is now running nicely. We still need to add the calibration.

joezuntz commented 5 years ago

Closing this to open more specific sub-issues.