API propopsal - Githubissues

slosar commented 3 years ago

This issue is to discuss the concrete proposal for the API for TJPCov to be implemented by Felipe and Sukhdeep as discussed in the meet up yesterday.

A few uncontroversial notes:

TJPCov should be basically a library to be used by both TXPipe and augur and potentially also firecrown
It should also has a stand-alone driver for quick check.
It will have CCL and sacc as a dependency and it will be dependency in the pieces of software that use it.

The basic object should be CovarianceCalculator This object is initialized by specifying:

fiducial cosmology as CCL Cosmology object
a set of sacc tracers
a set of window functions in HealSparse format

The window function could be part tracer and optionally just shoved in sacc. I'm not very keen on this but it is becoming clearer why sukhdeep thinks this is a good idea. The window function could also be optionally just fsky specifications.

This object would then have a method get_gaussian_covariance which will then return the covariance in the appropriate format. It would also have things like get_connected_covariance (for connected part) and get_ss_covariance (for super-sample contribution). These will by default take information from sacc, but one would be able to manually specify bins. They will also have a backend optional argument to allow using alternative engines.

The remaining issues:

you need to somehow specify the fiducal power spectra for tracers
are we just assuing a linearly biased power spectrum?
how do we pass the bias parameters?

All in all, the docstrings should look something like:



class CovarianceCalculator:
    def __init__ (self, C, sacc, window = None)
    """
    Parameters
    ----------
    C : CCL Cosmology object
        Fiducial Cosmology used in calculations
    sacc : sacc file
        sacc object containing the tracers and binning to be used in covariance calculation
    window : None, dict, float
        If None, used window function specified in the sacc file
        if dict, the keys are sacc tracer names and values are either HealSparse inverse variance
        maps 
        if float it is assumed to be f_sky value
    """    

    def get_gaussian_covariance(self,backed='default', **kwargs):
    """
    Returns the disconnected part of covariance

    Parameters
    ----------
    backend : str
        'default' for the built-in engine
        'NaMaster' for calculation using NaMaster
    **kwargs :
        optional back-end dependend options

    Returns:
    cov : 2D numpy array
        Returns covariance matrix as a numpy array for all bins defined in the sacc file
    """

    def get_sacc_with_cvoariance(self, gaussian = True, connected = False, SSC = False,  backend='default', **kwargs):
    """
    Returns the sacc file with recomputed covariance matrix. 

    Parameters
    ----------
    gaussian: bool
        Calculate Gaussian part of covariance matrix
    connected: bool
        Calculated the connected part of the covariance matrix
    SSC : bool
        Calcualte the super-sample part of covariance matrix
    backned: str
        Which backed to use -- see docstring for get_gaussian_covariance
    **kwargs :
        optional back-end dependend options

    Returns:
    sacc : sacc object
       sacc object that was specified initially with a recalculate covariance matrix
    """

joezuntz commented 3 years ago

The missing piece here is a (simplified) systematics model, describing bias, as you mention, but also potentially other things like multiplicative biases that we might want to include in a covariance. (I think the other things that will generate a fiducial power spectra all come from the cosmology).

So I think you're actually asking for an API/data structure for that.

sukhdeep2 commented 3 years ago

Thank you for writing down the concrete proposal @slosar . I agree with most of what you wrote. I do think it will be better if the window is a part of the sacc data structure given the possibility of large number of tracers / bins etc and other possible use cases for the window (e.g. modeling pseudo-Cl). If you are worried about file size, we can keep it separate too.

I also agree with what @joezuntz mentioned. For that, we have a parser function which can read in the default parameters from a yaml file. I tried to keep it consistent with Firecrown/Txpipe etc, so we should be able to read in the same yaml file. Path to that file should be an argument of TJPCov. In terms of modifications to the covariance during the analysis (e.g. to do parameter marginalization with Gaussian priors), I think that choice and modification might be better made inside something like Firecrown? TJPCov should just return the default observable covariance. We will probably want to avoid computing the same covariance twice for two different analysis choices.

slosar commented 3 years ago

@joezuntz Jo, what do you think about saving the window function in the sacc file? Note that this is potentially a high-resolution healpix file per tracer, which would blow up the size of the sacc. Would that work? Do we need to extend sacc or can this fit within the current scheme (I don't think so).

Also, if we store the windows, we could also store the fiducial power spectrum in sacc. Or perhaps we could just, at least initially, smooth the observed power spectrum and use that as the signa power for the purpose of covariance matrix?

damonge commented 3 years ago

Storing windows in sacc sounds dangerous to me. These sacc files will be mostly used for likelihood etc., and creating huge files just for covariance calculation seems wasteful to me. This will also duplicate storage, since the window functions are surely gonna be stored somewhere else too. TXPipe will have direct access to these windows and therefore could pass those directly when initializing the covariance calculator. firecrown could read these from files listed in the corresponding section of each tracer in the yaml file.

Edit: moreover, healsparse has already defined a storage format for large maps that would be inefficient to duplicate in sacc, I think.

Unrelated, just to make sure we're on the same page: we started thinking about what we need for halo model NG covariances in CCL (see here).

mishakb commented 3 years ago

I also think that putting the windows in another place than sacc is better. These can be stored rather in a file that can be accessed by more than one other modules/software as needed like TXPipe/Firecrown etc.

sukhdeep2 commented 3 years ago

Sorry, I'm getting little confused about sacc. Sacc files already contain nz, which is the window along the z direction. Why do we want to treat the angular window differently? Most of the times when one makes different choices in the analysis (e.g. magnitude cuts, photoz cuts, photometry cuts, etc.) the window changes. So the number of Sacc files and window files we have will not be too different.

I agree that the window itself does not have to contained inside the Sacc data structure. I thought Sacc was to be designed such that one can only load the things needed in the memory and the rest of data stays on the disk. But I don't know much about implementations of such things, so I won't argue on this.

If window is too big for Sacc, it can be stored elsewhere, but Sacc should contain information about the window of the datasets that was used to compute the 2-point functions. So we can perhaps store the path to window files as a string?

damonge commented 3 years ago

For context, we will want to resolve sky masks to ~arcmin resolution at least. This means ~1GB per window function (more if we want smaller resolution, which I think is likely).

The problem with storing file paths is that then the sacc file is not very portable.

joezuntz commented 3 years ago

Who are the users of the angular windows? If they are just for generating a fixed cosmology which we would then distribute then I don't think we should put them in sacc. if anyone doing parameter estimation needs them then they should.

sukhdeep2 commented 3 years ago

In my opinion, saying that sacc should contain information about window is same as saying sacc should know about the dataset that was used to create it. As I said earlier, full window does not need to live inside the sacc itself, a pointer/path to it may suffice. I understand the argument about portability in case we use paths. But if we are generating sacc+window on one machine and analyzing it on another, then in my opinion that is an argument in favor of window staying inside sacc.

Angular windows can matter when modeling the two point functions. Pseudo-Cl is an obvious example. For an example of an effect on correlation functions, see appendix D of https://arxiv.org/pdf/1811.06499.pdf . Depending on the analysis choices, we may need the window to apply a simple correction or to bin the theory etc. For most of these cases, we will likely need the correlation function / power spectra of the window. So if file size is an issue, we can save the two point functions of the window rather than full map. The maps may still be needed for some tests / modeling for higher order effects, etc.

For covariance, we need more as covariance cares about different combinations of 4-windows. Those also enter as two point functions under certain approximations and can in principle be computed and then stored. It will be more complicated and better if handled by TJPcov. We will likely have a routine within TJPcov to compute the required window correlations and we can design it such that it can called upon anytime and will return all the return two-point window correlation needed for covariance. If sacc only contains information needed during parameter estimation, I think saving these two point functions is not a preferred choice.

As an aside, I think I'm missing something. Since file size seems to be an argument against storing windows, is ~20-50GB for storing sacc+window too much when we will be using TBs of space for some other data products? Remember the number of sacc files and window files will in principle be very similar.

Finally, if people think the discussion is veering off topic or have strong feelings about saac, I'm ok with going with Anze's original proposal.

slosar commented 3 years ago

Hi Sukhdeep, so one of the main points of sacc was to so much housekeeping, but in particular portability, so that you can efficiently send the LSST 3x2 point to a collaborator or put it on a website so that people can download.

What about this compromise:

we store the URL to the window function in sacc.
the url would be a simple 'file://' for internal use on nersc, but will point a downloadable site when we release the data or when sharing more widely with people inside collaboration but outside NERSC.
this would have an advantage in that other people could reproduce also TJPCov piece
you could imagine storing main and backup URL, with main being a file:// and a backup being an http:// or ftp:// .
optional: actual sacc library could have a simple caching scheme by which it would download and remember the file for the future

sukhdeep2 commented 3 years ago

@slosar Your proposal sounds good to me.

sukhdeep2 commented 3 years ago

To reply some of the questions in the updated original post from @slosar, the bias parameters, such as galaxy bias, and which bias model to use, should handled in a similar way as the rest of the pipeline (e.g. TXPipe, fire crown etc.). We should use same power spectra model that will be used in analysis. This is relevant for galaxy shear also, eg. if we have different baryon models.

I assumed that fiducial cosmology is passed in a yaml file and the tracer information is in sacc, so in example notebooks we have a simple yaml parser and then define our own CCL cosmology and tracer objects. Passing a CCL cosmology object is fine too. Power spectra models/parameters should be identifiable either from cosmology object or input parameter (yaml) file and TJPCov should be able to define required CCL objects and call the right ccl functions.

joezuntz commented 3 years ago

There is already a CCL method to read a cosmology from a yaml file - not sure from the above if you're using that. The tracer info is indeed in the sacc, and we have a separate metadata file in case useful with info like the sky area.

sukhdeep2 commented 3 years ago

I was not aware of the CCL method. I copied the parser from a fire crown example. We should use the CCL method in this case. So in API, input should be a yaml file and the CCL parser should handle bias parameters, power spectra definitions, etc. We will generate CCL tracer object from sacc and then pass those along to get the required power spectra. Does that sound ok, @slosar @joezuntz ?

LSSTDESC / TJPCov

API propopsal #5