Closed: slosar closed this issue 1 year ago
The missing piece here is a (simplified) systematics model, describing bias, as you mention, but also potentially other things like multiplicative biases that we might want to include in a covariance. (I think the other things that will generate fiducial power spectra all come from the cosmology.)
So I think you're actually asking for an API/data structure for that.
Thank you for writing down the concrete proposal, @slosar. I agree with most of what you wrote. I do think it will be better if the window is part of the sacc data structure, given the possibility of a large number of tracers/bins and other possible use cases for the window (e.g. modeling pseudo-Cl). If you are worried about file size, we can keep it separate too.
I also agree with what @joezuntz mentioned. For that, we have a parser function which can read in the default parameters from a yaml file. I tried to keep it consistent with Firecrown/TXPipe etc., so we should be able to read in the same yaml file. The path to that file should be an argument of TJPCov. In terms of modifications to the covariance during the analysis (e.g. to do parameter marginalization with Gaussian priors), I think that choice and modification might be better made inside something like Firecrown? TJPCov should just return the default observable covariance. We will probably want to avoid computing the same covariance twice for two different analysis choices.
@joezuntz Jo, what do you think about saving the window function in the sacc file? Note that this is potentially a high-resolution healpix map per tracer, which would blow up the size of the sacc. Would that work? Do we need to extend sacc, or can this fit within the current scheme (I don't think so)?
Also, if we store the windows, we could also store the fiducial power spectrum in sacc. Or perhaps we could just, at least initially, smooth the observed power spectrum and use that as the signal power for the purpose of the covariance matrix?
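That smoothing idea can be sketched very simply with a running mean over multipoles (the kernel choice here is ad hoc and purely illustrative; the real smoothing scheme would need testing):

```python
import numpy as np

rng = np.random.default_rng(0)
ell = np.arange(2, 2000)
cl_true = 1.0 / ell**2                      # toy underlying spectrum
cl_obs = cl_true * (1.0 + 0.1 * rng.standard_normal(ell.size))  # noisy "observed" C_ell

def smooth_cl(cl, window=15):
    """Running-mean smoothing of an observed C_ell, to stand in for the
    fiducial signal power in a covariance estimate (kernel is ad hoc)."""
    kernel = np.ones(window) / window
    padded = np.pad(cl, window // 2, mode="edge")  # pad so output keeps the input length
    return np.convolve(padded, kernel, mode="valid")

cl_smooth = smooth_cl(cl_obs)
```

The smoothed curve tracks the underlying spectrum with reduced scatter away from the lowest multipoles, which is all the covariance needs from a fiducial signal.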
Storing windows in sacc sounds dangerous to me. These sacc files will be mostly used for likelihoods etc., and creating huge files just for covariance calculation seems wasteful to me. This will also duplicate storage, since the window functions are surely going to be stored somewhere else too. TXPipe will have direct access to these windows and therefore could pass them directly when initializing the covariance calculator. Firecrown could read these from files listed in the corresponding section of each tracer in the yaml file.
Edit: moreover, healsparse has already defined a storage format for large maps that would be inefficient to duplicate in sacc, I think.
Unrelated, just to make sure we're on the same page: we started thinking about what we need for halo model NG covariances in CCL (see here).
I also think that putting the windows somewhere other than sacc is better. These can instead be stored in a file that can be accessed by more than one module/software as needed, like TXPipe/Firecrown etc.
Sorry, I'm getting a little confused about sacc. Sacc files already contain n(z), which is the window along the z direction. Why do we want to treat the angular window differently? Most of the time, when one makes different choices in the analysis (e.g. magnitude cuts, photo-z cuts, photometry cuts, etc.), the window changes. So the number of sacc files and the number of window files we have will not be too different.
I agree that the window itself does not have to be contained inside the sacc data structure. I thought sacc was designed such that one can load only the things needed into memory while the rest of the data stays on disk. But I don't know much about implementations of such things, so I won't argue on this.
If the window is too big for sacc, it can be stored elsewhere, but sacc should contain information about the window of the dataset that was used to compute the 2-point functions. So perhaps we could store the path to the window files as a string?
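A concrete sketch of that pointer idea (the key names and path below are invented for illustration, and a plain dict stands in for the sacc metadata mapping):

```python
# A plain dict standing in for sacc's free-form metadata mapping; the
# keys "window_map_path"/"window_format" are hypothetical, not a sacc
# convention, and the path is made up.
metadata = {
    "window_map_path": "windows/lens_bin0.hs.fits",  # hypothetical external file
    "window_format": "healsparse",
}

# The covariance code would dereference the pointer instead of
# carrying the (large) map inside the sacc file itself.
print(metadata["window_map_path"])
```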
For context, we will want to resolve sky masks to ~arcmin resolution at least. This means ~1GB per window function (more if we want finer resolution, which I think is likely).
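The ~1GB figure checks out back-of-envelope: a full-sky HEALPix map at roughly arcminute resolution (nside = 4096, pixel side ~0.86 arcmin) has 12·nside² pixels:

```python
import numpy as np

nside = 4096                      # HEALPix resolution parameter, ~arcmin pixels
npix = 12 * nside**2              # number of pixels on the full sky
pix_area_arcmin2 = 4 * np.pi * (180 * 60 / np.pi)**2 / npix  # sky area / npix
size_gb = npix * 8 / 1e9          # one float64 per pixel

print(npix)                # 201326592 pixels
print(pix_area_arcmin2)    # ~0.74 arcmin^2 per pixel
print(size_gb)             # ~1.6 GB at double precision (~0.8 GB at float32)
```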
The problem with storing file paths is that then the sacc file is not very portable.
Who are the users of the angular windows? If they are just for generating a fixed-cosmology covariance which we would then distribute, then I don't think we should put them in sacc. If anyone doing parameter estimation needs them, then they should go in.
In my opinion, saying that sacc should contain information about the window is the same as saying sacc should know about the dataset that was used to create it. As I said earlier, the full window does not need to live inside the sacc itself; a pointer/path to it may suffice. I understand the argument about portability in case we use paths. But if we are generating sacc+window on one machine and analyzing it on another, then in my opinion that is an argument in favor of the window staying inside sacc.
Angular windows can matter when modeling the two point functions. Pseudo-Cl is an obvious example. For an example of an effect on correlation functions, see appendix D of https://arxiv.org/pdf/1811.06499.pdf . Depending on the analysis choices, we may need the window to apply a simple correction or to bin the theory etc. For most of these cases, we will likely need the correlation function / power spectra of the window. So if file size is an issue, we can save the two point functions of the window rather than full map. The maps may still be needed for some tests / modeling for higher order effects, etc.
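For the "bin the theory" case, the angular window typically enters as a bandpower window matrix applied to the theory curve. A purely illustrative sketch with top-hat bandpowers (real bandpower windows would come out of the mask/pseudo-Cl machinery):

```python
import numpy as np

def tophat_bandpower_windows(ells, edges):
    """Build a bandpower window matrix W such that C_binned = W @ C_theory,
    with uniform weights inside each band (a deliberately simple choice)."""
    W = np.zeros((len(edges) - 1, len(ells)))
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_band = (ells >= lo) & (ells < hi)
        W[i, in_band] = 1.0 / in_band.sum()  # rows sum to 1
    return W

ells = np.arange(2, 1002)
edges = np.arange(2, 1003, 100)            # 10 bands of width 100
W = tophat_bandpower_windows(ells, edges)
cl_theory = 1.0 / ells**1.5                # toy theory curve
cl_binned = W @ cl_theory                  # one value per bandpower
```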
For the covariance, we need more, as the covariance cares about different combinations of four windows. Those also enter as two-point functions under certain approximations and can in principle be computed and then stored. It is more complicated, and better handled by TJPCov. We will likely have a routine within TJPCov to compute the required window correlations, and we can design it such that it can be called at any time and will return all the two-point window correlations needed for the covariance. If sacc only contains information needed during parameter estimation, I think saving these two-point functions there is not a preferred choice.
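To make the four-window point concrete: in the simplest approximation the Gaussian covariance needs only low-order moments of the mask, e.g. an effective sky fraction built from the second and fourth moments of the window. A numpy sketch (this particular weighting convention is one common choice, not necessarily what TJPCov will adopt):

```python
import numpy as np

def fsky_effective(w):
    """Effective sky fraction for a Gaussian covariance, built from
    mask moments: <w^2>^2 / <w^4>. For a binary mask this reduces to
    the plain fsky; apodized masks give something smaller."""
    return np.mean(w**2) ** 2 / np.mean(w**4)

rng = np.random.default_rng(1)
w = (rng.uniform(size=100_000) < 0.4).astype(float)  # toy binary mask, ~40% sky
print(fsky_effective(w))  # close to 0.4 for this binary mask
```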
As an aside, I think I'm missing something. Since file size seems to be an argument against storing windows, is ~20-50GB for storing sacc+window too much when we will be using TBs of space for some other data products? Remember the number of sacc files and window files will in principle be very similar.
Finally, if people think the discussion is veering off topic or have strong feelings about sacc, I'm ok with going with Anze's original proposal.
Hi Sukhdeep, so one of the main points of sacc was to do all this housekeeping, but in particular portability, so that you can efficiently send the LSST 3x2-point data to a collaborator or put it on a website for people to download.
What about this compromise:
@slosar Your proposal sounds good to me.
To reply to some of the questions in the updated original post from @slosar: the bias parameters, such as galaxy bias, and the choice of bias model should be handled in a similar way to the rest of the pipeline (e.g. TXPipe, Firecrown, etc.). We should use the same power spectrum model that will be used in the analysis. This is relevant for galaxy shear also, e.g. if we have different baryon models.
I assumed that the fiducial cosmology is passed in a yaml file and the tracer information is in sacc, so in the example notebooks we have a simple yaml parser and then define our own CCL cosmology and tracer objects. Passing a CCL cosmology object is fine too. Power spectrum models/parameters should be identifiable either from the cosmology object or the input parameter (yaml) file, and TJPCov should be able to define the required CCL objects and call the right CCL functions.
There is already a CCL method to read a cosmology from a yaml file - not sure from the above if you're using that. The tracer info is indeed in the sacc, and we have a separate metadata file in case useful with info like the sky area.
I was not aware of the CCL method. I copied the parser from a Firecrown example. We should use the CCL method in this case. So in the API, the input should be a yaml file, and the CCL parser should handle bias parameters, power spectrum definitions, etc. We will generate CCL tracer objects from sacc and then pass those along to get the required power spectra. Does that sound ok, @slosar @joezuntz ?
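For reference, the CCL yaml round-trip (`pyccl.Cosmology.read_yaml` / `write_yaml`) works with files roughly like the following; the exact field set varies between CCL versions, so treat this as a hedged example rather than a spec:

```yaml
# Fiducial cosmology for the covariance; CCL-readable.
# Field names follow CCL conventions but may differ by version.
Omega_c: 0.27
Omega_b: 0.045
h: 0.67
n_s: 0.96
sigma8: 0.83
```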
This issue is to discuss the concrete proposal for the API for TJPCov to be implemented by Felipe and Sukhdeep as discussed in the meet up yesterday.
A few uncontroversial notes:
The basic object should be `CovarianceCalculator`. This object is initialized by specifying:

- `Cosmology` object
- `sacc` tracers
- window functions in `HealSparse` format

The window function could be part of the tracer and optionally just shoved in `sacc`. I'm not very keen on this, but it is becoming clearer why Sukhdeep thinks this is a good idea. The window function could also optionally be just fsky specifications.

This object would then have a method `get_gaussian_covariance` which will then return the covariance in the appropriate format. It would also have things like `get_connected_covariance` (for the connected part) and `get_ss_covariance` (for the super-sample contribution). These will by default take information from `sacc`, but one would be able to manually specify bins. They will also have a `backend` optional argument to allow using alternative engines.

The remaining issues:
All in all, the docstrings should look something like:
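The post ends here. As a purely illustrative sketch of what such a skeleton and its docstrings might look like (method names from the proposal; the Gaussian part uses the simple fsky/Knox formula as a stand-in for the real calculation, and the constructor arguments here are placeholders for the Cosmology/sacc/window inputs above):

```python
import numpy as np

class CovarianceCalculator:
    """Sketch of the proposed API. The real object would be initialized
    with a CCL Cosmology, sacc tracers, and window functions; here a
    multipole range, fiducial spectra, and an fsky value stand in."""

    def __init__(self, ell, cl_fid, nl, fsky):
        self.ell = np.asarray(ell, dtype=float)
        self.cl_fid = np.asarray(cl_fid, dtype=float)   # fiducial signal power
        self.nl = np.broadcast_to(np.asarray(nl, dtype=float), self.ell.shape)
        self.fsky = fsky

    def get_gaussian_covariance(self):
        """Gaussian covariance. Placeholder: the Knox formula for a single
        auto-spectrum, Var(C_ell) = 2 (C_ell + N_ell)^2 / ((2 ell + 1) fsky)."""
        var = 2.0 * (self.cl_fid + self.nl) ** 2 / ((2 * self.ell + 1) * self.fsky)
        return np.diag(var)

    def get_connected_covariance(self, backend=None):
        """Connected (non-Gaussian) contribution; not sketched here."""
        raise NotImplementedError

    def get_ss_covariance(self, backend=None):
        """Super-sample contribution; not sketched here."""
        raise NotImplementedError

ell = np.arange(2, 500)
calc = CovarianceCalculator(ell, 1.0 / ell**2, 1e-7, fsky=0.4)
cov = calc.get_gaussian_covariance()
```

The `backend` keyword mirrors the proposal's idea of swapping in alternative engines without changing the calling code.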