LSSTDESC / DEHSC_LSS

Scripts and workflow relevant to HSC data

Calibrated N(z) from pdf stacking and z_MC #17

Closed damonge closed 6 years ago

damonge commented 6 years ago

Write code that produces an estimate of N(z) from the stacking of either the pdfs or the values of z_MC of all galaxies in the sample. Ideally this should be implemented in cat_sampler.py (but feel free to disagree). See documentation in README.
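For concreteness, here is a minimal sketch of the two estimators; the array names and shapes are illustrative (pdfs of shape [n_obj, n_z] on a common grid z_grid, z_MC draws in z_mc), not the actual cat_sampler.py interface:

```python
import numpy as np

def nz_from_stacking(pdfs, z_grid):
    """Stack the per-object photo-z pdfs and normalize to unit area."""
    nz = pdfs.sum(axis=0)
    return nz / np.trapz(nz, z_grid)

def nz_from_zmc(z_mc, z_edges):
    """Histogram the z_MC draws on a chosen binning and normalize."""
    nz, _ = np.histogram(z_mc, bins=z_edges, density=True)
    return nz
```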

This work may/could/should involve pdf correction through PIT (probability integral transform).

Tagging @adam-broussard @humnaawan as people that have started pioneering this.

humnaawan commented 6 years ago

Here's my code: get_pdfs.sh gets the PDFs from the server, and match_pdfs.py matches the object IDs with the cleaned catalog and saves the matched pdfs. Currently the pdfs, bins, and ids are being saved independently as fits files; the data is sitting in /global/cscratch1/sd/awan/hsc_pdfs and /global/cscratch1/sd/awan/hsc_matched_pdfs on NERSC. I need to look into how to add multiple columns to the same fits file, possibly by appending to the catalog fits files in /global/cscratch1/sd/damonge/HSC/HSC_processed. @damonge I am thinking that we should keep the PDFs as an independent file rather than appending them to the catalog fits files, but that's just a format preference.

@adam-broussard I briefly looked at your code for matching HSC pdfs. As I understand it, you are not matching against the processed catalog, but against the entries in a .csv catalog (which I assume you generated using an SQL query)? My apologies if I'm mistaken; my coding style is a little different, so I'm slow to parse OO code without documentation.

adam-broussard commented 6 years ago

Hi @humnaawan,

When you say that the catalog is processed, what does that mean? You are correct that hsc_reader pulls a csv file generated by an SQL query, and I can provide the selection criteria if you are interested, but they're mainly things like selecting primaries, rejecting objects with bad pixels, and keeping only objects in the COSMOS field. If you have a different catalog I should be using, I can switch my code around to work with that instead! I apologize for the lack of documentation - I'll probably spend some time tomorrow going back through and commenting everything now that other people are looking at my work!

damonge commented 6 years ago

OK, thanks a lot for starting the conversation. Here's my 2c:

  1. I think it's more convenient to work with downloaded files than having to submit SQL queries every time we need to recompute the N(z), so I would advocate for that (@adam-broussard , please let me know if this is not correct).
  2. Ideally, it'd be great to have an easy way of connecting each object in the clean catalogs with its pdf, without having to match IDs every time. So I wonder if we could have a 2-step process. First, we download all the pdfs for each field. Then, we match pdfs and catalog objects as part of the cleaning process, and store all the information in one final file (or set of files). The cleaning is currently done by process.py, so I wonder if it'd be possible to modify that script to do the matching there?
  3. In terms of storing the clean catalogs and pdfs, I see two possibilities: either we put everything in the same file, or we store the pdfs in a separate file but preserve the same ordering as in the clean catalog files, so that matching objects and pdfs is automatic afterwards. Either way is fine, and the latter has the advantage that we don't have to read in all the pdf data if we don't need it, as @humnaawan advocates. @humnaawan, I think it should be straightforward to write pdfs into FITS tables: you just have to create a Column object whose array argument is a 2D array with dimensions [n_obj, 141] containing all the pdfs to be stored (here n_obj is the total number of objects in the catalog); a short sketch is included at the end of this comment. Let me know if this makes sense (maybe you've already tried this and it's not so easy?).

So the proposal above is just for organizational purposes: it allows us to have a set of clean files with all of the information we need in order to produce the next-level products (maps and then power spectra). This is completely open to discussion, so let me know what you think.
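On the FITS question in point 3, a minimal sketch of writing an [n_obj, 141] pdf array into a binary table with astropy; the column names, array contents, and file name are placeholders:

```python
import numpy as np
from astropy.io import fits

n_obj, n_z = 1000, 141
pdfs = np.random.rand(n_obj, n_z)          # placeholder pdf array
obj_id = np.arange(n_obj, dtype=np.int64)  # placeholder object IDs

cols = fits.ColDefs([
    fits.Column(name='object_id', format='K', array=obj_id),
    # '141E' stores 141 single-precision floats per row, i.e. one pdf per object
    fits.Column(name='pdf', format='%dE' % n_z, array=pdfs),
])
fits.BinTableHDU.from_columns(cols).writeto('matched_pdfs.fits', overwrite=True)
```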

adam-broussard commented 6 years ago
  1. This may sound silly, but I didn't realize I could directly download the entire catalog in the first place, so I'll check that out and modify my code accordingly. I agree that this will be the best way to move forward.

  2. This is the method my code employs already. It doesn't store the PDFs themselves, but it stores the matching indices between the HSC catalog and each PDF file so that it can quickly pull the relevant data when an object is instantiated. I'll look into piping the output of process.py into my code if that is how we would like to proceed. Running on my laptop, it takes about an hour to generate the cache files for each code run, but once they are generated the data reads in more or less instantaneously.

  3. This sounds more like Humna's approach of storing the cleaned HSC catalog and matching PDFs in fits files, so maybe we can take the outputs of my code and feed them into Humna's code to write everything out nicely to fits files. Keep in mind, though, that the number of redshift bins for the PDFs is not uniform across the different codes (and some of them are much longer than 141 entries).

Humna and I are planning on meeting tomorrow to look over what both of our codes do and it would be nice to be able to find a way to unify them so we are all working from the same base. We may be able to provide more information once we've had a chance to talk then!

humnaawan commented 6 years ago

@adam-broussard I've downloaded the HSC pdfs for four photo-z algorithms (ephor, ephor_ab, demp, frankenz) for the deep_cosmos field (alongside the others: wide_aegis, wide_gama09h, wide_gama15h, wide_hectomap, wide_vvds, wide_wide12h, wide_xmmlss, deep_elaisn1, deep_xmmlss, deep_deep23), and have also matched them with their respective processed catalogs (from /global/cscratch1/sd/damonge/HSC/HSC_processed). I am saving the matched PDFs as well as the bins and object_ids, although since the pdfs are ordered by the object_ids of the processed catalog, the saved object_ids are just for debugging purposes.

Please see the notebook that I just pushed for a way to read the saved fits files. I also check the ordering of the objects in the notebook, and things appear to be working. (My apologies for the weird formatting of the notebook outputs; they look fine when I look at them in JupyterLab .. )

The HSC pdfs (unzipped) are in /global/cscratch1/sd/awan/hsc_pdfs while the matched pdfs are in /global/cscratch1/sd/awan/hsc_matched_pdfs. Contents of both folders should be readable by anyone with lsst affiliation.

Please let me know what you think and/or if there are any problems.

adam-broussard commented 6 years ago

@humnaawan this looks great! Thank you for running those through.

damonge commented 6 years ago

@humnaawan this is great, thanks a lot! A couple of ideas for both you and @adam-broussard:

  1. It should be possible now to start using those pdfs to get N(z)s and compare them with the z_MC histograms, right? For the sake of consistency and self-containedness (is that even a word?) it would be a good idea to implement this in cat_sampler.py. This script currently takes a given processed catalog and produces delta_g maps and N(z)s for a given choice of tomographic binning. Right now only z_MC-based photo-zs are available, and the idea would be to extend this so that pdf stacking becomes another option for getting N(z)s (i.e. the user would choose whether they want to use one or the other); a rough sketch is appended at the end of this comment. Does this make sense? Or would you prefer the maps and the N(z)s to be generated independently?
  2. For the sake of reproducibility it would be nice to a) provide some instructions as to where and how the pdfs were downloaded, and b) implement the matching as part of the cleaning process that generates the "processed" catalogs. This is what process.py does. Would it be too much work to extend it to do the pdf matching too? I'm honestly asking, let me know your thoughts. If this would be too distracting right now, I'd be happy to implement it later on.
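For item 1, a rough sketch of how that user-facing choice could look; the function name, signature, and the assumption of a uniform redshift grid are all hypothetical, not the existing cat_sampler.py interface:

```python
import numpy as np

def bin_nz(z_grid, mask, method='zmc', z_mc=None, pdfs=None):
    """Normalized N(z) for the objects selected by `mask`, using either
    z_MC histograms or pdf stacking (illustrative names, not the real API)."""
    if method == 'zmc':
        # bin edges centred on a uniform z_grid
        dz = z_grid[1] - z_grid[0]
        edges = np.append(z_grid - 0.5 * dz, z_grid[-1] + 0.5 * dz)
        nz, _ = np.histogram(z_mc[mask], bins=edges, density=True)
    elif method == 'stack':
        nz = pdfs[mask].sum(axis=0)
        nz = nz / np.trapz(nz, z_grid)
    else:
        raise ValueError("method must be 'zmc' or 'stack'")
    return nz
```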

adam-broussard commented 6 years ago

@damonge I'll let @humnaawan answer your questions above because those sound like they mostly pertain to her!

@humnaawan and I had discussed this project earlier and we thought it might be a good idea if she generates an N(z) estimate by stacking the Wide field PDFs while I try to come up with a corrected N(z) in the Deep field by applying the PIT corrections I have been doing so far. Eric today noted that there are going to be small discrepancies between the Wide and Deep estimated N(z)'s because the photometric (and therefore photo-z) errors for the Deep catalog are smaller than those of the Wide catalog. This effect may be small, particularly if we're only looking to estimate galaxies in bins of the N(z) rather than looking for N(z) itself, but it's definitely something we'll want to keep in mind.

Does that sound like a reasonable plan? If @humnaawan has any corrections or input on anything I've said here, she can chime in as well!

damonge commented 6 years ago

Sounds like a good plan to me!

humnaawan commented 6 years ago

@damonge my understanding of the N(z) comparison was that we will be comparing four estimates of N(z) to use in our clustering analysis of the WIDE galaxies:

Please correct me if I'm wrong!

Will we be using all four of the estimates in the clustering analysis? If yes, then it'd make sense to include them in the processing pipeline. If, however, we have one clear preference, I am not sure that keeping the option of using the less-good N(z) estimates would be helpful. I am not averse to the idea, though.

For the id-matching, I'd think that once we decide on a specific photo-z algorithm, we can certainly add the id-matching to process.py. Otherwise, we could just keep the matched pdfs for different algorithms in their own files, especially since not all the algorithms have the same binning. Thoughts?

As for the reproducibility, here's an outline of my scripts (sitting here, in my branch of the repo):

For the permissions, I didn't think to write a script, but ran chgrp -R lsst <folder> and chmod -R g+r <folder> on both output folders: /global/cscratch1/sd/awan/hsc_pdfs and /global/cscratch1/sd/awan/hsc_matched_pdfs (which also contains the sbatch outputs). Please let me know if anything is unclear.

damonge commented 6 years ago

@humnaawan @adam-broussard about PITs, what's the point of using the PITs on the deep rather than the wide fields? I thought this only required having some spec-zs, and for that we could use the ones available in the database. Is the point to use the COSMOS 30-band photometry for this? In that case there should be some coordination with @zahragomes to make sure we're all using the same matched catalog, and I would worry about whether PIT corrections work well when using photometric redshift data as the truth. (Sorry if there's been some confusion here, I think I read Adam's message above too quickly.)

In terms of things that should be implemented in the pipeline, I would say that z_MC histograms and pdf stacking are so simple and closely related that we should include both as possible options in cat_sampler.py. On the other hand we could treat the PIT corrections as a validation technique that we will use to check our final N(z)s, and we could keep that separate (at least for the time being). Then, the reweighted COSMOS 30-band is a completely independent method, so it would be best to have it as part of the pipeline too, but we can coordinate that separately as part of issue #18 .

Finally, @humnaawan , thanks for the details about the pdf matching. If that stage requires that level of parallelization etc., I think it's not worth wasting time putting it as part of process.py. And I think it makes sense to have separate files for the pdfs as long as those are easy to associate with the catalog entries, which seems to be the case. So I think all looks good on that end!

Related question: do you know if all the objects in the processed catalogs have a matching pdf? Or are there orphan objects?

adam-broussard commented 6 years ago

@damonge I am using the spec-z's as "true" redshifts for the PIT corrections. My understanding is that galaxies in the Wide field typically do not have spec-z's, and additionally I assumed we would want to compare between the N(z)'s @zahragomes and I find, so I figured Deep was the way to go. For the PIT corrections I don't directly need the 30-band photometry data though!
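For reference, a minimal sketch of the PIT computation under these assumptions (pdfs on a common grid z_grid, matched spec-zs in z_spec; names are illustrative): the PIT of an object is the CDF of its pdf evaluated at its spec-z, and a well-calibrated set of pdfs gives a flat PIT histogram.

```python
import numpy as np

def pit_values(pdfs, z_grid, z_spec):
    """PIT = CDF of each object's photo-z pdf evaluated at its spec-z."""
    pits = np.empty(len(z_spec))
    for i, (pdf, zs) in enumerate(zip(pdfs, z_spec)):
        cdf = np.cumsum(pdf)
        cdf /= cdf[-1]                       # normalize to a proper CDF
        pits[i] = np.interp(zs, z_grid, cdf)
    return pits

# If the pdfs are well calibrated, np.histogram(pit_values(...), bins=20,
# range=(0, 1)) should come out roughly flat; deviations from flatness are
# what the PIT-based correction targets.
```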

humnaawan commented 6 years ago

@damonge during the id-matching, I check for any objects in the processed catalogs that do not have the pdfs. Based on the sbatch outputs, no object is missing a pdf.

I also compared the saved ids with the object ids in the processed catalog (see Out[5] in this notebook). Since the former are only saved if they have a match in the processed catalog, I considered this another test to ensure that every galaxy in our catalog has a pdf.
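As an aside, that orphan check can be expressed compactly with numpy; the arrays below are placeholders, not the actual match_pdfs.py variables:

```python
import numpy as np

# Placeholder ids; in practice these come from the catalog and pdf FITS files
cat_ids = np.array([11, 42, 7, 99])
pdf_ids = np.array([7, 11, 42, 99, 123])

# True for every catalog object that has a pdf; any False entries are orphans
has_pdf = np.isin(cat_ids, pdf_ids)
print("orphan objects:", np.sum(~has_pdf))

# Indices that reorder the pdf rows to the catalog ordering (assumes no orphans)
sort_idx = np.argsort(pdf_ids)
order = sort_idx[np.searchsorted(pdf_ids, cat_ids, sorter=sort_idx)]
```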

damonge commented 6 years ago

@adam-broussard : OK, I see. I think there are some spec-zs in the GAMA fields, but it's probably quite limited. In that case, the plan sounds good to me! You should just make sure to use an HSC deep sample that at least has the same cuts as the ones we use for the WIDE analysis. I prepared a cleaned version of the HSC deep fields for Zahra (already with the same cuts as our default sample) and you can find them in:

/global/cscratch1/sd/damonge/HSC/HSC_processed/DEEP_COSMOS/DEEP_COSMOS_Catalog_i24.50.fits
/global/cscratch1/sd/damonge/HSC/HSC_processed/DEEP_DEEP23/DEEP_DEEP23_Catalog_i24.50.fits
/global/cscratch1/sd/damonge/HSC/HSC_processed/DEEP_ELAISN1/DEEP_ELAISN1_Catalog_i24.50.fits
/global/cscratch1/sd/damonge/HSC/HSC_processed/DEEP_XMMLSS/DEEP_XMMLSS_Catalog_i24.50.fits

Are you happy using those? Will @humnaawan produce the matched pdfs for these? Or has this already happened?

@humnaawan : OK, that's really great! I'm assuming implementing the pdf stacking in cat_sampler.py will not be too challenging in this case, but let me know if you have any questions.

adam-broussard commented 6 years ago

@damonge I had found the first catalog you list there, but not the others, so thanks for giving me the names! I believe @humnaawan has already generated the PDF matches for the deep catalogs, so I should be good to go - I've just been tied down by some back-to-school things the past few days, but I'm hoping to get some work done on this today and have some plots done by the end of the day!

damonge commented 6 years ago

All sounds good. You guys are pros!

humnaawan commented 6 years ago

@adam-broussard thanks for figuring out a nicer format for saving the fits files etc. I have incorporated your restructuring of the data in match_pdfs.py and have updated the matched fits files in /global/cscratch1/sd/awan/hsc_matched_pdfs. The older outputs (the not-so-conveniently formatted ones) are now in /global/cscratch1/sd/awan/hsc_matched_pdfs_old just in case you need them.

damonge commented 6 years ago

Nice!

humnaawan commented 6 years ago

@adam-broussard I've matched the nnpz pdfs for all the fields. The output directory (/global/cscratch1/sd/awan/hsc_matched_pdfs) now has matched pdfs for five types of pdfs (ephor, ephor_ab, demp, frankenz, nnpz) for all the fields.

adam-broussard commented 6 years ago

Sorry that this is so late guys, I was running into some really bizarre errors I was having a lot of trouble with. At this time, I have made N(z) curves for the COSMOS Deep data with and without PIT fitting for the Frankenz and Ephor_ab codes. Nnpz is on the way, but @humnaawan has to wait until cori is back up tonight to make some edits to the files first. Let me know if anyone has any feedback! It looks to me like the PIT fitting does not make a huge change in N(z), but I haven't done galaxy clustering analysis from N(z) before so I'm not sure how big of a change will affect the analysis.

[attached figure: tot_n_z]

damonge commented 6 years ago

Great @adam-broussard ! It'll be good to redo this once we decide the binning scheme (#16 , which @humnaawan is taking care of).

It's interesting that, before PIT, both distributions showed a small bump between z=2 and 3, but that PIT gets rid of it. At the least it'll be good to check the effects of PIT (and of the different photo-z methods) on the final constraints.

I'll leave this issue open until we redo the exercise for the multi-bin case.

adam-broussard commented 6 years ago

We have determined that PIT analysis will likely not be an efficient way to move forward with the PDF stacking, given the low likelihood of it significantly affecting N(z) and our inability to reproduce the PIT plots in the HSC photo-z paper. As a result, we're closing the issue. I have attached the final plots I produced, which show the PIT plots and the poor optimization of the necessary correction.

[attached figures: ephor_ab_pit_newint, frankenz_pit_newint]

damonge commented 6 years ago

OK, just copying here what @adam-broussard posted in the hackathon issue (#21) for consistency.

He's implemented pdf stacking in cat_sampler.py and we've checked that the code runs as it should in the pipeline. Here are a few plots showing that (basically PDF stacking vs. z_MC histogramming):

[attached figures: zmc, stacking]