LSSTDESC / DEHSC_LSS

Scripts and workflow relevant to HSC data

Pipeline cleanup #40

Closed · damonge closed this 5 years ago

damonge commented 5 years ago

Rewrote our initial pipeline as a ceci pipeline and cleaned it up significantly. Also made it possible to download all the necessary data automatically. I have checked that, running the new pipeline with the same parameters, I obtain exactly the same outputs at every stage down to floating-point accuracy, so I am confident that the new code is correct (or at least as correct as the initial one). Before merging into master, however, it would be good for the people who coded up some parts of the pipeline to check the new code briefly.

zahragomes commented 5 years ago

I had a look at your cleaned-up cosmos_weight.py file; everything looks good, I can't see any problems. Also, I saw you used a leafsize of 2*n_neighbours when finding the neighbouring galaxies (see below). Is this some sort of standard practice, or did you just choose the factor of 2? (I'm just curious.)

tree_NN_lookup = spatial.cKDTree(photoz_sample, leafsize=2*self.config['n_neighbors'])
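
For readers following along, here is a minimal runnable sketch of that lookup; the array shapes, values, and the query direction are illustrative placeholders, not the actual cosmos_weight.py code:

    import numpy as np
    from scipy import spatial

    n_neighbors = 20                            # illustrative value
    photoz_sample = np.random.rand(100000, 4)   # placeholder feature vectors
    query_points = np.random.rand(5000, 4)      # placeholder query points

    # Build the tree on the photo-z sample, then find the n_neighbors
    # nearest neighbours of each query point in feature space.
    tree_NN_lookup = spatial.cKDTree(photoz_sample, leafsize=2 * n_neighbors)
    dists, indices = tree_NN_lookup.query(query_points, k=n_neighbors)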

damonge commented 5 years ago

Thanks a lot @zahragomes! Regarding the leafsize: good question. I just saw you were using 20 neighbors and a leafsize of 40, so I assumed that was a rule of thumb. Should I just leave it hardcoded to 40?

zahragomes commented 5 years ago

I don't think it's going to matter here. The leafsize just sets the number of samples below which the algorithm switches to a brute-force search (computing the distance between all pairs) instead of the much more efficient KD-tree traversal, which cuts down the number of distances that need to be calculated. At small numbers of samples (galaxies), brute force can be more efficient than the KD-tree, so it makes sense to switch, but in our case photoz_sample will always be much too big for this to matter. So as long as we don't start using very large n_neighbours, 2*n_neighbours should be fine.
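
A quick toy check of that point (illustrative only, not pipeline code): leafsize only changes the tree's internal bucketing, so queries return the same neighbours for any value and only the timing differs:

    import numpy as np
    from scipy import spatial

    pts = np.random.rand(10000, 4)   # random feature vectors
    q = np.random.rand(100, 4)       # random query points

    d40, i40 = spatial.cKDTree(pts, leafsize=40).query(q, k=20)
    d4, i4 = spatial.cKDTree(pts, leafsize=4).query(q, k=20)

    # Identical neighbours and distances regardless of leafsize.
    assert np.array_equal(i40, i4) and np.allclose(d40, d4)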

damonge commented 5 years ago

OK, great. I'll revert it to 40 for safety (just in case at some point we do explore large n_neighbours, although I doubt it).

@humnaawan , @adam-broussard : let me know your thoughts when you've had a chance to look at the code!

adam-broussard commented 5 years ago

@damonge I think your code looks good, with one potential caveat. I used np.trapz() when I was summing up the PDF, and I noticed that you are just using sum(). Doesn't that create an issue where the answer could depend on the bin size?

damonge commented 5 years ago

@adam-broussard thanks a lot!

I think that might be the case, depending on the type of interpolation one uses. However, I plotted my N(z)s against yours and saw that they were essentially equivalent (O(1e-3) differences). Using sum() may be marginally less accurate, but it's a bit faster. Let me know if this concerns you.
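
To make the comparison concrete, here is a toy check (illustrative only, not the pipeline code): on a uniform grid, np.trapz and a plain sum differ only through the half-weights at the two endpoints, so for PDFs that vanish at the grid edges the normalised N(z)s come out nearly identical:

    import numpy as np

    z = np.linspace(0.0, 3.0, 301)               # uniform grid, dz = 0.01
    dz = z[1] - z[0]
    pdf = np.exp(-0.5 * ((z - 1.0) / 0.3) ** 2)  # toy photo-z PDF

    norm_sum = pdf / (pdf.sum() * dz)            # sum-based normalisation
    norm_trapz = pdf / np.trapz(pdf, z)          # trapezoid-based normalisation
    print(np.max(np.abs(norm_sum - norm_trapz)))  # tiny for this toy case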

adam-broussard commented 5 years ago

@damonge Okay, I remember mine took a decent chunk of time to run, so yours is probably the better method. Looks good!

humnaawan commented 5 years ago

@damonge I think the new code looks good. My apologies for the delay in getting back to you.

damonge commented 5 years ago

Awesome. Thanks a lot @humnaawan and @adam-broussard!