Halo-centric pair counter

aphearin commented 8 years ago

When performing splashback-type calculations, or cluster velocity profiles, it is natural to attempt to count pairs with respect to halo-centric distances, as opposed to fixed absolute distances. In particular, one might want to count pairs of points with respect to r/Rhalo, as opposed to just r. Currently, there is no halotools feature that permits such counting. It seems natural to include this feature as an additional wfunc option in our formalism for calculating marked correlation functions, don't you think @duncandc?

CC @surhudm, @BenediktDiemer, and @a-kravtsov, who might find such ability useful in their work, and may have suggestions for alternative strategies that they have already implemented.

benediktdiemer commented 8 years ago

Interesting idea, I'm trying to figure out how this functionality would be used in practice. When looking at density profiles or the splashback radius, we take a particular sample of halos / galaxies / clusters and measure the DM / subhalo / member galaxy profile around them. So not quite the halo-halo or galaxy-galaxy correlation function, but either the halo-matter, central halo-subhalo, or BCG-member galaxy correlation functions, if you will.

How would these more specific quantities be expressed in the language of marked correlation functions?

manodeep commented 8 years ago

@aphearin Why not just return the pair indices + the separation for all pairs within certain rmax? Clearly, memory requirements will be much larger, but that can be improved by implementing the functionality in an iterator.

This way the user can impose whatever function they want to evaluate. Plus, the actual rmax used could then be refined to be a function of some pair property.

aphearin commented 8 years ago

@manodeep - excellent suggestion. I'd been meaning to implement such an iterator for a while: the direct analogy of the group_member_generator I have already implemented. I just didn't make the connection that this would apply to such a calculation, but it clearly does. Using a python generator won't be as fast as doing the calculation fully in cython, but for this particular use-case it will never be done inside an MCMC so the vastly improved flexibility of a generator seems worth it. Great tip!

surhudm commented 8 years ago

@aphearin I have not checked the halotools pair counter engine for quite some time now, but can you remind me exactly how you proceed with the paircounts for halo-matter cross-correlations, in particular, do you count particles around halos, or halos around particles? If it is the former, it should be fairly easy to deal with, as all you have to do is change the actual distance Rmax within which pairs are returned, where Rmax will be a function of the halo. For the latter case, you will have to find out the maximum virial radius of the halos under consideration.

On Tue, Apr 19, 2016 at 7:39 AM, Andrew Hearin notifications@github.com wrote:

@manodeep https://github.com/manodeep - excellent suggestion. I'd been meaning to implement such an iterator for a while: the direct analogy of the group_member_generator I have already implemented. I just didn't make the connection that this would apply to such a calculation, but it clearly does. Using a python generator won't be as fast as doing the calculation fully in cython, but for this particular use-case it will never be done inside an MCMC so the vastly improved flexibility of a generator seems worth it. Great tip!

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/astropy/halotools/issues/472#issuecomment-211613743

manodeep commented 8 years ago

@aphearin You should still be able to use Cython. Imagine an yield within a for loop over the total number of cells, with the actual pair indices, for that cell and all of its neighbouring cells, being computed in Cython. Seems do-able..

manodeep commented 8 years ago

@surhudm The problem is that would require re-gridding the entire particle set for each halo. It's much better to use the largest rmax you might possibly want, say 10 Mpc, and then return all pairs. That list can trivially be pruned to the appropriate halo-dependent rmax.

surhudm commented 8 years ago

@manodeep I do not understand why this requires regridding. I think @aphearin already moved away from using cell sizes which depend upon Rmax.

On Tue, Apr 19, 2016 at 10:01 AM, Manodeep Sinha notifications@github.com wrote:

@surhudm https://github.com/surhudm The problem is that would require re-gridding the entire particle set for each halo. It's much better to use the largest rmax you might possibly want, say 10 Mpc, and then return all pairs. That list can trivially be pruned to the appropriate halo-dependent rmax.

— You are receiving this because you were mentioned. Reply to this email directly or view it on GitHub https://github.com/astropy/halotools/issues/472#issuecomment-211661245

manodeep commented 8 years ago

@surhudm Hmm I don't think that can be true -- the cell-size must depend on Rmax (otherwise, you would end up with a tree structure/cell hierarchy). I think @aphearin is splitting up cells more finely than Rmax. The entire efficiency of the code comes from the part that the particles are already gridded up on Rmax and the particles sorted according to their cell index. @aphearin how does halotools do the gridding?

aphearin commented 8 years ago

@surhudm I now allow the cell size to be only loosely connected to rmax, but I certainly do exploit knowledge of rmax when building the tree - this exploitation is the whole reason the algorithm developed by @manodeep is so fast. What I do is require that the cells into which sample1 are divided are at least as large as rmax, and furthermore that this cell size evenly divides Lbox (at least thrice). Sample2 cell sizes must evenly divide Sample1 cell sizes, but can in principle be arbitrarily small.

aphearin commented 8 years ago

@BenediktDiemer - I was using the term "marked correlation function" loosely, as Halotools implements a far more general version of this term than is used in the literature. The way our formalism works is that pair-counting proceeds as normal, but you can pass in an arbitrary mark or set of marks per point, and you choose a particular "marking function", aka wfunc. When a pair of points is found using the normal methods, rather than update the histogram with unity, instead the histogram is updated with the value of the marking function when it operates on the mark(s) of that pair. The advantage of this formalism is that the same optimized engine can be reused over and over again. For example, when I calculate the radial velocity dispersion of clusters, I do it using this generalized marking formalism, as shown here http://halotools.readthedocs.org/en/latest/quickstart_and_tutorials/tutorials/catalog_analysis/galcat_analysis/basic_examples/velocity_statistics/galaxy_catalog_analysis_tutorial7.html

In principle, this same formalism could work to count pairs in bins of r/Rhalo. The "mark" that would be passed in would be the value of Rhalo for sample1, and the histogram would only update based on the value of r/Rhalo. This is slightly different than what is currently implemented, because currently the marking functions do not have any dependence on r, only on the input marks. But the generalization would not be so hard. I can imagine doing it either way: using marks or using the iterator scheme suggested by @manodeep . The marking formalism will most certainly be faster, since the entire thing can be done in a pure cython layer with no recourse to yield. But the generator method is useful for a large variety of other things, and I intend to implement it anyway.

astropy / halotools

Halo-centric pair counter #472