bccp / nbodykit

Analysis kit for large-scale structure datasets, the massively parallel way
http://nbodykit.rtfd.io
GNU General Public License v3.0

domain decomposition on (ra, dec, z) surveys leads to unbalanced loads #364

Closed: nickhand closed this issue 7 years ago

nickhand commented 7 years ago

When you convert (ra, dec, z) to Cartesian coordinates, the particles aren't uniformly distributed in the resulting box. The consequence is that when you do the normal grid domain decomposition in SurveyDataPairCount, many ranks get zero or very few particles, the load is unbalanced, and a few ranks end up responsible for most of the work.

@rainwoodman do you see a potential fix for this, or maybe we just want to remove the domain decomposition?
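To make the imbalance concrete, here is a minimal numpy sketch (not the nbodykit API -- `comoving_distance` below is a crude linear stand-in for a real cosmology call): convert a mock (ra, dec, z) catalog to Cartesian and count particles on a coarse grid over the bounding box.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 100000
ra  = rng.uniform(130., 230., N)   # degrees; a patch of sky
dec = rng.uniform(-5., 45., N)     # degrees
z   = rng.uniform(0.4, 0.7, N)

def comoving_distance(z):
    # crude linear stand-in for a real cosmology call (Mpc/h), only to keep
    # the example self-contained
    return 3000. * z

r = comoving_distance(z)
ra_r, dec_r = np.deg2rad(ra), np.deg2rad(dec)
pos = np.column_stack([r * np.cos(dec_r) * np.cos(ra_r),
                       r * np.cos(dec_r) * np.sin(ra_r),
                       r * np.sin(dec_r)])

# count particles per cell of a coarse 4x4x4 "domain grid" over the bounding
# box: most cells are empty, a few hold nearly everything
lo, hi = pos.min(axis=0), pos.max(axis=0)
counts, _ = np.histogramdd(pos, bins=4, range=list(zip(lo, hi)))
print("empty cells:", int((counts == 0).sum()), "of", counts.size)
print("max / mean occupancy: %.1f" % (counts.max() / counts.mean()))
```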

rainwoodman commented 7 years ago

Does domain decomposition make it slower? I prefer keeping the code formally parallel, since there is non-trivial thought that goes into it.

We can come up with finer domain decomposition schemes. For example, we can reassign ranks to spatial regions differently and skip empty regions.
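Roughly the kind of thing I have in mind, as a toy numpy sketch (illustrative only, not the pmesh API): over-decompose space into many more cells than ranks, count particles per cell, and assign cells greedily so that empty regions cost nothing.

```python
import numpy as np

def assign_cells_to_ranks(cell_counts, nranks):
    """Greedy balanced assignment: visit the heaviest cells first and give
    each one to the currently lightest rank; empty cells add no load."""
    order = np.argsort(cell_counts)[::-1]
    loads = np.zeros(nranks, dtype=np.int64)
    owner = np.empty(len(cell_counts), dtype=np.int32)
    for cell in order:
        rank = int(np.argmin(loads))
        owner[cell] = rank
        loads[rank] += cell_counts[cell]
    return owner, loads

# e.g. an 8x8x8 cell grid shared among 16 ranks, with many empty cells as in
# a survey bounding box
counts = np.random.default_rng(0).poisson(5, size=8**3)
counts[::3] = 0
owner, loads = assign_cells_to_ranks(counts, 16)
print("per-rank load: min=%d max=%d" % (loads.min(), loads.max()))
```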

nickhand commented 7 years ago

Yes, I agree that's probably what we should do. That functionality doesn't exist, right? I am not that familiar with all of the features of the GridND object, etc.

And yep, the code is significantly slower in this case because the distribution of particles across ranks is quite bad. In the pair count algorithm for (ra, dec, z) input, we decompose the position array that's already on the rank (with smoothing=0) and then correlate against the decomposed position (smoothed by r-max). The problem is that the number density is far from constant in the Cartesian box, because of the (ra, dec, z) --> (x, y, z) transformation, so you end up with a lot of ranks doing very little (or no) work while one or two ranks do most of it.

I think we just need to ensure that the number of objects on each rank after that first decomposition (smoothing=0) is still relatively even.
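For reference, a small diagnostic along these lines (plain mpi4py; `pos_local` stands in for whatever each rank holds after the smoothing=0 exchange):

```python
import numpy as np
from mpi4py import MPI

def report_balance(pos_local, comm=MPI.COMM_WORLD):
    """Print how uneven the per-rank particle counts are after an exchange."""
    counts = np.array(comm.allgather(len(pos_local)))
    if comm.rank == 0:
        mean = max(counts.mean(), 1.0)
        print("per-rank counts: min=%d max=%d mean=%.1f (max/mean=%.1f)"
              % (counts.min(), counts.max(), counts.mean(),
                 counts.max() / mean))
```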

nickhand commented 7 years ago

Initial tests seem to show that rainwoodman/pmesh#26 fixes this, although you do need to over-decompose the domain grid in order for the load balance to do anything.

Some implementation questions for @rainwoodman. In the past, I think the design we have been using is:

  1. domain decompose pos1 and exchange with smoothing R=0
  2. domain decompose pos2 and exchange with smoothing R=max(bins)
  3. cross-correlate pos1 with pos2

Is step 1 necessary? The pos1 array is already evenly distributed before step 1, and may not be after the exchange. I guess I am wondering why any particles are exchanged at all when the smoothing = 0?
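For reference, the current pattern (steps 1-3 above) looks roughly like this in code (using pmesh's GridND domain API as I understand it; argument names are approximate, and the random catalogs and box are stand-ins):

```python
import numpy as np
from mpi4py import MPI
from pmesh.domain import GridND

comm = MPI.COMM_WORLD
BoxSize = np.array([1000., 1000., 1000.])
rmax = 150.0                                  # largest separation bin

# stand-in catalogs: each rank starts with an even but spatially loose chunk
rng = np.random.default_rng(comm.rank)
pos1 = rng.uniform(0, BoxSize, size=(10000, 3))
pos2 = rng.uniform(0, BoxSize, size=(10000, 3))

# one domain cell per rank along the z axis (the usual grid decomposition)
edges = [np.linspace(0, BoxSize[i], n + 1)
         for i, n in enumerate([1, 1, comm.size])]
domain = GridND(edges, comm=comm)

# step 1: exchange pos1 with smoothing=0, so each rank holds exactly the
# particles inside its own subvolume
pos1_local = domain.decompose(pos1, smoothing=0).exchange(pos1)

# step 2: exchange pos2 with smoothing=rmax, so each rank also receives the
# ghost particles within rmax of its subvolume boundary
pos2_local = domain.decompose(pos2, smoothing=rmax).exchange(pos2)

# step 3: cross-correlate pos1_local against pos2_local locally; the rmax
# padding ensures no pair across a domain boundary is missed
```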

rainwoodman commented 7 years ago

It is evenly distributed, but it is not spatially tight -- so it is impossible to ensure that the volume decomposed in step 2 encloses all of the particles residing on the current process. If a rank's pos1 particles are scattered over the whole box, no subvolume padded by rmax can be guaranteed to contain all of their neighbors.

rainwoodman commented 7 years ago

So I think you'll have to do the load balancing with step 1, then reuse the domain assignment from step 1 for step 2.
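In pseudo-Python, something like this (the `rebalance` and `local_paircount` callables are placeholders for the pmesh load-balancing from rainwoodman/pmesh#26 and for the local pair counting, not actual pmesh calls):

```python
def paircount_survey(domain, pos1, pos2, rmax, rebalance, local_paircount):
    """Sketch of the suggested flow; `rebalance` and `local_paircount` are
    passed in as placeholders rather than invented here."""
    # load-balance the cells -> ranks assignment using pos1 (smoothing=0) ...
    domain = rebalance(domain, pos1)
    # ... and reuse that same assignment for both exchanges
    pos1_local = domain.decompose(pos1, smoothing=0).exchange(pos1)
    pos2_local = domain.decompose(pos2, smoothing=rmax).exchange(pos2)
    return local_paircount(pos1_local, pos2_local, rmax)
```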