MDAnalysis / cellgrid


Current distance search API leads to subpar performance. #11

Open kain88-de opened 6 years ago

kain88-de commented 6 years ago

The current example suggests that the distance API accepts a maxdist keyword. In fact, this keyword is used as the cell size of the cellgrid, which determines the number of bins per dimension. This can lead to very poor performance for small cutoffs.

For example, take a box 1000 nm in size: a maxdist of 1 nm then results in a cellgrid with 1000^3 cells. Generating this grid and iterating over it takes up to a minute. If we instead set the bin size of the grid to 100 nm and apply the maxdist cutoff after calculating all distances within a cell, the cellgrid distance search is blazingly fast.
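Here is a rough, self-contained sketch of what I mean. This is not the cellgrid API; the numbers and the per-cell brute force are only for illustration, and a real implementation would also visit the 26 neighbouring cells and handle periodicity:

```python
import numpy as np

# Sketch of the idea, not the cellgrid API: bin particles into a coarse grid,
# compute distances per cell, then filter by maxdist afterwards.  A real
# implementation also visits the neighbouring cells and handles periodicity;
# this only illustrates the scaling of the cell count.

rng = np.random.default_rng(0)
box = 1000.0          # box edge in nm
maxdist = 1.0         # cutoff we actually care about
cellsize = 100.0      # coarse grid spacing, chosen independently of maxdist
coords = rng.uniform(0.0, box, size=(20000, 3))

n_per_dim = int(box // cellsize)
print("cells with binsize = cellsize:", n_per_dim ** 3)            # 1,000
print("cells with binsize = maxdist :", int(box // maxdist) ** 3)  # 1,000,000,000

# assign every particle to a cell of the coarse grid
cell_idx = np.floor(coords / cellsize).astype(int)
flat = np.ravel_multi_index(cell_idx.T, (n_per_dim,) * 3)

pairs = []
for cell in np.unique(flat):                      # short loop: ~1,000 cells
    members = np.where(flat == cell)[0]
    if len(members) < 2:
        continue
    sub = coords[members]
    d = np.linalg.norm(sub[:, None, :] - sub[None, :, :], axis=-1)
    i, j = np.triu_indices(len(members), k=1)
    keep = d[i, j] <= maxdist                     # apply maxdist afterwards
    pairs.extend(zip(members[i[keep]], members[j[keep]]))

print("same-cell pairs within maxdist:", len(pairs))
```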

See #9 for an implementation of such an API

richardjgowers commented 6 years ago

Hey Max, yeah I think you're right that if you push it you'll spend longer building and using the data structures than you'll save. I don't really understand the changes you've made in #9, what is `a`?

I think your example is a little unfair - for real benchmarks we'll have to determine some sort of representative particle density (or a range), then look at different typical search sizes, e.g. 2 A for bonds, 5 A for hbonds, maybe 15 A for rdfs, and see where the performance sweet spots are.
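Something along these lines could be a starting point for that kind of benchmark; the densities, cutoffs, and the `pair_search` placeholder are illustrative assumptions, not measured values or the real API:

```python
import itertools
import time
import numpy as np

# Sketch of a benchmark harness for the comparison described above.
# `pair_search` stands in for whatever distance-search function we end up
# with; the densities and cutoffs are illustrative, not measured values.

def benchmark(pair_search, n_atoms=50000, densities=(0.05, 0.1)):
    cutoffs = [2.0, 5.0, 15.0]          # bonds, hbonds, rdfs (Angstrom)
    rng = np.random.default_rng(42)
    for rho, cutoff in itertools.product(densities, cutoffs):
        box = (n_atoms / rho) ** (1.0 / 3.0)        # atoms/A^3 -> box edge
        coords = rng.uniform(0.0, box, size=(n_atoms, 3))
        t0 = time.perf_counter()
        pair_search(coords, box, cutoff)
        dt = time.perf_counter() - t0
        print(f"rho={rho:5.2f}/A^3  cutoff={cutoff:4.1f} A  "
              f"box={box:7.1f} A  time={dt:.3f} s")
```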

kain88-de commented 6 years ago

Performance test at the end of the gist

My proposed algorithm in #9 matches the brute-force method implemented in MDAnalysis (except for the diagonal, which is 0 and not stored in a sparse matrix). My version runs in milliseconds; the current implementation takes minutes!

The reason for this has nothing to do with the particle density. We both have the `for cell in grid` loop, and the length of that loop determines the performance of the algorithm! In my case I set the grid size to 5x5x5, totaling 125 cells. The current implementation has a grid size of 100x100x100, totaling 1 million cells! This is the same number of loop evaluations as the brute-force method! This will be slow in Python independent of the density of points.
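A back-of-the-envelope illustration of why the loop length alone matters; the per-cell overhead figure is just an assumed order of magnitude, not a measurement:

```python
# Back-of-the-envelope: even a modest fixed per-cell overhead (slicing,
# neighbour lookup, one numpy call) adds up once the grid has a million cells.
per_cell_overhead = 50e-6   # assumed ~50 us of Python/numpy bookkeeping per cell

for n_per_dim in (5, 100):
    n_cells = n_per_dim ** 3
    print(f"{n_per_dim}^3 grid: {n_cells:>9,} cells "
          f"-> ~{n_cells * per_cell_overhead:6.2f} s of loop overhead alone")
```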

Taking maxdist as the bin size for the grid is not a good choice. I needed to calculate contact maps today; my system is hundreds of angstroms in size, while I want to know which atoms are closer than 1 angstrom. The particles aren't distributed uniformly in the box, but they do occupy a large part of the volume.

> I don't really understand the changes you've made in #9, what is `a`?

`a` is example data for a particularly difficult case.

richardjgowers commented 6 years ago

The outermost loop over cells isn't optimised yet (I was checking whether the algorithm was even promising), so it's not surprising that it's slow. I have also found that there is a maximum in performance, which sometimes means increasing the cell size.

Taking maxdist as the bin size is the most aggressive approach for minimising the number of distance evaluations, but obviously there will be some sort of compromise between that and the cost of iterating over the extra cells, which scales cubically. Once we've optimised the loop over cells, we'll know the relative performance of looping over cells versus evaluating distances (i.e. the approximate cost of one cycle of each). With this we'll be able to build an inbuilt heuristic for choosing a bin size based on the requested calculation.

I brought up particle density because the number of distance calculations saved by binning is a function of it. Denser systems will benefit more from the domain decomposition. If we assume a homogeneous system, we can make a good estimate of how many distance calculations we remove by adding more cell loops (and link this to the relative performance estimates to minimise the overall cost).
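Purely as a sketch of what such a heuristic could look like for a homogeneous system; the per-cell and per-distance costs below are made-up placeholders that would have to come from calibrating the optimised loops:

```python
import numpy as np

# Sketch of an automatic bin-size heuristic for a homogeneous system.
# c_cell and c_dist are placeholder costs (seconds per cell iteration and
# per distance evaluation) that would come from calibrating the real loops.

def estimate_cost(n_atoms, box, cellsize, c_cell=50e-6, c_dist=5e-9):
    n_cells = (box / cellsize) ** 3
    per_cell = n_atoms / n_cells                   # mean atoms per cell
    # each atom is compared against its own cell plus the 26 neighbours
    n_dist = n_atoms * 27 * per_cell / 2.0
    return n_cells * c_cell + n_dist * c_dist

def pick_cellsize(n_atoms, box, cutoff):
    # cells must not be smaller than the cutoff, or pairs would be missed
    candidates = np.linspace(cutoff, box / 2.0, 200)
    costs = [estimate_cost(n_atoms, box, s) for s in candidates]
    return candidates[int(np.argmin(costs))]

print(pick_cellsize(n_atoms=60000, box=500.0, cutoff=1.0))
```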

I don't think it's an API problem - ultimately it should just work as "give me all distances up to X". It shouldn't require a user to know all of the above in order to choose a bin size themselves.

So tl;dr, the issue you've raised here is actually that the package isn't finished and that there probably needs to be some sort of automatic performance tuning based on the distance search requested.

kain88-de commented 6 years ago

The package already works pretty well for me. I have started using it in some analysis code to calculate distance matrices for 60,000-atom systems that are relatively evenly distributed. The small adjustment to the API of adding an optional `gridsize=10` parameter to tune the grid gives me pretty good performance (10 s for 60k atoms). Also, because the performance depends on the particle density, we should expose this optional parameter so the user can tune the algorithm if they really want or need to.

I agree that for the default case some heuristic should choose a good value. But I'm still for exposing this as an optional variable. I now have to calculate contact maps a lot, and the option to squeeze the last bit of performance out of the algorithm is really beneficial.
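For illustration, the kind of call shape I have in mind; the function name, the `gridsize` keyword, and the stub body are placeholders, not the actual cellgrid interface:

```python
import numpy as np

def capped_self_distance_array(coords, maxdist, box, gridsize=None):
    """Return (pairs, distances) for all pairs closer than maxdist.

    Placeholder stub to show the call shape only: if gridsize is None a
    heuristic picks the grid, otherwise the caller's value is used directly.
    """
    if gridsize is None:
        gridsize = 10            # stand-in for the heuristic discussed above
    return np.empty((0, 2), dtype=int), np.empty(0)

coords = np.random.default_rng(1).uniform(0.0, 500.0, size=(60000, 3))

# default: let the heuristic choose the grid
pairs, dists = capped_self_distance_array(coords, maxdist=1.0, box=500.0)

# power user: override the heuristic and tune the grid by hand (10x10x10 cells)
pairs, dists = capped_self_distance_array(coords, maxdist=1.0, box=500.0,
                                          gridsize=10)
```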