cellgrid_self_distance_array doesn't calculate all distances

kain88-de commented 6 years ago

I wrote a wrapper around cellgrid to use sparse matrices for calculations of distance arrays. I only need self-distances and orientated myself using the distance functions provided already. The following version works and returns the same result as one would expect from the MDAnalysis functions cropped manually with np.where(dist < maxdist). Notice how in the second loop I need to use cell.all_neighbors instead of cell.half_neighbors as the current distance functions use.

def distances(a, maxdist, box):
    # estimate binsize of cellgrid for optimal performance
    binsize = .25 * (a.max() - a.min())
    cg = cellgrid.CellGrid(box, binsize, a)
    N = len(a)
    out = sparse.lil_matrix((N, N))

    for cell in cg:
        # check for particles in cell
        if len(cell) > 1:
            d = distance_array(cell.coordinates,
                               cell.coordinates,
                               box)
            # overwrite self distance
            d += 2 * maxdist * np.eye(len(d))
            i, j = np.where(d < maxdist)
            out[cell.indices[i], cell.indices[j]] = d[(i, j)]
        for addr in cell.all_neighbours:
            other = cg[addr]
            if not other:
                continue
            d = distance_array(cell.coordinates,
                               other.coordinates,
                               box)
            i, j = np.where(d < maxdist)
            out[cell.indices[i], other.indices[j]] = d[(i, j)]
    return out.tocoo()

I used the following code as a test

box = np.ones(3).astype(np.float32) * 100
a = np.random.uniform(low=0, high=box[0], size=(1000, 3))
a = np.array(a, order='F', dtype=np.float32)
maxdist = 1

d = distances(a, maxdist, box)
d.nnz

check = distance_array(a, a, box)
check += np.eye(len(a)) * 10
row, col = np.where(check < maxdist)

np.testing.assert_equal(row, d.row)
np.testing.assert_equal(col, d.col)
np.testing.assert_almost_equal(check[(col, row)], d.data)

kain88-de commented 6 years ago

This algorithm takes up 20MB of space in the out matrix when applied to a coordinate set of 60k atoms, for a cutoff of 4 Angström.

richardjgowers commented 6 years ago

@kain88-de I thought that if you do all_neighbours you'll end up with double counting, ie d[i, j] == d[j, i] ?

kain88-de commented 6 years ago

Yes, this is correct. But the all_neighbours option ends up counting all distances correctly. The other options doesn't

kain88-de commented 6 years ago

See the first part of the gist

richardjgowers commented 6 years ago

@kain88-de I think you're counting the number of distances wrong. In your gist, (len(row) - len(a)) // 2 will return the correct number of distances. You take off len(a) to remove self distances (0) and //2 to stop the double counting

kain88-de commented 6 years ago

How would this solve that len(row) is lower than the number of distances found by the cellgrid library? You can also see in the gist that distances larger than 10 A.U are returned, 10 was the maximum. According to the docs of the function only distances up to max_dist will be returned. This might be a separate issue though.

kain88-de commented 6 years ago

you are right the half_neighbor search finds all distances.

ayushsuhane commented 6 years ago

Hi @kain88-de , I am trying to understand the distance function you wrote, I had few questions about it. 1) Why have you used

d += 2 * maxdist * np.eye(len(d))

As far as I understand,

 for cell in cg:
        # check for particles in cell
        if len(cell) > 1:
            d = distance_array(cell.coordinates,
                               cell.coordinates,
                               box)

will give the distance between the coordinates within the cell, while distance of every atom with itself should be zero. However, adding an identity matrix to the initial d will correspond to self distance between particles as 2*maxdistance. 2) Is the Cellgrid by itself limited to monoclinic systems. I understand the object cell is a cubical subset of the box. Is it correct?

kain88-de commented 6 years ago

mistake of mine in the first iteration. I wanted to ensure that the diagonal doesn't pop up in the sparse matrix. It doesn't because the value is 0 anyway.
The grid can also be used for triclinic boxes. The goal is to have a coarse-grained mapping of all coordinates, this is independent of the periodicity. The grid is such a coarse-grained mapping that can be cheaply queried to know which atom pairs are close at all.

ayushsuhane commented 6 years ago

Thanks for the reply @kain88-de . I was wondering whether the idea is to always initialize the box with a larger dimensions than the original simulation box. Does the box size chooses xmin, ymin, zmin and xmax, ymax, zmax as the box size irrespective of the unit cell of the system?

kain88-de commented 6 years ago

The xmin/xmax don't match with the box numbers because the unit vectors of the box are tilted. You will have to calculate yourself a rectangular box shape the fully includes the triclinic cell.

MDAnalysis / cellgrid

cellgrid_self_distance_array doesn't calculate all distances #9