facebookresearch / pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python!
Other
916 stars 145 forks source link

Implementing jaccard distance #29

Open ljmartin opened 4 years ago

ljmartin commented 4 years ago

Doesn't look like this is maintained, but you can add the below to matrix_distance.py to implement jaccard distance. I've tested it and it works for my own dataset but please don't use without a sanity check first! i.e. test a couple of points and compare to brute force similarity search to make sure the distances align.

usage:

from pysparnn.matrix_distance import JaccardDistance
mySparseMatrix = sparse.csr_matrix(blah)
myIndices = np.arange(mySparseMatrix.shape[0])

cp = ci.MultiClusterIndex(mySparseMatrix, myIndices, distance_type=JaccardDistance)
cp.search(mySparseMatrix[:10], k=5, return_distance=True)

to add to matrix_distance.py:

class JaccardDistance(MatrixMetricSearch):
    """A matrix that implements jaccard distance search against it.

    jaccard_distance = 1 - jaccard_similarity

    Note: We want items that are more similar to be closer to zero so we are
    going to instead return 1 - jaccard_similarity. We do this so similarity
    and distance metrics can be treated the same way.
    """

    def __init__(self, features, records_data):
        super(JaccardDistance, self).__init__(features, records_data)

    @staticmethod
    def features_to_matrix(features):
        """
        Args:
            val: A list of features to be formatted.
        Returns:
            The transformed matrix.
        """
        return _sparse.csr_matrix(features)

    @staticmethod
    def vstack(matrix_list):
        """
        Args:
            val: A list of features to be formatted.
        Returns:
            The transformed matrix.
        """
        return _sparse.vstack(matrix_list)

    def _transform_value(self, v):
        return v

    def _distance(self, a_matrix):
        """Vectorised sparse jaccard distance"""
        X = _sparse.csr_matrix(a_matrix).astype(int)
        Y = _sparse.csr_matrix(self.matrix).astype(int)
        intersect = X.dot(Y.T)

        x_sum = X.sum(axis=1).A1
        y_sum = Y.sum(axis=1).A1
        xx, yy = _np.meshgrid(x_sum, y_sum)
        union = ((xx + yy).T - intersect)

        return (1 - intersect / union).A