facebookresearch / pysparnn

Approximate Nearest Neighbor Search for Sparse Data in Python!
918 stars · 145 forks

Callable metric function for custom distances #11

Closed tejasshah93 closed 7 years ago

tejasshah93 commented 7 years ago

Hi,

PySparNN looks great! I'd like to know if there is a provision for using custom distances as a metric while searching for nearest neighbors.

E.g. For sklearn.neighbors.NearestNeighbors, the documentation mentions:

metric : string or callable If metric is a callable function, it is called on each pair of instances (rows) and the resulting value recorded. The callable should take two arrays as input and return one value indicating the distance between them.

Do we have such a provision for custom distances in PySparNN? Thanks.

spencebeecher commented 7 years ago

Hi @tejasshah93 - Great question!

There isn't anything like that available out of the box. However, it is pretty easy to implement your own custom distance metric. By default, pysparnn uses sparse matrices + cosine similarity. This is where the library really shines: I am not aware of many Python-based libraries that do sparse search accurately and efficiently (https://github.com/ekzhu/datasketch and https://github.com/src-d/minhashcuda both look good).

However, here is an example (already in the code) of using an alternative distance: https://github.com/facebookresearch/pysparnn/blob/master/examples/dense_matrix.ipynb

Here is an example (that should mostly work) that will let you use custom distances. scipy.spatial.distance.cdist provides a variety of distance types. Caution - I have found these distance functions to be much slower than what is currently in pysparnn. This is probably reasonable for testing but writing customized & efficient code is probably the best way forward. More examples of custom distances in pysparnn can be found here: https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py

import numpy as np
import scipy.sparse
import scipy.spatial.distance

from pysparnn.matrix_distance import MatrixMetricSearch

class SlowEuclideanDistance(MatrixMetricSearch):
    def __init__(self, records_features, records_data):
        super(SlowEuclideanDistance, self).__init__(records_features,
                                                    records_data,
                                                    is_similarity=False)
        # cdist operates on dense arrays
        self.matrix = self.matrix.toarray()

    @staticmethod
    def _transform_value(v):
        return v

    @staticmethod
    def features_to_matrix(features):
        """
        Args:
            features: A list of features to be formatted.
        Returns:
            The transformed matrix.
        """
        return np.array(features, ndmin=2)

    @staticmethod
    def vstack(matrix_list):
        """
        Args:
            matrix_list: A list of matrices to be stacked.
        Returns:
            The stacked matrix.
        """
        return np.vstack(matrix_list)

    def _distance(self, a_matrix):
        """Euclidean distance"""
        # need to handle flipping argmin k to positive
        return scipy.sparse.csr_matrix(scipy.spatial.distance.cdist(
                a_matrix.toarray(),
                self.matrix, 'euclidean'))
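As an aside, cdist itself also accepts a callable metric, so a quick way to prototype a custom distance (before writing anything vectorized) is:

```python
import numpy as np
from scipy.spatial.distance import cdist

# cdist accepts named metrics or a callable taking two 1-D arrays
a = np.array([[0.0, 0.0], [3.0, 4.0]])
b = np.array([[0.0, 0.0]])

named = cdist(a, b, 'euclidean')                        # [[0.], [5.]]
custom = cdist(a, b, lambda u, v: np.abs(u - v).sum())  # [[0.], [7.]]
```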
spencebeecher commented 7 years ago

https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py#L249

Any updates on this? If I don't hear back from you in a week I am going to close this out!

tejasshah93 commented 7 years ago

Hi @spencebeecher - Apologies for not getting back earlier; I missed the notification! Thanks for the prompt and detailed response to the issue, and for the other insights!

From what I understood, in order to define my own custom distance metric in pysparnn, I have to define a class UserCustomDistance (say), with its constructor and the methods shown in the snippet above, viz.
_transform_value, features_to_matrix, vstack, and
_distance, which defines the distance metric applied over the matrices.

Having written the above class and its methods, simply passing distance_type=pysparnn.matrix_distance.UserCustomDistance when building the nearest-neighbor index would do the job. Sounds right?

Let me know if I'm not headed in the right direction. Otherwise, you may close the issue for now, and I'll reopen it for clarifications during implementation, if any. Thanks!

spencebeecher commented 7 years ago

That is totally it! Please let me know if you run into any issues or have suggestions for improvement! Thanks!

tejasshah93 commented 7 years ago

Sure thing! :+1:

tejasshah93 commented 7 years ago

Hi @spencebeecher,

As discussed, I tried implementing a custom distance class and its allied methods, using sklearn.metrics.pairwise.pairwise_distances in _distance() with a callable metric function passed as an argument to it.

The training dataset is a sparse matrix of size NxN, where N is around 900,000. (Machine configuration: 64 GB RAM, 12 cores.)

However, it seems that the index creation itself is taking quite long to complete (4 hours and still running). I had a brief look at cluster_index.py but couldn't find a parameter to specify the number of workers/jobs for index creation.

Is this expected behavior when using sklearn.metrics.pairwise.pairwise_distances (as you probably alluded to earlier)? Or could you let me know the time complexity of indexing with respect to the size of the matrix?
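(A rough back-of-envelope sketch, under the assumption that the cluster index compares each record against on the order of sqrt(N) cluster representatives during construction, suggests why this is slow at N = 900,000:)

```python
# back-of-envelope estimate; assumes each of the N records is compared
# against roughly sqrt(N) cluster representatives during the index build
N = 900000
comparisons = N * int(N ** 0.5)
print(comparisons)  # roughly 8.5e8 distance evaluations
```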

Thanks!

spencebeecher commented 7 years ago

Hi! Could you share a copy of the code? I can try to reproduce on my end. Are you able to get it to work with a different distance measure?


tejasshah93 commented 7 years ago

Hi @spencebeecher

Indeed. I ran the code against a small dummy dataset to validate the nearest neighbors and the custom distance values - it seems to be working as expected! This gist contains the custom distance class I wrote (with one value modified for testing purposes). After appending this class to matrix_distance.py, I updated the bindings by executing $ python setup.py install --user from the repository folder.

For reproducing the experiment on your end, you can create a random sparse matrix of NxN (N=900000) and run

>>> cp = ci.MultiClusterIndex(dataset, data_to_return, distance_type=pysparnn.matrix_distance.UserCustomDistance)

where data_to_return = range(dataset.shape[0])
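A minimal sketch of that reproduction setup (names taken from the snippet above; scipy.sparse.rand builds the random matrix, with ~10 nonzeros per row to match the sparsity discussed below):

```python
import scipy.sparse

def make_dataset(n, nnz_per_row=10):
    # random sparse CSR matrix with ~nnz_per_row nonzeros per row on average
    return scipy.sparse.rand(n, n, density=float(nnz_per_row) / n,
                             format='csr')

# the experiment above used n=900000; a small n keeps this sketch quick
dataset = make_dataset(1000)
data_to_return = range(dataset.shape[0])
```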

Let me know if anything doesn't seem fine or otherwise. Thanks!

spencebeecher commented 7 years ago

Thanks for the detailed reply - I'll have a look tonight and get back to you. I suspect that the sklearn distance metric may not be as performant as we would like; I might be able to write something custom.

Thanks again!


tejasshah93 commented 7 years ago

Appreciate your prompt response @spencebeecher! For diagnosis: the index creation step still hasn't completed (> 24 hrs and counting). So yes, your suspicion about the sklearn/scipy distance metric performance is probably right.

Sure, let me know when you're done with the implementation (or notify me if I've erred somewhere in writing the distance class).

Thanks!

spencebeecher commented 7 years ago

My solution may be faster - I tested the new implementation on sample data and was able to get the same results. I am currently running this on my laptop, which does not have much power; I'll let you know how long it took to index. Note that I did make the matrix rather sparse (only 10 elements per record on average).

tejasshah93 commented 7 years ago

Hi,

I'm not entirely sure what you're attempting when defining the distance metric here. I guess that's why the results differ for the sample matrices mentioned in the gist. (How did you get the same results?)

In [6]: u_dist = UserCustomDistance(A, range(A.shape[0]))
    ...: u_dist._distance(A)
Out[6]: 
array([[-2.,  1.,  0.],
       [ 1.,  1.,  1.],
       [ 0.,  1.,  0.]])

In [7]: su_dist = SpencerUserCustomDistance(A, range(A.shape[0]))
    ...: su_dist._distance(A)
Out[7]: 
array([[-4,  2,  0],
       [-1,  0,  0],
       [-2,  1,  0]])

For clarification, the custom distance value is computed as follows: in user_distance_metric(), take np.minimum over the data of the two sparse vectors being compared, then subtract the sum of the resulting array from self.max_overlap.
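In other words (a hypothetical restatement of the metric just described, with dense vectors for brevity):

```python
import numpy as np

def user_distance_metric(u, v, max_overlap):
    # overlap = sum of elementwise minima of the two vectors
    overlap = np.minimum(u, v).sum()
    return max_overlap - overlap

u = np.array([1, 2, 3])
v = np.array([2, 1, 3])
print(user_distance_metric(u, v, 10))  # overlap = 1 + 1 + 3 = 5, so 10 - 5 = 5
```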


However, even when I proceeded with creating the index as mentioned in the gist, it aborted with the following error:

. .
/home/tejas/.local/lib/python2.7/site-packages/pysparnn/cluster_index.pyc in __init__(self, features, records_data, distance_type, matrix_size, parent)
    135                 max_rng = min(rng + rng_step, features.shape[0])
    136                 records_rng = features[rng:max_rng]
--> 137                 for i, clstrs in enumerate(root.nearest_search(records_rng, k=1)):
    138                     for _, cluster in clstrs:
    139                         item_to_clusters[cluster].append(i + rng)

/home/tejas/.local/lib/python2.7/site-packages/pysparnn/matrix_distance.pyc in nearest_search(self, features, k, max_distance)
    102         """
    103 
--> 104         dist_matrix = self._distance(features)
    105 
    106         if max_distance is None:

<ipython-input-2-9ac6b3897d5b> in _distance(self, a_matrix)
     69 
     70     def _distance(self, a_matrix):
---> 71         return np.array(self.matrix.sum(axis=1) - self.matrix.dot(a_matrix.transpose()).transpose())
     72 
     73 

/home/tejas/.local/lib/python2.7/site-packages/scipy/sparse/base.pyc in __mul__(self, other)
    350         if issparse(other):
    351             if self.shape[1] != other.shape[0]:
--> 352                 raise ValueError('dimension mismatch')
    353             return self._mul_sparse_matrix(other)
    354 

ValueError: dimension mismatch

Any pointers?

spencebeecher commented 7 years ago

Whoops - you are totally right! I had a bug that I must have introduced. Try this code; I have only tried it on 100k records, not 1 million.

https://gist.github.com/spencebeecher/04e79631e9725ea25889a53117319412

(same as the gist just with notebook output) tej_notebook.pdf

tejasshah93 commented 7 years ago

Hi @spencebeecher ,

I tried the above-mentioned modified code on 1 million records and it executed in just about 225 seconds (where matrix.sum(axis=1).mean() returned 99.997115170633236). That's really fast by comparison!

As you mentioned, the results are the same for the sample matrices this time. However, when I checked the output of the previous gist on the sample matrices, it gave different results: I'm not sure of the approach used to define the custom distance in SpencerUserCustomDistance. Maybe have a glance at it once?

Thanks!

spencebeecher commented 7 years ago

It was a bug (kinda)!

I had 'max_overlap' specified per record as matrix.sum(axis=1), which meant the maximum distance varied by record in the index. Setting max_overlap to matrix.shape[0] fixed the issue.
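For binary feature vectors the elementwise-min overlap reduces to a dot product, so a vectorized version of the fixed distance might look like this (a sketch under that assumption; the names are illustrative, not the gist's exact code):

```python
import numpy as np
import scipy.sparse

def overlap_distance(index_matrix, query_matrix, max_overlap):
    # for 0/1 data, sum(minimum(u, v)) == u.dot(v), so sparse matmul works
    overlap = query_matrix.dot(index_matrix.T).toarray()
    # a single global max_overlap keeps distances comparable across records
    return max_overlap - overlap

index = scipy.sparse.csr_matrix(np.array([[1, 0, 1], [0, 1, 1]]))
query = scipy.sparse.csr_matrix(np.array([[1, 1, 1]]))
print(overlap_distance(index, query, 3))  # [[1 1]]
```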

Old and new