tejasshah93 closed this issue 7 years ago.
Hi @tejasshah93 - Great question!
There isn't anything custom like that out of the box. However, it is pretty easy to implement your own custom distance metrics. By default, pysparnn uses sparse matrices + cosine similarity. This is where the library really shines. I am not aware of many Python-based libraries that do sparse search accurately and efficiently (https://github.com/ekzhu/datasketch and https://github.com/src-d/minhashcuda both look good).
However here is an example (already in the code) of using an alternative distance. https://github.com/facebookresearch/pysparnn/blob/master/examples/dense_matrix.ipynb
Here is an example (that should mostly work) that will let you use custom distances. scipy.spatial.distance.cdist provides a variety of distance types. Caution - I have found these distance functions to be much slower than what is currently in pysparnn. This is probably reasonable for testing but writing customized & efficient code is probably the best way forward. More examples of custom distances in pysparnn can be found here: https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py
import numpy as np
import scipy.sparse
import scipy.spatial.distance

class SlowEuclideanDistance(MatrixMetricSearch):
    def __init__(self, records_features, records_data):
        super(SlowEuclideanDistance, self).__init__(records_features,
                                                    records_data,
                                                    is_similarity=False)
        # cdist operates on dense arrays, so densify up front
        self.matrix = self.matrix.toarray()

    @staticmethod
    def features_to_matrix(features):
        """
        Args:
            features: A list of features to be formatted.
        Returns:
            The transformed matrix.
        """
        return np.array(features, ndmin=2)

    @staticmethod
    def vstack(matrix_list):
        """
        Args:
            matrix_list: A list of matrices to be stacked.
        Returns:
            The stacked matrix.
        """
        return np.vstack(matrix_list)

    def _transform_value(self, v):
        return v

    def _similarity(self, a_matrix):
        """Euclidean distance"""
        # need to handle flipping argmin k to positive
        return scipy.sparse.csr_matrix(scipy.spatial.distance.cdist(
            a_matrix.toarray(),
            self.matrix, 'euclidean'))
https://github.com/facebookresearch/pysparnn/blob/master/pysparnn/matrix_distance.py#L249
Any updates on this? If I don't hear back from you in a week I am going to close this out!
Hi @spencebeecher - Apologies for not getting back earlier. Missed the notification! Thanks for the prompt and detailed response to the issue, along with the other insights!
From what I understood, in order to define my own custom distance metric in pysparnn, I have to define a class UserCustomDistance (say), with its constructor and the methods mentioned in the snippet above / here, viz. _transform_value, features_to_matrix, vstack, and _distance, which basically defines the distance metric to be operated over the matrices.
Having written the above class and its methods, simply using distance_type=pysparnn.matrix_distance.UserCustomDistance while calculating the nearest neighbors would do the job. Sounds right?
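The plan above can be sketched in standalone form like this (hypothetical names; a real implementation would subclass pysparnn's MatrixMetricSearch rather than object, and Manhattan distance here is just a stand-in for whatever custom metric is wanted):

```python
import numpy as np
import scipy.sparse
import scipy.spatial.distance

class UserCustomDistanceSketch(object):
    def __init__(self, records_features, records_data):
        self.matrix = scipy.sparse.csr_matrix(records_features)
        self.records_data = records_data

    @staticmethod
    def features_to_matrix(features):
        # Turn a list of raw feature rows into the matrix type used here.
        return scipy.sparse.csr_matrix(np.array(features, ndmin=2))

    @staticmethod
    def vstack(matrix_list):
        # Stack several record matrices into one.
        return scipy.sparse.vstack(matrix_list)

    def _transform_value(self, v):
        # Identity transform; hook for per-value preprocessing.
        return v

    def _distance(self, a_matrix):
        # Manhattan (cityblock) distance as a stand-in custom metric.
        return scipy.spatial.distance.cdist(
            a_matrix.toarray(), self.matrix.toarray(), 'cityblock')
```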
Let me know if I'm not headed in the right direction. Otherwise, you may close the issue for now, and I'll reopen it later for clarifications during implementation, if any. Thanks!
That is totally it! Please let me know if you run into any issues or have suggestions for improvement! Thanks!
Sure thing! :+1:
Hi @spencebeecher,
As discussed, I tried implementing a custom distance class and its allied methods using sklearn.metrics.pairwise.pairwise_distances in _distance(), with a callable metric function as an argument to it.
The training dataset is a sparse matrix of size NxN (N is around 900,000; machine configuration: 64 GB RAM, 12 cores).
However, it seems that the index creation itself is taking quite long to complete (4 hours and still running). I had a brief look at cluster_index.py but couldn't find a parameter to specify the number of workers/jobs for index creation.
Is this expected behavior when using sklearn.metrics.pairwise.pairwise_distances (as you probably mentioned earlier)? Also, could you let me know the time complexity of indexing with respect to the size of the matrix?
Thanks!
Hi! Could you share a copy of the code? I can try to reproduce on my end. Are you able to get it to work with a different distance measure?
Hi @spencebeecher
Indeed. I ran the code against a small dummy dataset to validate the nearest neighbors and the custom distance values - it seems to be working as expected! This gist contains the custom distance class I wrote (I modified a value for testing purposes). After appending this class to matrix_distance.py, I updated the bindings by executing $ python setup.py install --user from the repository folder.
For reproducing the experiment on your end, you can create a random sparse matrix of NxN (N=900000) and run
>>> cp = ci.MultiClusterIndex(dataset, data_to_return, distance_type=pysparnn.matrix_distance.UserCustomDistance)
where data_to_return = range(dataset.shape[0])
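To generate such a random sparse matrix, something like this should work (a sketch with a much smaller N for illustration; the density is chosen to give roughly 10 nonzeros per row):

```python
import scipy.sparse

N = 1000              # the experiment above uses N = 900000
density = 10.0 / N    # ~10 nonzero elements per row on average

# Random NxN CSR matrix standing in for the training dataset.
dataset = scipy.sparse.rand(N, N, density=density, format='csr',
                            random_state=0)
data_to_return = range(dataset.shape[0])
```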
Let me know if anything doesn't seem fine or otherwise. Thanks!
Thanks for the detailed reply - I'll have a look tonight and get back to you. I suspect that the sklearn distance metric may not be as performant as we would like. I might be able to write something custom.
Thanks again!
Appreciate your prompt response @spencebeecher! For diagnosis: the index creation step hasn't completed yet (> 24 hrs and still running). So yes, your suspicion regarding the sklearn/scipy distance metric performance is probably right.
Sure, let me know when you're done with the implementation (or let me know if I've erred somewhere while writing the distance class).
Thanks!
My solution may be faster - I tested the new implementation on sample data and was able to get the same results. I am currently running this on my laptop, which does not have much power. I'll let you know how long it took to index. Note that I did make the matrix rather sparse (only 10 elements per record on average).
Hi,
I'm not entirely sure what you're attempting when defining the distance metric here. I guess that's why the results are different for the sample matrices mentioned in the gist (how did you get the same results?).
In [6]: u_dist = UserCustomDistance(A, range(A.shape[0]))
   ...: u_dist._distance(A)
Out[6]:
array([[-2.,  1.,  0.],
       [ 1.,  1.,  1.],
       [ 0.,  1.,  0.]])

In [7]: su_dist = SpencerUserCustomDistance(A, range(A.shape[0]))
   ...: su_dist._distance(A)
Out[7]:
array([[-4,  2,  0],
       [-1,  0,  0],
       [-2,  1,  0]])
For clarification, the custom distance value in user_distance_metric() is computed as follows: take np.minimum over the data of the two sparse vectors to be compared, and then subtract the sum of the resultant array from self.max_overlap.
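As a standalone sketch of that definition (with max_overlap passed explicitly here; in the class it would come from self.max_overlap):

```python
import numpy as np

def user_distance_metric(u, v, max_overlap):
    # Overlap between two vectors = sum of their elementwise minima;
    # the distance subtracts that overlap from max_overlap.
    return max_overlap - np.minimum(u, v).sum()
```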
However, even when I proceeded with creating the index as mentioned in the gist, it aborted with the following error:
...
/home/tejas/.local/lib/python2.7/site-packages/pysparnn/cluster_index.pyc in __init__(self, features, records_data, distance_type, matrix_size, parent)
135 max_rng = min(rng + rng_step, features.shape[0])
136 records_rng = features[rng:max_rng]
--> 137 for i, clstrs in enumerate(root.nearest_search(records_rng, k=1)):
138 for _, cluster in clstrs:
139 item_to_clusters[cluster].append(i + rng)
/home/tejas/.local/lib/python2.7/site-packages/pysparnn/matrix_distance.pyc in nearest_search(self, features, k, max_distance)
102 """
103
--> 104 dist_matrix = self._distance(features)
105
106 if max_distance is None:
<ipython-input-2-9ac6b3897d5b> in _distance(self, a_matrix)
69
70 def _distance(self, a_matrix):
---> 71 return np.array(self.matrix.sum(axis=1) - self.matrix.dot(a_matrix.transpose()).transpose())
72
73
/home/tejas/.local/lib/python2.7/site-packages/scipy/sparse/base.pyc in __mul__(self, other)
350 if issparse(other):
351 if self.shape[1] != other.shape[0]:
--> 352 raise ValueError('dimension mismatch')
353 return self._mul_sparse_matrix(other)
354
ValueError: dimension mismatch
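(For reference, the check that raises here is scipy's sparse-product shape validation; a minimal standalone reproduction, unrelated to the actual data:)

```python
import scipy.sparse

a = scipy.sparse.csr_matrix([[1, 0, 2], [0, 1, 0]])  # shape (2, 3)
b = scipy.sparse.csr_matrix([[1, 0], [0, 1]])        # shape (2, 2)

# a.dot(b) requires a.shape[1] == b.shape[0]; here 3 != 2, so scipy
# raises ValueError('dimension mismatch'), as in the traceback above.
try:
    a.dot(b)
    raised = False
except ValueError:
    raised = True
```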
Any pointers?
Whoops - you are totally right! I had a bug that I must have introduced. Try this code. I have only tried it on 100k records, not 1 million.
https://gist.github.com/spencebeecher/04e79631e9725ea25889a53117319412
(Same as the gist, just with notebook output.) tej_notebook.pdf
Hi @spencebeecher ,
I tried the above-mentioned modified code for 1 million records and it executed in just about 225 seconds (where matrix.sum(axis=1).mean() returned 99.997115170633236). That's really fast in comparison!
As you mentioned, the results are the same for the sample matrices this time. However, when I checked the output for the sample matrices from the previous gist, it gives different results: I'm not sure of the approach used in defining the custom distance in SpencerUserCustomDistance. Maybe have a glance at it once?
Thanks!
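One possible source of such a discrepancy (an observation, not something confirmed in the thread): for 0/1 vectors the elementwise minimum equals the elementwise product, so sum(minimum(u, v)) coincides with the dot product u·v. A dot-product-based vectorization of the min-overlap metric therefore agrees with the np.minimum definition only on binary data:

```python
import numpy as np

# Binary vectors: minimum(u, v) == u * v elementwise, so the sums match.
u_bin = np.array([1, 0, 1, 1])
v_bin = np.array([1, 1, 0, 1])
same_on_binary = np.minimum(u_bin, v_bin).sum() == u_bin.dot(v_bin)

# General (non-binary) vectors: the two quantities diverge.
u = np.array([2.0, 0.0, 3.0])
v = np.array([1.0, 1.0, 5.0])
same_on_general = np.minimum(u, v).sum() == u.dot(v)
```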
Hi,
PySparNN looks great! I'd like to know if there is a provision for using custom distances as a metric while searching for nearest neighbors.
E.g., for sklearn.neighbors.NearestNeighbors, the documentation mentions:
Do we have such a provision for custom distances in PySparNN? Thanks.
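(For comparison, a minimal sketch of the sklearn provision referred to above, using a user-defined callable as the metric:)

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# A user-defined callable metric (simple Manhattan distance here).
def my_metric(u, v):
    return np.abs(u - v).sum()

X = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])
nn = NearestNeighbors(n_neighbors=2, algorithm='brute', metric=my_metric)
nn.fit(X)
dist, idx = nn.kneighbors(np.array([[0.1, 0.0]]))
```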