Closed jolespin closed 5 months ago
skhubness v0.30 does not provide this convenience function. However, it should be relatively straight-forward to obtain any of these.
ann = NMSlibTransformer.fit(X_index)
. Calling ann.transform(X_query)
gives you a kneighbors-graph, that is, a sparse scipy CSR matrix.distance
mode by default. Now I don't remember exactly but believe an sklearn kneighbors_graph(mode="connectivity")
simply replaces all distances with a 1.
, where the implicit zeros in the sparse matrix mean "not connected". If this is true, you can create a connectivity graph from the distance graph by doing graph.data[:] = 1.
X_index
in X_query
. Otherwise, if you cannot be sure, whether some indexed samples are in the query, you can again manipulate the CSR matrix, e.g. like so:
# Self distances could be non-zero due to numerical instabilities
graph.data[graph.data < 1e-6] = 0
# Get rid of all explicit zeros
graph.eliminate_zeros()
Of course, this also removes query samples that are very similar to indexed samples.
The include_self
feature is a pretty tricky problem and something I've actually been struggling with for a bit. More specifically, calculating the distance/connectivity w/ include_self=True
from a distance calculated with include_self=False
.
Here's an example:
from sklearn.datasets import make_classification
from sklearn.neighbors import kneighbors_graph, KNeighborsTransformer
X, _ = make_classification(n_samples=10, n_features=4, n_classes=2, n_clusters_per_class=1, random_state=0)
n_neighbors=3
# Nearest neighbors
nn_with_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=True,n_jobs=-1).todense()
nn_without_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=False,n_jobs=-1).todense()
nn_from_transformer = KNeighborsTransformer(mode="distance", n_neighbors=n_neighbors, metric="euclidean", n_jobs=-1).fit_transform(X)
np.all(nn_from_transformer == nn_without_self)
# True
np.all(nn_with_self == nn_without_self)
# False
# Is `nn_with_self` symmetric?
np.allclose(nn_with_self,nn_with_self.T)
# False
# Is `nn_without_self` symmetric?
np.allclose(nn_without_self,nn_without_self.T)
# False
If I can calculate nn_with_self
from nn_without_self
then I can generalize any of the transformers to produce the mode=connectivity, include_self=True
functionality. Trying to integrate this approach into some manifold methods that use KNN calculations but require mode=connectivity, include_self=True
. skhubness
has few dependencies so it would be great choice to wrap in the backend.
I figured this one out but it might not be the most efficient:
nn_from_transformer_reconstructed = nn_from_transformer.copy()
for i,row in enumerate(nn_from_transformer):
index_max = np.argmax(row)
nn_from_transformer_reconstructed[i,index_max] = 0
np.allclose(nn_from_transformer_reconstructed, nn_with_self)
np.all(nn_from_transformer == nn_without_self)
# True
This is misleading. nn_from_transformer
does contain self as nearest neighbors, but at the same time gives you k+1 nearest neighbors (docs). So when you create a dense array from nn_without_self
the implicit zeros of self distances become explicit, and all elements between the two arrays are equal. But still, nn_from_transformer
has essentially include_self=True
.
Similarly, the lower code box is problematic. The k-th nearest neighbor gets its distance set to zero, making it the actual closest neighbor. Only as a dense array is this then identical to nn_with_self
. This doesn't really mean anything, because in the dense array, most elements are zero, and would therefore be considered nearest neighbors. It's important that the neighbors graph is a sparse matrix that only stores explicit values for distances to nearest neighbors. The moment this is cast to a dense array, explicit 0s of self distances and implicit 0s of non-neighbors become indistinguishable.
Anyway, if I understand correctly, the solution might simply be the following to actually get rid of the one too many neighbors:
KNeighborsTransformer(..., n_neighbors=n_neighbors-1, ...)
Regarding the symmetry checks. Do you need the neighbors graph to be symmetric? It's quite possible that the nearest neighbor of a point next to the border itself has a different point as its nearest neighbor.
I just looked into it and you're completely right. Even though the end result was the same, it's still very misleading and n_neighbors - 1
achieved the same thing I was doing manually post hoc. Thanks for your input. This will save me from a lot of wasted compute.
Glad I could help
One of the features that was attractive for
skhubness
was the generalization ofkneighbors_graph
to use the approximated nearest neighbors WITH the ability to selectmode=distance/connectivity
ANDinclude_self=True/False
https://scikit-hubness.readthedocs.io/en/latest/documentation/_autosummary/skhubness.neighbors.kneighbors_graph.html#skhubness.neighbors.kneighbors_graph
Is there anyway to access this functionality in the developmental version?