VarIr / scikit-hubness

A Python package for hubness analysis and high-dimensional data mining
BSD 3-Clause "New" or "Revised" License
44 stars 9 forks source link

How to access skhubness.neighbors.kneighbors_graph in v0.30a? #112

Closed jolespin closed 6 months ago

jolespin commented 6 months ago

One of the features that was attractive for skhubness was the generalization of kneighbors_graph to use the approximated nearest neighbors WITH the ability to select mode=distance/connectivity AND include_self=True/False

https://scikit-hubness.readthedocs.io/en/latest/documentation/_autosummary/skhubness.neighbors.kneighbors_graph.html#skhubness.neighbors.kneighbors_graph

Is there anyway to access this functionality in the developmental version?

VarIr commented 6 months ago

skhubness v0.30 does not provide this convenience function. However, it should be relatively straight-forward to obtain any of these.

jolespin commented 6 months ago

The include_self feature is a pretty tricky problem and something I've actually been struggling with for a bit. More specifically, calculating the distance/connectivity w/ include_self=True from a distance calculated with include_self=False.

Here's an example:

from sklearn.datasets import make_classification
from sklearn.neighbors import kneighbors_graph, KNeighborsTransformer

X, _ = make_classification(n_samples=10, n_features=4, n_classes=2, n_clusters_per_class=1, random_state=0)
n_neighbors=3

# Nearest neighbors
nn_with_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=True,n_jobs=-1).todense()
nn_without_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=False,n_jobs=-1).todense()
nn_from_transformer = KNeighborsTransformer(mode="distance", n_neighbors=n_neighbors, metric="euclidean", n_jobs=-1).fit_transform(X)

np.all(nn_from_transformer == nn_without_self)
# True

np.all(nn_with_self == nn_without_self)
# False

# Is `nn_with_self` symmetric?
np.allclose(nn_with_self,nn_with_self.T)
# False

# Is `nn_without_self` symmetric?
np.allclose(nn_without_self,nn_without_self.T)
# False

If I can calculate nn_with_self from nn_without_self then I can generalize any of the transformers to produce the mode=connectivity, include_self=True functionality. Trying to integrate this approach into some manifold methods that use KNN calculations but require mode=connectivity, include_self=True. skhubness has few dependencies so it would be great choice to wrap in the backend.

I figured this one out but it might not be the most efficient:

nn_from_transformer_reconstructed = nn_from_transformer.copy()
for i,row in enumerate(nn_from_transformer):
    index_max = np.argmax(row)
    nn_from_transformer_reconstructed[i,index_max] = 0
np.allclose(nn_from_transformer_reconstructed, nn_with_self)
VarIr commented 6 months ago
np.all(nn_from_transformer == nn_without_self)
# True

This is misleading. nn_from_transformer does contain self as nearest neighbors, but at the same time gives you k+1 nearest neighbors (docs). So when you create a dense array from nn_without_self the implicit zeros of self distances become explicit, and all elements between the two arrays are equal. But still, nn_from_transformer has essentially include_self=True.

Similarly, the lower code box is problematic. The k-th nearest neighbor gets its distance set to zero, making it the actual closest neighbor. Only as a dense array is this then identical to nn_with_self. This doesn't really mean anything, because in the dense array, most elements are zero, and would therefore be considered nearest neighbors. It's important that the neighbors graph is a sparse matrix that only stores explicit values for distances to nearest neighbors. The moment this is cast to a dense array, explicit 0s of self distances and implicit 0s of non-neighbors become indistinguishable. Anyway, if I understand correctly, the solution might simply be the following to actually get rid of the one too many neighbors:

KNeighborsTransformer(..., n_neighbors=n_neighbors-1, ...)

Regarding the symmetry checks. Do you need the neighbors graph to be symmetric? It's quite possible that the nearest neighbor of a point next to the border itself has a different point as its nearest neighbor.

jolespin commented 6 months ago

I just looked into it and you're completely right. Even though the end result was the same, it's still very misleading and n_neighbors - 1 achieved the same thing I was doing manually post hoc. Thanks for your input. This will save me from a lot of wasted compute.

VarIr commented 6 months ago

Glad I could help