How to access skhubness.neighbors.kneighbors_graph in v0.30a?

jolespin commented 6 months ago

One of the attractive features of skhubness was its generalization of kneighbors_graph to approximate nearest neighbors WITH the ability to select mode=distance/connectivity AND include_self=True/False.

https://scikit-hubness.readthedocs.io/en/latest/documentation/_autosummary/skhubness.neighbors.kneighbors_graph.html#skhubness.neighbors.kneighbors_graph

Is there any way to access this functionality in the development version?

VarIr commented 6 months ago

skhubness v0.30 does not provide this convenience function. However, it should be relatively straightforward to obtain each of these behaviors.

ANN: Let's consider, for example, ann = NMSlibTransformer().fit(X_index). Calling ann.transform(X_query) gives you a kneighbors graph, that is, a sparse scipy CSR matrix.
mode: This graph will be in distance mode by default. I don't remember exactly, but I believe sklearn's kneighbors_graph(mode="connectivity") simply replaces all distances with 1., where the implicit zeros in the sparse matrix mean "not connected". If this is true, you can create a connectivity graph from the distance graph by doing graph.data[:] = 1.
include self: This might be tricky. If it is possible in your use case, control this by providing or not providing samples from X_index in X_query. Otherwise, if you cannot be sure whether some indexed samples are in the query, you can again manipulate the CSR matrix, e.g. like so:
```
# Self distances could be non-zero due to numerical instabilities
graph.data[graph.data < 1e-6] = 0
# Get rid of all explicit zeros
graph.eliminate_zeros()
```
Of course, this also removes the entries for query samples that are not identical but merely very similar to an indexed sample.
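The three steps above can be sketched end to end as follows. This is only an illustration under stated assumptions: sklearn's KNeighborsTransformer stands in for the ANN transformer (any transformer returning a CSR distance graph should behave similarly), the toy data and variable names are made up, and the data is assumed to contain no duplicate rows.

```python
import numpy as np
from sklearn.neighbors import KNeighborsTransformer, kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 4))  # toy data, assumed to have no duplicate rows
k = 3

# Stand-in for an ANN transformer. In distance mode, sklearn's transformer
# stores k + 1 neighbors per row when the query equals the indexed data:
# the sample itself plus its k nearest other samples.
graph = KNeighborsTransformer(mode="distance", n_neighbors=k).fit_transform(X)

# include_self=False: set (near-)zero self distances to 0, then drop them.
no_self = graph.copy()
no_self.data[no_self.data < 1e-6] = 0
no_self.eliminate_zeros()

# mode="connectivity": overwrite every stored distance with 1.
connectivity = no_self.copy()
connectivity.data[:] = 1.0

# Compare against sklearn's own convenience function.
expected = kneighbors_graph(X, n_neighbors=k, mode="connectivity", include_self=False)
print(np.array_equal(connectivity.toarray(), expected.toarray()))
```

Removing self distances before overwriting the data with ones matters: once every stored value is 1, self edges can no longer be told apart from real neighbors by value.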

jolespin commented 6 months ago

The include_self feature is a pretty tricky problem and something I've actually been struggling with for a bit. More specifically, calculating the distance/connectivity graph w/ include_self=True from a distance graph calculated with include_self=False.

Here's an example:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.neighbors import kneighbors_graph, KNeighborsTransformer

X, _ = make_classification(n_samples=10, n_features=4, n_classes=2, n_clusters_per_class=1, random_state=0)
n_neighbors=3

# Nearest neighbors
nn_with_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=True,n_jobs=-1).todense()
nn_without_self = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance", metric="euclidean", include_self=False,n_jobs=-1).todense()
nn_from_transformer = KNeighborsTransformer(mode="distance", n_neighbors=n_neighbors, metric="euclidean", n_jobs=-1).fit_transform(X)

np.all(nn_from_transformer == nn_without_self)
# True

np.all(nn_with_self == nn_without_self)
# False

# Is `nn_with_self` symmetric?
np.allclose(nn_with_self,nn_with_self.T)
# False

# Is `nn_without_self` symmetric?
np.allclose(nn_without_self,nn_without_self.T)
# False

If I can calculate nn_with_self from nn_without_self, then I can generalize any of the transformers to produce the mode=connectivity, include_self=True functionality. I'm trying to integrate this approach into some manifold methods that use KNN calculations but require mode=connectivity, include_self=True. skhubness has few dependencies, so it would be a great choice to wrap in the backend.

I figured this one out but it might not be the most efficient:

nn_from_transformer_reconstructed = nn_from_transformer.copy()
for i,row in enumerate(nn_from_transformer):
    index_max = np.argmax(row)
    nn_from_transformer_reconstructed[i,index_max] = 0
np.allclose(nn_from_transformer_reconstructed, nn_with_self)

VarIr commented 6 months ago

np.all(nn_from_transformer == nn_without_self)
# True

This is misleading. nn_from_transformer does contain each sample as its own nearest neighbor, and at the same time it gives you k+1 nearest neighbors (see the KNeighborsTransformer docs). So when you create a dense array from nn_without_self, the implicit zeros of the self distances become explicit, and all elements of the two arrays compare equal. But still, nn_from_transformer essentially has include_self=True.

Similarly, the second code snippet is problematic. The k-th nearest neighbor gets its distance set to zero, which would make it the actual closest neighbor. Only as a dense array is the result identical to nn_with_self. This doesn't really mean anything, because in the dense array most elements are zero and would therefore all be considered nearest neighbors. It's important that the neighbors graph is a sparse matrix that stores explicit values only for distances to nearest neighbors. The moment it is cast to a dense array, explicit 0s of self distances and implicit 0s of non-neighbors become indistinguishable. Anyway, if I understand correctly, the solution might simply be the following, which actually gets rid of the extra neighbor:

KNeighborsTransformer(..., n_neighbors=n_neighbors-1, ...)
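To make the sparse-versus-dense point concrete, here is a small sketch that compares the stored entries rather than the dense arrays. It assumes plain sklearn with toy random data; the variable names are illustrative and not taken from the thread.

```python
import numpy as np
from sklearn.neighbors import KNeighborsTransformer, kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 4))  # toy data, assumed to have no duplicate rows
k = 3

with_self = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=True)
without_self = kneighbors_graph(X, n_neighbors=k, mode="distance", include_self=False)
from_transformer = KNeighborsTransformer(mode="distance", n_neighbors=k).fit_transform(X)
fewer = KNeighborsTransformer(mode="distance", n_neighbors=k - 1).fit_transform(X)

# Dense comparison hides the difference: the explicit self zeros of the
# transformer output look just like the implicit zeros of without_self.
print(np.allclose(from_transformer.toarray(), without_self.toarray()))

# The stored structure differs: k + 1 entries per row versus k.
print(np.diff(from_transformer.indptr), np.diff(without_self.indptr))

# With n_neighbors=k-1 the transformer stores self plus k - 1 neighbors,
# the same column indices and distances as include_self=True.
same_columns = np.array_equal(fewer.indices, with_self.indices)
same_values = np.allclose(fewer.data, with_self.data)
print(same_columns, same_values)
```

Comparing indptr, indices, and data keeps explicit zeros visible, which is the information a dense cast throws away.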

Regarding the symmetry checks: do you need the neighbors graph to be symmetric? In general it is not. It's quite possible that the nearest neighbor of a point near the border of the data itself has a different point as its nearest neighbor.
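A minimal example of why a kNN graph need not be symmetric, using three made-up points on a line: the outlying point at 3 has the point at 1 as its nearest neighbor, but the point at 1 is closer to the point at 0.

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

# Three toy points on a line; the last one lies at the "border".
X = np.array([[0.0], [1.0], [3.0]])

graph = kneighbors_graph(
    X, n_neighbors=1, mode="connectivity", include_self=False
).toarray()

# Row i marks the nearest neighbor of point i.
print(graph)

is_symmetric = np.array_equal(graph, graph.T)
print(is_symmetric)
```

Here the edge from point 2 to point 1 exists, while the reverse edge does not, so the symmetry check fails even though the graph is correct.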

jolespin commented 6 months ago

I just looked into it and you're completely right. Even though the end result was the same, it's still very misleading and n_neighbors - 1 achieved the same thing I was doing manually post hoc. Thanks for your input. This will save me from a lot of wasted compute.

VarIr commented 6 months ago

Glad I could help

VarIr / scikit-hubness

How to access skhubness.neighbors.kneighbors_graph in v0.30a? #112