VarIr / scikit-hubness

A Python package for hubness analysis and high-dimensional data mining
BSD 3-Clause "New" or "Revised" License

On the use of 'precomputed' to estimate hubness #74

Status: Closed (ivan-marroquin closed 2 years ago)

ivan-marroquin commented 2 years ago

Hi,

I would like to use the option 'precomputed' to estimate hubness in my data set. Below you will find a portion of my Python code:

import numpy as np
from pynndescent import NNDescent
from skhubness import Hubness

n_neighbors = 15

tree = NNDescent(input_data, metric='minkowski', metric_kwds={'p': 0.3}, n_neighbors=n_neighbors, random_state=1969, n_jobs=cpu_count)
tree.prepare()

# Query one extra neighbor so that the first hit, which is the sample itself, can be skipped later
neighbors = np.zeros((input_data.shape[0], n_neighbors), dtype=np.int32)

neig = tree.query(input_data, k=n_neighbors + 1)

# Copy only the (n query x n index) part, without the first column
neighbors = neig[0][:, 1:].copy()

hub = Hubness(k=n_neighbors, return_value='all', metric='precomputed', algorithm='brute', hubness=None, random_state=1969, n_jobs=cpu_count)
hub.fit(neighbors)

I get the following error message:

File "C:\Temp\Python\Python3.6.5\lib\site-packages\scikit_hubness-0.21.3-py3.6.egg\skhubness\analysis\estimation.py", line 618, in score
    k_neighbors = self._k_neighbors_precomputed(X_test, kth, start, end)
File "C:\Temp\Python\Python3.6.5\lib\site-packages\scikit_hubness-0.21.3-py3.6.egg\skhubness\analysis\estimation.py", line 354, in _k_neighbors_precomputed
    d[~np.isfinite(d)] = np.inf
OverflowError: cannot convert float infinity to integer

Any suggestions?

Thanks, Ivan

VarIr commented 2 years ago

Hi Ivan,

I'm not so familiar with PyNNDescent, but it seems tree.query() returns an (indices, distances)-tuple. For hub.fit() you need to pass the distance matrix, not the indices. The following might work:

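# query() appears to return an (indices, distances) tuple; keep both
neigh, dist = tree.query(
    input_data,
    k=n_neighbors + 1,
)
hub = Hubness(
    k=n_neighbors,
    return_value='all',
    metric='precomputed',
    algorithm='brute',
    hubness=None,
    random_state=1969,
    n_jobs=cpu_count,
)
# Pass the distances, not the neighbor indices
hub.fit(dist)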
result = hub.score(has_self_distances=True)
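
As a side note on why passing the index array fails: the traceback ends at `d[~np.isfinite(d)] = np.inf`, and an integer array cannot store infinity. The following is a minimal NumPy-only sketch of that behavior, not the library's actual code. The helper `fill_non_finite` and the toy arrays are hypothetical names made up for illustration.

```python
import numpy as np


def fill_non_finite(d):
    """Mimic the assignment from the traceback on a copy of `d`."""
    d = d.copy()
    # Same expression as in the traceback: overwrite non-finite entries with inf.
    d[~np.isfinite(d)] = np.inf
    return d


# Toy stand-ins (assumed shapes): neighbor indices are integers, distances are floats.
indices = np.array([[0, 3, 7], [1, 4, 2]], dtype=np.int32)
distances = np.array([[0.0, 0.5, np.nan], [0.0, 0.2, 0.9]])

# Integer input: inf has no integer representation, so the assignment raises.
try:
    fill_non_finite(indices)
    int_error = None
except OverflowError as exc:
    int_error = exc

# Float input: the same assignment works and replaces the NaN entry with inf.
filled = fill_non_finite(distances)

print("integer input error:", int_error)
print("float input result:")
print(filled)
```

With a float distance array the same assignment succeeds, which is consistent with passing `dist` rather than the indices to `hub.fit()`.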

ivan-marroquin commented 2 years ago

Hi @VarIr

Many thanks for the suggestion!

Ivan