
VarIr / scikit-hubness

A Python package for hubness analysis and high-dimensional data mining

BSD 3-Clause "New" or "Revised" License

44 stars, 9 forks

Additional ANN methods #22

Closed VarIr closed 3 years ago

VarIr commented 5 years ago

It would be nice to also have wrappers for

[x] annoy: https://github.com/spotify/annoy
[x] ONNG: https://github.com/yahoojapan/NGT/blob/master/README.md#onng
[x] Puffinn: https://github.com/puffinn/puffinn
[x] ~~PyNNdescent: https://github.com/lmcinnes/pynndescent (for custom metric support)~~ (wrappers already exist in the pynndescent package)

UPDATE: For new requests on additional ANN methods, individual Issues should be opened.

VarIr commented 5 years ago

Added annoy in dev branch #21

VarIr commented 5 years ago

Added puffinn in dev branch #30

ivan-marroquin commented 3 years ago

Hi @VarIr

Many thanks for such a great package!

I was wondering if you can include a wrapper for pyNNDescent at https://pynndescent.readthedocs.io/en/latest/api.html

Ivan

VarIr commented 3 years ago

Glad to hear you find the package useful.

Do you have specific reasons to use NNdescent? Last time I checked it seemed to provide inferior results compared to the graph-based methods like HNSW and ONNG.

ivan-marroquin commented 3 years ago

Hi @VarIr

Thanks for asking for my opinion.

I think that NNdescent offers something that other solutions do not have: support for several metrics, including custom metrics. This aspect is very important for me. My data sets are in the form of n x d (where n >>> d, and d varies in [7,12]). As you can see, I would like to have a solution that allows me to test Euclidean, Manhattan, fractional, or custom metrics to see which one helps me deal with a large amount of data in a relatively high-dimensional space. I have also started learning more about hubness and its impact. This is where your package will play an interesting role in the analysis that I conduct.

Ivan
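To make the request above concrete, here is a minimal sketch of what "pluggable metrics" means for a neighbor search. It is an exact brute-force search in plain Python, not PyNNDescent or scikit-hubness code; the function names and the weighted metric are hypothetical and exist only for illustration.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def weighted_manhattan(a, b, weights=(1.0, 10.0)):
    # Hypothetical custom metric: the second feature counts ten times as much.
    return sum(w * abs(x - y) for w, x, y in zip(weights, a, b))

def k_nearest(points, query_index, k, metric):
    """Exact k nearest neighbors of points[query_index] under any callable metric."""
    query = points[query_index]
    candidates = [
        (metric(query, p), i) for i, p in enumerate(points) if i != query_index
    ]
    candidates.sort()  # sort by distance, ties broken by index
    return [i for _, i in candidates[:k]]

points = [(0.0, 0.0), (3.0, 0.0), (0.0, 1.0), (2.0, 2.0)]
results = {
    name: k_nearest(points, 0, 2, metric)
    for name, metric in [
        ("euclidean", euclidean),
        ("manhattan", manhattan),
        ("weighted_manhattan", weighted_manhattan),
    ]
}
print(results)
```

On this toy data the three metrics return three different neighbor lists for the same query point, which is why the choice of metric matters before any hubness analysis is run on the resulting neighbor graph.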

VarIr commented 3 years ago

Thanks for your input. Including support for custom dissimilarity measures seems worthwhile. Currently, most (all?) of the implemented methods at least support Euclidean and Manhattan distances.

I'll add nndescent to the list here. Unfortunately, I cannot give an ETA, as I can work on this project only in my spare time.

ivan-marroquin commented 3 years ago

Thanks for considering pynndescent. In addition to supporting custom metrics, it also supports many more metrics than the annoy package. For instance, the Minkowski metric with 0<p<1 seems to provide better results than Euclidean distance when working with high-dimensional data.

VarIr commented 3 years ago

Yes, that's why skhubness allows fractional norms while sklearn doesn't. :)
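For readers unfamiliar with fractional norms, a small sketch of the Minkowski dissimilarity with 0<p<1 follows. It is plain Python, not the skhubness implementation, and the `minkowski` helper is written here for illustration only. It also shows that the triangle inequality fails for p<1, so index structures that rely on that inequality cannot be assumed to work with it.

```python
def minkowski(a, b, p):
    """Minkowski dissimilarity (sum_i |a_i - b_i|^p)^(1/p) for any p > 0."""
    if p <= 0:
        raise ValueError("p must be positive")
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)

a, b, c = (0.0, 0.0), (1.0, 0.0), (1.0, 1.0)

# With p = 0.5 the direct path a -> c is longer than the detour via b,
# so the triangle inequality does not hold.
direct = minkowski(a, c, 0.5)
detour = minkowski(a, b, 0.5) + minkowski(b, c, 0.5)
print(direct, detour)

# With p = 2 (Euclidean) the inequality holds as usual.
euclidean_direct = minkowski(a, c, 2)
euclidean_detour = minkowski(a, b, 2) + minkowski(b, c, 2)
print(euclidean_direct, euclidean_detour)
```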

ivan-marroquin commented 3 years ago

Cool! I just need to be able to install this great package on my Windows workstation (see incident #76 at https://github.com/VarIr/scikit-hubness/issues/67)

ivan-marroquin commented 3 years ago

I tried to use a fractional norm with the following code:

    from skhubness.data import load_dexter
    from skhubness import Hubness

    hub = Hubness(k=10, return_value='all', metric='minkowski',
                  algorithm='hnsw', algorithm_params={'p': 0.1},
                  hubness='local_scaling', random_state=1969, n_jobs=-1)
    hub.fit(X)

which gave the error below:

    Traceback (most recent call last):
      File "", line 1, in
      File "C:\Users\IMarroquin\Downloads\Important_Python_Libraries_VisualBuildTools\scikit-hubness-master\skhubness\analysis\estimation.py", line 283, in fit
        raise ValueError(f"Unknown metric '{metric}'. "
    ValueError: Unknown metric 'minkowski'. Must be one of ['euclidean', 'cosine', 'precomputed'].

So for hubness analysis, the package supports only three metrics: 'euclidean', 'cosine', and 'precomputed'.

VarIr commented 3 years ago

The next version v0.30 will see compatibility with sklearn's KNeighborsTransformer. Since PyNNDescent ships with its own wrapper to act as a KNeighborsTransformer, there is no need to roll an additional implementation of that. We can, thus, consider PyNNDescent supported.
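As a rough illustration of the interface mentioned above, the sketch below uses only scikit-learn: a `KNeighborsTransformer` produces a sparse neighbor graph, and a downstream estimator consumes it with `metric="precomputed"`. The data is synthetic and made up for this example. Any transformer that follows the same contract, such as the one shipped with PyNNDescent, should be usable in the first pipeline step, though that substitution is an assumption and is not shown here.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

# Synthetic data, for illustration only.
rng = np.random.RandomState(0)
X = rng.rand(60, 8)
y = (X[:, 0] > 0.5).astype(int)

n_neighbors = 5

# Step 1 on its own: a sparse graph of neighbor distances.
# In "distance" mode scikit-learn stores one extra neighbor per row
# (the sample itself), so each row has n_neighbors + 1 entries.
transformer = KNeighborsTransformer(
    n_neighbors=n_neighbors, mode="distance", metric="manhattan"
)
graph = transformer.fit_transform(X)

# Steps 1 and 2 together: the classifier never sees the raw features,
# only the precomputed sparse neighbor graph.
model = make_pipeline(
    KNeighborsTransformer(
        n_neighbors=n_neighbors, mode="distance", metric="manhattan"
    ),
    KNeighborsClassifier(n_neighbors=n_neighbors, metric="precomputed"),
)
model.fit(X, y)
predictions = model.predict(X)
print(graph.shape, predictions.shape)
```

The design point is that the neighbor search and the estimator are decoupled, so swapping the search backend does not require a dedicated wrapper in the consuming package.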

VarIr commented 3 years ago

Closing, as the original list of wrappers has been dealt with. Users with requests for any other approximate neighbor wrappers: please open a new issue for each algorithm separately. This helps me keep an overview of open tasks. Thank you.