VarIr / scikit-hubness

A Python package for hubness analysis and high-dimensional data mining
BSD 3-Clause "New" or "Revised" License

Using hnsw and lsh as classifier #86

Closed QueSabz closed 1 year ago

QueSabz commented 2 years ago

How can I use both methods for data classification rather than just retrieving nearest neighbours? Or should I just use it like this: kNN = KNeighborsClassifier(n_neighbors=50, algorithm=m), where m can be hnsw or lsh?

VarIr commented 2 years ago

I assume you want to compare the two methods with respect to your classification task? In this case, I would approach this exactly the way you suggested. Simply create two KNeighborsClassifiers, one with algorithm="hnsw" and one with algorithm="lsh".

Note that v0.30 is currently under development. It's not yet finished, but it should be possible to install it from the main branch here. It integrates nicely with recent versions of scikit-learn that provide the KNeighborsTransformer. That is, you can precompute k-neighbors graphs with various (approximate) neighbor methods with scikit-hubness, and feed the results to any scikit-learn function that works on such graphs, including classification.
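
A minimal sketch of the precomputed-graph workflow described above, using only scikit-learn. It assumes scikit-learn's own exact KNeighborsTransformer as a stand-in for an approximate method, because the thread does not show the scikit-hubness transformer API; the synthetic data and variable names are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier, KNeighborsTransformer
from sklearn.pipeline import make_pipeline

# Synthetic data, for illustration only.
X, y = make_classification(n_samples=300, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

n_neighbors = 5
pipeline = make_pipeline(
    # Step 1: compute a sparse k-neighbors distance graph (exact search here).
    KNeighborsTransformer(n_neighbors=n_neighbors, mode="distance"),
    # Step 2: classify from the precomputed graph instead of raw features.
    KNeighborsClassifier(n_neighbors=n_neighbors, metric="precomputed"),
)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"accuracy: {accuracy:.3f}")
```

Under this assumption, switching to an approximate method would mean replacing only the first pipeline step with a transformer that produces the same kind of sparse graph.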

QueSabz commented 2 years ago

I have installed it and it seems to be working, but when I import it in my notebook I get the following warnings: [screenshot: 2022-03-14 12_21_52-SpeedUp Methods - Jupyter Notebook]

VarIr commented 2 years ago

These are known issues of an old scikit-learn version when used with newer numpy versions. They do not cause any errors and you may ignore them, or even filter them away:

import warnings

with warnings.catch_warnings():
    warnings.filterwarnings("ignore", category=DeprecationWarning)
    # Put all your sklearn-related imports inside this block, for example:
    from sklearn.neighbors import KNeighborsClassifier

These warnings will also go away with scikit-hubness v0.30, which works with scikit-learn versions > 0.21.

QueSabz commented 2 years ago

How do I upgrade to scikit-hubness v0.30? I tried using the following command in Anaconda: pip install -U scikit-hubness, but I got the following: [screenshot: 2022-03-14 13_03_32-C__Windows_system32_cmd exe]

VarIr commented 2 years ago

A note of caution first: v0.30 is experimental at this point and might have all kinds of problems as is. If you still want to try it, I suggest creating a new conda environment and installing it there.

# Install setup requirements
python3 -m pip install --upgrade pip
python3 -m pip install setuptools wheel pybind11

# Install approx. neighbors libraries
python3 -m pip install --no-binary :all: nmslib
python3 -m pip install annoy 
# any other ANN packages you'd like to use, e.g., ngt, puffinn

# Install scikit-hubness from main branch
python3 -m pip install git+https://github.com/VarIr/scikit-hubness.git

QueSabz commented 2 years ago

[screenshot: 2022-03-14 13_46_59-C__Windows_system32_cmd exe]

VarIr commented 2 years ago

Can you please try without the [ann] part at the end? Perhaps this does not work when installing directly from GitHub. The only difference would be whether other packages like nmslib or annoy are installed automatically at the same time. To use HNSW, you would then also have to pip install nmslib.

QueSabz commented 2 years ago

I have done all the steps in a new environment as suggested, but when importing from skhubness I get the following error message: [screenshot]

VarIr commented 2 years ago

From v0.30, scikit-hubness does not provide its own implementation of KNeighborsClassifier anymore. Instead, you can use the one from scikit-learn. The workflow will be as outlined here. For example, you can create a pipeline of NMSlibTransformer and KNeighborsClassifier.
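
To illustrate what such a pipeline needs from its first step, here is a hedged sketch of a custom transformer that returns a sparse neighbor graph for scikit-learn's KNeighborsClassifier. ExactGraphTransformer is a hypothetical stand-in that uses exact search. It is not the NMSlibTransformer API, which this thread does not show, and the data is synthetic.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.datasets import make_classification
from sklearn.metrics import pairwise_distances
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline


class ExactGraphTransformer(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for an approximate-neighbors transformer."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y=None):
        # An ANN library would build its index here; this sketch stores the data.
        self.X_fit_ = np.asarray(X, dtype=float)
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        k = self.n_neighbors + 1  # one extra: a training point is its own neighbor
        all_dist = pairwise_distances(X, self.X_fit_)
        ind = np.argsort(all_dist, axis=1)[:, :k]
        dist = np.take_along_axis(all_dist, ind, axis=1)
        # CSR layout: exactly k stored distances per row, sorted ascending.
        indptr = np.arange(0, X.shape[0] * k + 1, k)
        return csr_matrix(
            (dist.ravel(), ind.ravel(), indptr),
            shape=(X.shape[0], self.X_fit_.shape[0]),
        )


X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

pipeline = make_pipeline(
    ExactGraphTransformer(n_neighbors=5),
    KNeighborsClassifier(n_neighbors=5, metric="precomputed"),
)
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
graph = pipeline.steps[0][1].transform(X_test)
```

The classifier only ever sees the graph, so the neighbor search behind the transformer can be exact or approximate without changing the second step.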

QueSabz commented 2 years ago

I am not well versed in creating pipelines. Could you show how this is done for the lsh and HNSW classification methods? I really want to compare them in terms of time complexity and accuracy. For example, I have already used the exact methods (brute force, kd tree, and ball tree) in this manner: [screenshot]

VarIr commented 2 years ago

I would in this case suggest you stick with the old scikit-hubness version for the time being, because I don't really have any tested example code for the new version at this point. It will take a while until I manage to update documentation and examples.

QueSabz commented 2 years ago

Okay, thanks. So with the old version there is no problem using both lsh and hnsw as classifiers? I can have something like this: models = ['brute','kd_tree', 'ball_tree', 'lsh', 'hnsw']

VarIr commented 2 years ago

Yes, this should work.
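
A rough sketch of such a comparison loop, timing fit and scoring for each algorithm. It assumes scikit-learn's KNeighborsClassifier and therefore covers only the exact methods. According to this thread, 'lsh' and 'hnsw' are accepted by the KNeighborsClassifier of the old scikit-hubness version, which would have to be imported instead. The synthetic data and the results dictionary are illustrative assumptions.

```python
import time

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
# Assumption: scikit-learn's classifier, which only knows the exact algorithms.
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

models = ["brute", "kd_tree", "ball_tree"]
results = {}
for algorithm in models:
    knn = KNeighborsClassifier(n_neighbors=50, algorithm=algorithm)

    start = time.perf_counter()
    knn.fit(X_train, y_train)
    fit_seconds = time.perf_counter() - start

    start = time.perf_counter()
    accuracy = knn.score(X_test, y_test)
    score_seconds = time.perf_counter() - start

    results[algorithm] = {
        "accuracy": accuracy,
        "fit_seconds": fit_seconds,
        "score_seconds": score_seconds,
    }

for algorithm, stats in results.items():
    print(algorithm, stats)
```

Wall-clock timings on a small dataset are noisy, so repeating each measurement several times gives a fairer comparison.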

QueSabz commented 2 years ago

[screenshot: 2022-03-17 16_17_49-Approximate Nearest Neighbour - Jupyter Notebook]

Is there a way of installing and importing the puffinn package? I want to use it for the lsh method. I installed it using: pip install puffin. But when I run my notebook I still get the error above.

This is the error I get. I tried following the steps and it still failed:

ImportError: Please install the puffinn package, before using this class:
$ git clone github.com/puffinn/puffinn.git
$ cd puffinn
$ python3 setup.py build
$ pip install .

VarIr commented 2 years ago

puffinn (note the double-n) is not available from PyPI. puffin is an unrelated package. Did you get any warnings/error messages from following the manual puffinn installation steps? If not, did you install it in the same conda environment you are using in this Jupyter notebook?
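
One way to check the environment question from inside the notebook is sketched below, using only the standard library. The helper name describe_package is a hypothetical choice. It reports whether the running interpreter can find a package and where it would be loaded from.

```python
import importlib.util
import sys


def describe_package(name):
    """Report whether the running interpreter can find a top-level package."""
    spec = importlib.util.find_spec(name)
    return {
        "package": name,
        "installed": spec is not None,
        "location": getattr(spec, "origin", None),
    }


# The interpreter path shows which environment the notebook kernel really uses.
print("Interpreter:", sys.executable)
print(describe_package("puffinn"))
```

If the interpreter path does not point into the conda environment where the package was built, the notebook kernel is running in a different environment.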

QueSabz commented 2 years ago

[screenshot] I followed the instructions from this page, and hnsw is working perfectly with nmslib installed, but it's a different story for lsh.

QueSabz commented 2 years ago

Even the falconn package has some issues when I try to install it using: pip install FALCONN==1.3.1 [screenshot]

VarIr commented 2 years ago

I only now realized you are on Windows. Unfortunately, falconn does not support Windows, and neither does puffinn. So LSH is only available on Linux or macOS.

QueSabz commented 2 years ago

Do you have any idea how I can modify the following annoy wrapper class, or add a scoring method to it, so that it can fully function as a classifier?

import annoy
from collections import Counter
from sklearn.base import BaseEstimator

# Wrapper for using annoy.AnnoyIndex as a scikit-learn-style k-NN classifier


class Annoy(BaseEstimator):
    def __init__(self, n_neighbors=5, metric='euclidean', n_trees=10):
        self.n_neighbors = n_neighbors
        # Stored under the parameter's own name so that get_params() works
        self.metric = metric
        self.n_trees = n_trees

    def fit(self, X_train, y_train):
        self.N_feat = X_train.shape[1]
        self.N_train = X_train.shape[0]
        self.y_train = y_train
        self.t = annoy.AnnoyIndex(self.N_feat, metric=self.metric)
        # Add every training vector to the index under its row number
        for i, v in enumerate(X_train):
            self.t.add_item(i, v)
        self.t.build(self.n_trees)
        return self

    def predict(self, X_test):
        y_pred = []
        for tv in X_test:
            nn_inds = self.t.get_nns_by_vector(tv, self.n_neighbors)
            nn_classes = [self.y_train[nn] for nn in nn_inds]
            # Majority vote among the retrieved neighbors
            y_pred.append(most_frequent(nn_classes))
        return y_pred


def most_frequent(items):
    # Return the most common label in the list
    occurrence_count = Counter(items)
    return occurrence_count.most_common(1)[0][0]

VarIr commented 2 years ago

You could try inheriting from the ClassifierMixin which provides the score() method. Or explicitly write the method. If you take a look at the linked code, it's literally two lines of code.
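
A minimal sketch of the mixin approach. To stay runnable without annoy installed, it assumes a hypothetical BruteForceKNN class with exact NumPy search in place of the Annoy index. The point is only that inheriting from scikit-learn's ClassifierMixin supplies score(), which reports mean accuracy.

```python
from collections import Counter

import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class BruteForceKNN(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in for the Annoy wrapper, using exact search."""

    def __init__(self, n_neighbors=5):
        self.n_neighbors = n_neighbors

    def fit(self, X, y):
        self.X_train_ = np.asarray(X, dtype=float)
        self.y_train_ = np.asarray(y)
        self.classes_ = np.unique(self.y_train_)
        return self

    def predict(self, X):
        y_pred = []
        for x in np.asarray(X, dtype=float):
            # Exact Euclidean distances to all training points.
            dists = np.linalg.norm(self.X_train_ - x, axis=1)
            nn_inds = np.argsort(dists)[: self.n_neighbors]
            # Majority vote among the nearest neighbors.
            votes = Counter(self.y_train_[nn_inds].tolist())
            y_pred.append(votes.most_common(1)[0][0])
        return np.asarray(y_pred)


# Tiny, well-separated toy data (illustrative only).
X = np.array([[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]])
y = np.array([0, 0, 0, 1, 1, 1])

clf = BruteForceKNN(n_neighbors=3).fit(X, y)
accuracy = clf.score(X, y)  # score() is inherited from ClassifierMixin
```

Writing the method explicitly would amount to calling sklearn.metrics.accuracy_score on y and self.predict(X).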

QueSabz commented 2 years ago

Right now I am using Ubuntu, but when I install the scikit-hubness package it still breaks while building and installing the falconn package. Are there any specific steps/commands I need to follow when using Ubuntu?

VarIr commented 2 years ago

Please post the error message you get.

QueSabz commented 2 years ago

[screenshot: Screenshot from 2022-04-01 13-17-26]

QueSabz commented 2 years ago

[screenshot: Screenshot from 2022-04-01 13-21-09]

QueSabz commented 2 years ago

[screenshot: Screenshot from 2022-04-01 13-23-51]

VarIr commented 2 years ago

I quickly googled these error messages and some reports on insufficient memory popped up. But this is a wild guess. I fear I cannot give more support on falconn compilation bugs. You may try to reach out to the falconn devs about this. (Note, however, that falconn has seen no development for over 4 years, for which reason I am dropping falconn support completely with v0.30).

VarIr commented 1 year ago

With last activity in this thread nearly one year ago, I'll close this now. Feel free to open a new issue, if further questions arise.