beringresearch / ivis

Dimensionality reduction in very large datasets using Siamese Networks
https://beringresearch.github.io/ivis/
Apache License 2.0
330 stars 43 forks

Extremely slow extraction of KNN neighbours on 100k samples #117

Closed adavoudi closed 2 years ago

adavoudi commented 2 years ago

I'm using ivis[cpu] on a dataset of about 100k samples with around 200k sparse features. My training dataset is stored in an h5 file and I use the following code to fit and transform the dataset:

import h5py
import pandas as pd
from ivis import Ivis

with h5py.File(filename, 'r') as f:
    X = f['data']
    Y = pd.Categorical(meta_df["label"]).codes
    model = Ivis(epochs=5, k=15)
    model.fit(X, Y, shuffle_mode='batch')  # Shuffle batches when using h5 files

    embeddings = model.transform(X)

However, it takes so long:

Building KNN index
100%|██████████| 105942/105942 [55:07<00:00, 32.03it/s]
Extracting KNN neighbours
  0%|          | 262/105942 [7:16:38<2935:20:19, 99.99s/it]

2935 hours!! Am I missing something, or is this expected? Should I switch to GPU?

By the way, I'm running on a Google Colab instance with 8 CPU cores, 50 GB of RAM, and an SSD.

idroz commented 2 years ago

Thanks for raising this - it certainly shouldn't take that long. Extracting KNNs from similarly sized datasets typically takes 2-3 minutes on a machine with your specs.

It looks like the bottleneck might be reading the h5 file. I'm not sure how Google Colab is set up, but if the SSD is a network drive, I/O might be limited by network speed. Have you tried running this locally?

@Szubie - do we have anything supporting H5 files with ivis.data.sequence.IndexableDataset class?

adavoudi commented 2 years ago

@idroz Sorry the number of features is 200K, not 20k (I missed a zero in the description)! Does that change your answer?

Szubie commented 2 years ago

The H5 file seems to be working fine - it was read successfully to build the KNN index. For some reason, querying the Annoy index itself appears to be the bottleneck. By default ivis places the Annoy index in the /tmp directory, so a slow /tmp would affect things - but I doubt that's happening here.

It is possible that the number of features (200k) is too high for Annoy to handle cleanly. It claims to work reasonably well up to 1,000 dimensions, although we have used it up to 20k without issues. Playing with some of the KNN retrieval parameters may help a bit. There are two main things to tune with Annoy - search_k and n_trees. To speed up retrieval at the expense of accuracy, try reducing search_k and see if it helps at all; search_k defaults to k * n_trees if not specified.

If Annoy cannot handle the data and the above tweaking doesn't help, you could try two other things: 1) Preprocess the data to reduce the number of dimensions before passing it to ivis, e.g. with PCA. 2) Use some other method of extracting the KNNs and pass a neighbour matrix to Ivis.
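For the first option, a minimal sketch with scikit-learn (toy dimensions standing in for the 100k x 200k sparse matrix; TruncatedSVD is used here because, unlike plain PCA, it accepts sparse input without densifying it):

```python
from scipy import sparse
from sklearn.decomposition import TruncatedSVD

# Toy sparse stand-in for the dataset in the issue.
X = sparse.random(500, 1000, density=0.01, format='csr', random_state=0)

# Reduce to a few hundred dense components before handing off to ivis.
svd = TruncatedSVD(n_components=50, random_state=0)
X_reduced = svd.fit_transform(X)

# model = Ivis(epochs=5, k=15)
# model.fit(X_reduced, Y)
```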

Pursuing the second option would let you use any KNN retrieval method to get the neighbour indices. For example you might use faiss to retrieve the neighbour indices using a method that scales better and then pass in the results to Ivis which would then use them for training the neural network.
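A sketch of the second option using scikit-learn's brute-force KNN in place of faiss (the `neighbour_matrix` hand-off to Ivis is shown as a comment; parameter name taken from the ivis docs):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.random((200, 32)).astype(np.float32)

# Retrieve k+1 neighbours so each point appears as its own first
# neighbour, matching what a self-query of the training set returns.
k = 15
nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
neighbour_matrix = nn.kneighbors(X, return_distance=False)

# Hypothetical hand-off, skipping ivis's internal Annoy index build:
# model = Ivis(epochs=5, k=15, neighbour_matrix=neighbour_matrix)
# model.fit(X)
```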

adavoudi commented 2 years ago

Thanks for your thorough answer! The problem was, as you said, the Annoy library. I reduced the dimensionality of my data and now it works perfectly.