elixir-nx / scholar

Traditional machine learning on top of Nx
Apache License 2.0

Alternatives to kNN + t-SNE #207

Closed by josevalim 7 months ago

josevalim commented 11 months ago

We don't need to implement all options below but enough to have some efficient alternatives to the linear/brute force versions we have today. My suggestion is to pick whatever you find to be simplest.

Alternatives to kNN:

Alternatives to tSNE:

Bonus:

krstopro commented 10 months ago

I am still working on this, but everything I implement is just too slow. One of the obstacles I encountered is removing duplicates among neighbor indices. This matters for neighbor expansion (finding neighbors of neighbors; this is the idea behind NNDescent, though it works fine just by itself - see the EFANNA paper). It boils down to taking the first k distinct elements from a tensor. Looping with while is just too slow, or I am not doing it correctly. Is there a faster way to do this with Nx?

josevalim commented 10 months ago

I would have to see the algorithm in order to give feedback, but realistically speaking, I would only be able to look into this in 3 weeks (going on holidays).

krstopro commented 10 months ago

@josevalim The idea is very simple. Suppose we have a candidate graph, i.e. a matrix of shape n x k where the i-th row contains candidates for the k nearest neighbors of x[i] (the i-th point in the dataset). Then we can obtain the neighbors of neighbors just by doing Nx.take(graph, graph). These can be concatenated with the original neighbors (k^2 + k columns) and sorted by the distance from x[i]. However, there can be duplicates, and they should be eliminated by taking the first k distinct values; this should be done in parallel for every row i. I don't see a way to do this efficiently with Nx. So far I have vectorized over i and then used a while loop to skip values I have already encountered, but it seems too slow.
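A rough sketch of the expansion-and-sort step (illustrative names only: x is the {n, d} dataset, graph the {n, k} candidate matrix, squared Euclidean distance assumed; the result still contains the duplicates in question):

defmodule ExpansionSketch do
  import Nx.Defn

  defn expand_candidates(x, graph) do
    {n, k} = Nx.shape(graph)

    # Neighbors of neighbors, {n, k, k}, flattened and concatenated with
    # the original candidates -> {n, k + k^2} columns.
    expanded =
      Nx.concatenate(
        [graph, Nx.reshape(Nx.take(graph, graph), {n, k * k})],
        axis: 1
      )

    # Squared Euclidean distance from each point to all of its candidates.
    candidates = Nx.take(x, expanded)
    diff = Nx.subtract(candidates, Nx.new_axis(x, 1))
    dist = Nx.sum(Nx.multiply(diff, diff), axes: [-1])

    # Candidates sorted by distance per row; taking the first k distinct
    # entries of each row is the missing (slow) part.
    order = Nx.argsort(dist, axis: 1)
    Nx.take_along_axis(expanded, order, axis: 1)
  end
end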

josevalim commented 10 months ago

My suggestion for now is to question the assumption that you really need to remove the duplicates. In this algorithm here, the updateNN operation is idempotent, so you could call it on duplicates without losing or duplicating information.

jonatanklosko commented 10 months ago

It boils down to taking first k distinct elements from a tensor.

@krstopro

  1. Does it have to be the first k going left-to-right, or can it be any subset (specifically, a sorted one)?
  2. Is it guaranteed that there are k distinct elements in the tensor? If not, what should happen?
  3. Is the subset of k elements the final result, or is there some immediate operation we want to do with these elements?

krstopro commented 10 months ago

It boils down to taking first k distinct elements from a tensor.

@krstopro

  1. Does it have to be the first k going left-to-right, or can it be any subset (specifically, a sorted one)?
  2. Is it guaranteed that there are k distinct elements in the tensor? If not, what should happen?
  3. Is the subset of k elements the final result, or is there some immediate operation we want to do with these elements?

@jonatanklosko

  1. The idea is to argsort the points by their distance from the i-th point and take the k closest. Therefore, it should be the first k going left-to-right. However, some points can appear multiple times, as several neighbors of i can have the same point as a neighbor.
  2. Yes, we are just expanding the neighbors. We look at the candidates for the k nearest neighbors of the i-th point and take their neighbors to update the candidate set. There will always be k distinct points, as we are just adding k^2 points to the tensor.
  3. Yes, that should be the final result. The entire procedure should be repeated over and over again though.

Here is a nice video explaining it concisely.

jonatanklosko commented 10 months ago

@krstopro

defmodule Utils do
  import Nx.Defn

  @doc """
  Gets `n` first unique elements from the tensor.

  There must be at least `n` unique elements, otherwise this function
  returns all unique elements and then duplicate elements in the same
  order as in the original tensor.

  ## Examples

      iex> Utils.uniq(Nx.tensor([4, 2, 4, 3, 2, 5, 1, 2]), 5)
      #Nx.Tensor<
        s64[5]
        [4, 2, 3, 5, 1]
      >

      iex> Utils.uniq(Nx.tensor([4, 2, 4, 3, 2, 5, 1, 2]), 7)
      #Nx.Tensor<
        s64[7]
        [4, 2, 3, 5, 1, 4, 2]
      >

  """
  deftransform uniq(tensor, n) do
    uniq_n(tensor, n: n)
  end

  defnp uniq_n(tensor, opts \\ []) do
    n = opts[:n]

    {length} = Nx.shape(tensor)

    sorted_idx = Nx.argsort(tensor, stable: true)
    sorted = Nx.take(tensor, sorted_idx)

    duplicate_mask =
      Nx.concatenate([
        # The first element is not a duplicate
        Nx.tensor([0]),
        # Check if an element is equal to its predecessor
        Nx.equal(sorted[1..-1//1], sorted[0..-2//1])
      ])

    # Build a permutation that moves duplicate elements to the end
    permutation =
      Nx.iota({length})
      |> Nx.indexed_add(
        Nx.new_axis(sorted_idx, -1),
        Nx.select(duplicate_mask, length, 0)
      )
      |> Nx.argsort()

    Nx.take(tensor, permutation[0..(n - 1)//1])
  end
end

josevalim commented 10 months ago

@krstopro do you already have some code to share? I will be flying today and I may have some time for coding, so maybe I can play with it. :) No worries if you don't.

krstopro commented 10 months ago

@josevalim Still on it. Right now it seems to be blowing up the memory and the code is quite dirty. I should also change some of the methods used (i.e. use a different version of LSH). I might upload something soon though.

msluszniak commented 10 months ago

I've been researching approximate kNN with EFANNA, LSH, and many others (FLANN, GNNS, kGraph, IEH, Annoy from Spotify). All of these implementations seem to be more complicated than NNDescent, so I think it would be better to first try NNDescent and then progress to other algorithms. I also dug into a scikit-learn discussion from 2017, and it turned out that sklearn had ANN 6 years ago but eventually deprecated it because it wasn't faster than the exact implementation: https://github.com/scikit-learn/scikit-learn/issues/8996. So I'm a little bit concerned that even if we implement everything the best we can in Nx, the ANN might still not be much faster than an exact solution. However, the TriMap and PaCMAP implementations seem to be pretty straightforward to achieve in Nx :))

krstopro commented 10 months ago

Alright, I think I finally made some progress. The code is available here: https://gist.github.com/krstopro/0e20ea7f5ec21f6243bde2b316b1c649. @josevalim sorry it took me way longer than I expected. I changed my approach several times, as most of the stuff I tried didn't work in the end. Hope you're enjoying your holiday at this point.

msluszniak commented 10 months ago

@krstopro do you know some reasonable values for num_trees and num_iters, so I can benchmark it and compare it to the exact kd-tree?

krstopro commented 10 months ago

@msluszniak num_iters should be small; 1 should work for most cases. num_trees should depend on the dataset size. I am still tuning it, but something like $n^{1/4}$ might work, where $n$ is the dataset size.

krstopro commented 10 months ago

@msluszniak Updated the code slightly. Looking at nearest-neighbor search with the kd-tree: can this part be vectorized with Nx.vectorize, given that query_one_point/3 is within defn? https://github.com/elixir-nx/scholar/blob/8826219bfa978aaf360971b4a4efb6dd6ac19b68/lib/scholar/neighbors/kd_tree.ex#L414-L420

msluszniak commented 10 months ago

Yes, you're right, we should vectorize this part of the code.
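For reference, the general pattern would be something like the toy example below, with a brute-force single-point query standing in for query_one_point/3 (illustrative sketch only, not the Scholar code):

defmodule VectorizeSketch do
  import Nx.Defn

  # Nearest neighbour of a single query point, written inside defn.
  defn nearest(data, point) do
    diff = Nx.subtract(data, point)

    diff
    |> Nx.multiply(diff)
    |> Nx.sum(axes: [1])
    |> Nx.argmin()
  end
end

data = Nx.iota({5, 2})
queries = Nx.tensor([[0, 1], [8, 9]])

# Vectorize the leading axis of the queries and call the defn once;
# the single-point body is then applied to every query.
queries
|> Nx.vectorize(:queries)
|> then(&VectorizeSketch.nearest(data, &1))
|> Nx.devectorize()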

krstopro commented 9 months ago

I am almost ready to push. There is one thing I would like to agree on: should the methods for constructing the k-NN graph be separated from those used for classification and regression (e.g. put in a separate module or even a separate folder)?

josevalim commented 9 months ago

Don't worry about the naming, it will be easy to bikeshed on it once we see the code and functions it requires!

josevalim commented 9 months ago

Hi @krstopro! Just so we can plan on our side, I assume you are going to submit LargeVis, as you have almost all of it ready. Do you want us to look at NNDescent? Or do you have it on your plate too?

You have been doing amazing work, so we don't want to steal your thunder. There is no rush either. Just coordinating. :) Happy new year! 🎉 🍾

krstopro commented 9 months ago

Hi @krstopro! Just so we can plan on our side, I assume you are going to submit LargeVis, as you have almost all of it ready.

Yes, you can leave it to me.

Do you want us to look at NNDescent? Or do you have it on your plate too?

I didn't think of implementing it, as the k-NN method from LargeVis might be sufficient for implementing other dimensionality reduction methods (t-SNE, TriMap, PaCMAP, etc.).

You have been doing amazing work, so we don't want to steal your thunder. There is no rush either. Just coordinating. :) Happy new year! 🎉 🍾

Thanks to you, @msluszniak, and @jonatanklosko for all the support. I hope to push the code soon. Happy new year as well!

msluszniak commented 9 months ago

@krstopro do you think then that we don't need to implement Annoy in order to implement TriMap and PaCMAP?

Happy new year! :)

krstopro commented 9 months ago

@msluszniak

@krstopro do you think then that we don't need to implement Annoy in order to implement TriMap and PaCMAP?

Quite possibly. Annoy is also based on random hyperplane trees, but they are constructed in a different way: in every node, two points are picked at random and separated by a hyperplane; points on one side of the hyperplane go into the left subtree, points on the other side go into the right subtree. I am not sure, but I think the main reason for constructing trees this way is to avoid computing the median. In the current Scholar implementation, points are projected onto a random hyperplane and the median is calculated; points with a projection lower than the median go into the left subtree, points with a projection higher than the median go into the right subtree. This way the trees remain balanced and can be constructed and queried within defn. Again, I am not sure, but I wouldn't expect the recall to differ much between the two approaches. Besides that, Annoy does not refine the k-NN graph using NN-expansion as done in LargeVis; this can improve performance substantially.
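For illustration, the median split used in the current Scholar-style trees could be sketched roughly like this (simplified, not the actual Scholar code):

defmodule MedianSplitSketch do
  import Nx.Defn

  # Project the points onto a random direction and split at the median of
  # the projections: 0 -> left subtree, 1 -> right subtree.
  defn median_split(points, key) do
    {_n, dim} = Nx.shape(points)

    {direction, _key} = Nx.Random.normal(key, shape: {dim})
    projections = Nx.dot(points, direction)
    median = Nx.median(projections)

    # Both halves have (almost) the same size, so the tree stays balanced.
    Nx.greater(projections, median)
  end
end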

It is also questionable how high the recall must be for the dimensionality reduction methods to work properly. From the FIt-SNE paper:

"A recent theoretical advance by Linderman et al. (2017) can be used to optimize this step: it suggests that connecting every point to its (for example) k = 100 nearest neighbors is not more effective than connecting every point to 2 randomly chosen points out of its 100 nearest neighbors. The main reason is that this randomized procedure, when executed on point clouds lying on manifolds, creates expander graphs at the local scale which represent the local geometry accurately at a slightly coarser level. ... In practice, we use this result to justify the replacement of nearest neighbors with approximate nearest neighbors. Specifically, we compute approximate nearest neighbors using a randomized nearest neighbor method called ANNOY (Bernhardsson (2017)), as we expect the resulting “near neighbors” to capture the local geometry at least as effectively as the same number of nearest neighbors."

I would still need to see it working to be convinced though.

Happy new year! :)

Same, all the best for you too!

msluszniak commented 9 months ago

Ok, I understand. Thanks :)

josevalim commented 8 months ago

Now with NNDescent in review, I have updated the list above. I have removed UMAP because its implementation requires sparse matrices and it seems to be the most complex of the three to implement. TriMap has a JAX implementation, so it should hopefully be the most accessible of the three. In case it helps anyone, here is a conversation with ChatGPT about this (I have validated the claims I found important, but you should double-check anything mentioned).

josevalim commented 7 months ago

Beautiful work. I have created smaller issues for the remaining parts #237 #238. :)