Closed · BIOINSu closed this 5 years ago
Hi @BIOINSu, it looks like you want to pre-calculate the nearest neighbors of all data points and store them in memory for later retrieval. If that is the case, you don't really need ANNs. ANN algorithms are meant for situations where there is not enough memory to store a large array of neighbors, or where the query point is not known in advance, so the nearest neighbors cannot be pre-calculated. Thus, the `query` function in the `LSH` class queries only one point at a time. That said, if you still want to use an ANN algorithm for your application, the iterative call, as you said, will be the fastest. In the past, I tried to implement a multi-point query using TensorFlow and some fancier array broadcasting (the code for this is not included in the repo), but the memory overhead of those operations seems to negate any advantage gained from GPU processing, so the current CPU version with numpy turns out to be faster. I hope this helps.
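If the neighbor table does fit in memory, the pre-calculation suggested above can be done exactly with numpy broadcasting. Here is a minimal sketch; the function name `precompute_knn` and the Euclidean-distance choice are my own illustration, not part of this repo:

```python
import numpy as np

def precompute_knn(data, k):
    # Pairwise squared Euclidean distances via broadcasting.
    # For n points this needs O(n^2) memory, which is exactly the
    # trade-off that ANN methods are designed to avoid.
    sq = np.sum(data ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T
    np.fill_diagonal(d2, np.inf)  # exclude self-matches

    # argpartition finds the k smallest entries per row without a full sort,
    # then we order just those k by actual distance.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    rows = np.arange(data.shape[0])[:, None]
    order = np.argsort(d2[rows, idx], axis=1)
    return idx[rows, order]  # shape (n, k): neighbor indices per point
```

This trades memory for speed: one matrix product replaces n separate queries.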
Thanks for the reply! But I am still confused: in your paper, you describe that the DenseFly algorithm also needs to create low-dimensional pseudo-hash bins. However, in your code, the `query` function in the `LSH` class simply uses the high-dimensional hash. Does it mean that for the DenseFly algorithm I need to call the `create_highd_bins` and `create_lowd_bins` functions in the `flylsh` class manually, and then call `query_lowd_bins` and `query_highd_bins` afterwards to finish the whole query process? Thank you!
That's a good question, and it arises from my failure to document the code properly. No, you don't need to implement these functions yourself, as they are already implemented in the code. The class inheritance works as follows: `LSH` --> `flylsh` --> `denseflylsh`. So, `flylsh` inherits all the methods of the `LSH` class, and `denseflylsh` inherits all the methods of both `LSH` and `flylsh`. I have already implemented `create_lowd_bins` for `flylsh`, so you can also use it for `denseflylsh`. Please don't use `create_highd_bins`: it is very slow and was created only for a comparison with the low-dimensional bins (if your data is high-dimensional enough, most of the high-dimensional bins will have only one member, so there is no point in going through this process). With this in mind, the suggested workflow is:
```python
data = read_data()  # this can also be done by the Dataset class
dfmodel = denseflylsh(data, hash_length, sampling_ratio, embedding_size)
# note: sampling_ratio is not used internally, so you can provide a dummy value.
# I kept this syntax for easily hot-swapping flylsh and denseflylsh during prototyping.
dfmodel.create_lowd_bins()
nns = dfmodel.query_lowd_bins(qidx, search_radius, order)
```
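To turn the per-point query into the nearest-neighbor network you asked about, one option is to wrap it in a loop over all point indices. The sketch below is generic: `query_fn` stands for a small wrapper you would write around `dfmodel.query_lowd_bins` (with your chosen `search_radius` and `order`), and the function name `build_knn_graph` is hypothetical:

```python
def build_knn_graph(n_points, query_fn, k):
    """Build an adjacency dict {point index: list of neighbor indices}.

    query_fn(i) is assumed to return an iterable of candidate neighbor
    indices for point i, e.g. a wrapper around dfmodel.query_lowd_bins.
    """
    graph = {}
    for i in range(n_points):
        # drop self-matches and keep at most k neighbors per point
        neighbors = [j for j in query_fn(i) if j != i][:k]
        graph[i] = neighbors
    return graph
```

Since each iteration is independent, this loop is also easy to parallelize with `multiprocessing` if the dataset is large.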
Let me know if you have any more questions.
Thanks! It worked perfectly. I still have one small question: what will happen if a new vector that is not in the original `inputs_` comes in? How does DenseFly deal with a query it hasn't seen before? (Does it mean that this new vector would be projected using the same `self.weights`? But what would happen after that?) Thank you!
At the moment, I have not implemented the functionality to query points outside the `inputs_` array. The way to do it would be to calculate the L1 distance from the hash of the query point to the hash of each low-dimensional bin, and assign the query bin to be the bin with the lowest distance. Let me know if you would like me to add this function.
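A rough sketch of that idea, where everything (the function name, the `hash_fn` argument, the argument layout) is my own assumption rather than the repo's actual API:

```python
import numpy as np

def query_new_point(x, weights, bin_hashes, hash_fn):
    """Assign an out-of-sample vector to its nearest low-dimensional bin.

    x          : new query vector (not in inputs_)
    weights    : the model's projection matrix (i.e. self.weights)
    bin_hashes : array of low-d bin hashes, shape (n_bins, hash_length)
    hash_fn    : converts a projection into a binary hash the same way
                 the model hashed its training points (assumed)
    """
    h = hash_fn(x @ weights)              # project and hash the new vector
    dists = np.abs(bin_hashes - h).sum(axis=1)  # L1 distance to each bin hash
    return int(np.argmin(dists))          # index of the closest bin
```

Once the closest bin is found, its members can be ranked by distance to `x` exactly as in the existing in-sample query.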
Thank you so much for answering so many of my questions! You are really nice! Issue closed.
Dear author, I am trying to use DenseFly to find the k nearest neighbors of each point in order to construct a nearest-neighbor network. May I ask whether you have any suggestions for constructing this network efficiently? Or should I just iteratively use the `query` function in the `LSH` class? Thank you so much!