Closed · BIOINSu closed this 5 years ago
Hi @BIOINSu, it looks like you want to pre-calculate the nearest neighbors of all data points and store them in memory for later retrieval. If that is the case, you don't really need ANNs. ANN algorithms are meant for situations where there is not enough memory to store a large array of neighbors, or where the query point is not known in advance, so the nearest neighbors cannot be pre-calculated. Thus, the `query` function in the `LSH` class queries only one point at a time. That said, if you still want to use an ANN algorithm for your application, the iterative call, as you said, will be the fastest. In the past, I tried to implement a multi-point query using TensorFlow and some fancier array broadcasting (the code for this is not included in the repo), but the memory overhead of those operations seems to negate any advantage gained from GPU processing, so the current CPU version with numpy turns out to be faster. I hope this helps.
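If the neighbor table does fit in memory, the pre-calculation suggested above can be done exactly with numpy broadcasting. Here is a minimal sketch; the function name `precompute_knn` and the Euclidean-distance choice are my own illustration, not part of this repo:

```python
import numpy as np

def precompute_knn(data, k):
    # Pairwise squared Euclidean distances via broadcasting.
    # For n points this needs O(n^2) memory, which is exactly the
    # trade-off that ANN methods are designed to avoid.
    sq = np.sum(data ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * data @ data.T
    np.fill_diagonal(d2, np.inf)  # exclude self-matches

    # argpartition finds the k smallest entries per row without a full sort,
    # then we order just those k by actual distance.
    idx = np.argpartition(d2, k, axis=1)[:, :k]
    rows = np.arange(data.shape[0])[:, None]
    order = np.argsort(d2[rows, idx], axis=1)
    return idx[rows, order]  # shape (n, k): neighbor indices per point
```

This trades memory for speed: one matrix product replaces n separate queries.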
Thanks for the reply! But I am still confused: in your paper, you describe that the DenseFly algorithm also needs to create low-dimensional pseudo-hash bins. However, in your code, the `query` function in the `LSH` class simply uses the high-dimensional hash. Does it mean that for the DenseFly algorithm I need to call the `create_highd_bins` and `create_lowd_bins` functions in the `flylsh` class manually, and then call `query_lowd_bins` and `query_highd_bins` afterwards to finish the whole query process? Thank you!
That's a good question, and it arises from my failure to document the code properly. No, you don't need to implement these functions yourself, as they are already implemented in the code. The class inheritance works as follows: `LSH` --> `flylsh` --> `denseflylsh`. So, `flylsh` inherits all the methods of the `LSH` class, and `denseflylsh` inherits all the methods of both `LSH` and `flylsh`. I have already implemented `create_lowd_bins` for `flylsh`, so you can also use it for `denseflylsh`. Please don't use `create_highd_bins`: it is very slow and was created only for a comparison with the low-dimensional bins (if your data is high-dimensional enough, most of the high-dimensional bins will have only one member, so there is no point in going through this process). With this in mind, the suggested workflow is:
```python
data = read_data()  # this can also be done by the Dataset class
dfmodel = denseflylsh(data, hash_length, sampling_ratio, embedding_size)
# note: sampling_ratio is not used internally, so you can provide a dummy value.
# I kept this syntax for easily hot-swapping flylsh and denseflylsh during prototyping.
dfmodel.create_lowd_bins()
nns = dfmodel.query_lowd_bins(qidx, search_radius, order)
```
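To turn the per-point query into the nearest-neighbor network you asked about, one option is to wrap it in a loop over all point indices. The sketch below is generic: `query_fn` stands for a small wrapper you would write around `dfmodel.query_lowd_bins` (with your chosen `search_radius` and `order`), and the function name `build_knn_graph` is hypothetical:

```python
def build_knn_graph(n_points, query_fn, k):
    """Build an adjacency dict {point index: list of neighbor indices}.

    query_fn(i) is assumed to return an iterable of candidate neighbor
    indices for point i, e.g. a wrapper around dfmodel.query_lowd_bins.
    """
    graph = {}
    for i in range(n_points):
        # drop self-matches and keep at most k neighbors per point
        neighbors = [j for j in query_fn(i) if j != i][:k]
        graph[i] = neighbors
    return graph
```

Since each iteration is independent, this loop is also easy to parallelize with `multiprocessing` if the dataset is large.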
Let me know if you have any more questions.
Thanks! It worked perfectly. I still have one small question: what will happen if a new vector that is not in the original `inputs_` comes in? How does DenseFly deal with a query it hasn't seen before? (Does it mean that this new vector would be projected using the same `self.weights`? But what would happen after that?) Thank you!
At the moment, I have not implemented the functionality to query points outside the `inputs_` array. The way to do it would be to calculate the L1 distance from the hash of the query point to the hash of each low-dimensional bin, and assign the query bin to be the bin with the lowest distance. Let me know if you would like me to add this function.
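A rough sketch of that idea, where everything (the function name, the `hash_fn` argument, the argument layout) is my own assumption rather than the repo's actual API:

```python
import numpy as np

def query_new_point(x, weights, bin_hashes, hash_fn):
    """Assign an out-of-sample vector to its nearest low-dimensional bin.

    x          : new query vector (not in inputs_)
    weights    : the model's projection matrix (i.e. self.weights)
    bin_hashes : array of low-d bin hashes, shape (n_bins, hash_length)
    hash_fn    : converts a projection into a binary hash the same way
                 the model hashed its training points (assumed)
    """
    h = hash_fn(x @ weights)              # project and hash the new vector
    dists = np.abs(bin_hashes - h).sum(axis=1)  # L1 distance to each bin hash
    return int(np.argmin(dists))          # index of the closest bin
```

Once the closest bin is found, its members can be ranked by distance to `x` exactly as in the existing in-sample query.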
Thank you so much for answering so many of my questions! You are really nice! Issue closed.
Dear author, I am trying to use DenseFly to find the k nearest neighbors of each point in order to construct a nearest-neighbor network. May I ask whether you have any suggestions for constructing this network efficiently? Or should I just iteratively use the `query` function in the `LSH` class? Thank you so much!