FALCONN-LIB / FALCONN

FAst Lookups of Cosine and Other Nearest Neighbors (based on fast locality-sensitive hashing)
http://falconn-lib.org/
MIT License

Store only hash tables in memory #57

Open · funtusov opened this issue 8 years ago

funtusov commented 8 years ago

As far as I understand, the underlying algorithm doesn't require storing the dataset in memory, only the generated hash tables.

Given the favorable computational complexity, falconn is very interesting to use with large datasets, but then storing whole datasets in memory becomes either infeasible or very expensive.

All in all, I'm not sure where the requirement of storing the whole dataset comes from – is it an inherent limitation of the method, or has it just not been implemented yet (and if so, are there plans to)?

ludwigschmidt commented 8 years ago

That's a good point. First, just to clarify: FALCONN never makes a copy of the dataset it uses for constructing the table, but expects that the dataset is still accessible for some operations. If you only want to know the candidate set and not the nearest neighbors among the candidates, FALCONN does not require access to the dataset after constructing the tables. Currently we do not explicitly support this feature, but if you only use the get_candidates_with_duplicates() and get_unique_candidates() methods, FALCONN should not try to access the dataset during queries.
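For illustration, here is a minimal sketch of such a candidate-only workflow through the Python wrapper. The two candidate methods are the ones named above; the other entry points (`falconn.get_default_parameters`, `LSHIndex`, `setup`) and where the candidate methods live vary between wrapper versions, so treat those names as assumptions:

```python
import numpy as np
import falconn

# Placeholder data; in the memory-constrained setting this would be the large
# dataset that is only needed while the tables are being built.
dataset = np.random.randn(10000, 128).astype(np.float32)

params = falconn.get_default_parameters(dataset.shape[0], dataset.shape[1])
index = falconn.LSHIndex(params)
index.setup(dataset)  # builds the hash tables

query = np.random.randn(128).astype(np.float32)

# Candidate-only queries: the result comes straight from the hash tables, so
# FALCONN should not need to touch the dataset again. Methods such as
# find_nearest_neighbor(), in contrast, re-read the dataset to rank candidates.
candidates = index.get_unique_candidates(query)
```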

In the future, we will make the distinction between candidate generation and nearest neighbor computation more clear.

funtusov commented 8 years ago

Just to confirm, is it the case that FALCONN never copies the data even when used via the Python bindings? In my experience, memory consumption essentially doubled at the start of indexing and then increased a bit further once the hash tables were constructed, and the way this varied with the number of bits per hash and the number of hash tables seemed to suggest that the data was indeed being copied.

ludwigschmidt commented 8 years ago

The C++ library never copies data. For the Python bindings, the story is a bit more complicated: there are two levels of wrapping going on.

A: The inner wrapper uses swig. The swig numpy bindings should not make a copy, but I'm not 100% certain about this. For instance, if the input to the swig wrapper is not a flat memory array internally, the swig numpy wrapper might create a copy. Is your data a flat memory array internally, i.e., a plain NumPy array? (A quick way to check this is sketched after item B below.)

B: The outer wrapper is just a thin Python wrapper around the swig bindings whose goal is to improve usability. It stores a reference to the dataset to avoid the case where the garbage collector frees the dataset while the user assumes FALCONN still works (without the stored reference, this leads to a segmentation fault).
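Regarding the check mentioned in A, one way to verify that a NumPy array is a flat memory block, and to force a single explicit copy if it is not (a sketch; `dataset` is a placeholder name for your data):

```python
import numpy as np

dataset = np.random.randn(10000, 128).astype(np.float32)  # placeholder data

# A "flat memory array" here means a single contiguous C-ordered buffer.
# Views such as transposes or column slices are not contiguous, and a wrapper
# layer may silently copy them.
print(dataset.flags['C_CONTIGUOUS'])

# If the array is not contiguous, make one explicit copy up front instead of
# letting the swig/numpy layer copy it implicitly.
if not dataset.flags['C_CONTIGUOUS']:
    dataset = np.ascontiguousarray(dataset, dtype=np.float32)
```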

We haven't really optimized the Python bindings for memory efficiency so far, partly because we assumed that people would use the C++ bindings for memory-critical settings.

Would the C++ library be an option for you? If you have ideas for how to prevent memory copies in the Python bindings, do let us know!

funtusov commented 8 years ago

Thanks for your reply, we'll investigate using C++ directly versus a basic Python binding that doesn't keep a copy of the data.

That opens up the question of saving and loading hash tables, since the workflow would be to create the hash tables on a huge instance (or create them sequentially) and then use the generated tables on a smaller instance. But, as I understand it, that is already tracked in https://github.com/FALCONN-LIB/FALCONN/issues/46

ludwigschmidt commented 8 years ago

I'm not exactly sure how you would use the generated hash tables on a smaller instance. But #46 is indeed the ideal solution for loading and storing hash tables.

ludwigschmidt commented 7 years ago

@funtusov Could you control the memory usage of FALCONN in the end?

funtusov commented 7 years ago

@ludwigschmidt Not really, we eventually went with an in-house LSH implementation in C++.

ludwigschmidt commented 7 years ago

OK, thanks for the update. We'll investigate this when we update the Python wrapper in #69.