dkoslicki / CMash

Fast and accurate set similarity estimation via containment min hash
BSD 3-Clause "New" or "Revised" License

For very large databases, creation of TST is slow and memory intensive #29

Open dkoslicki opened 4 years ago

dkoslicki commented 4 years ago

Will need to:

Work will happen in the branch ModularizeMake

dkoslicki commented 4 years ago

Will concentrate first on the TST before modularizing the CountEstimators.

dkoslicki commented 4 years ago

On a database with 1000 genomes:

New timing: 31.374449986004038 sec
Old timing: 36.79385429399554 sec
Old timing (including import time): 39.00447271599842 sec
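
For reference, this kind of wall-clock comparison can be reproduced with a small wrapper like the sketch below (build_tst_new and build_tst_old are hypothetical placeholders for the new and old TST-construction code paths, not actual CMash function names):

import time

def time_call(fn, *args, **kwargs):
    # Return (result, elapsed wall-clock seconds) for a single call
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Hypothetical usage:
# _, new_secs = time_call(build_tst_new, database_file, tst_out_file)
# _, old_secs = time_call(build_tst_old, database_file, tst_out_file)
# print("New timing:", new_secs, "Old timing:", old_secs)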

For memory profiling:

pip install psrecord
python Make.py & psrecord $(pidof python) --interval 1 --plot <out_file>.png
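
If psrecord isn't available, roughly the same measurement can be done by hand with psutil (a minimal sketch, assuming psutil is installed; not part of CMash):

import sys
import time
import psutil

def sample_rss(pid, interval=1.0):
    # Print the resident set size of `pid` (in MiB) once per `interval` seconds
    proc = psutil.Process(pid)
    try:
        while True:
            rss_mib = proc.memory_info().rss / 2**20
            print("%.1f MiB" % rss_mib, flush=True)
            time.sleep(interval)
    except psutil.NoSuchProcess:
        pass  # target process exited

if __name__ == "__main__":
    sample_rss(int(sys.argv[1]))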

[Memory usage plots: new way and old way]

So that's a win for the new method of TST creation.

dkoslicki commented 4 years ago

Note to self: piping from the C++ marisa-trie implementation works as well. LocalTest2.py:

import sys

def main():
    # Make the local CMash checkout importable
    sys.path.append("/home/dkoslicki/Desktop/CMash/")
    from CMash.Make import MakeTSTNew
    small_database_file = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_1000.h5"
    TST_export_file_new = "/home/dkoslicki/Desktop/CMash/tests/TempData/cmash_db_n5000_k60_new.tst"
    M = MakeTSTNew(small_database_file, TST_export_file_new)
    # Stream the trie keys to stdout, one per line, for marisa-build to consume
    for entry in M.yield_trie_items_to_insert_no_import(small_database_file):
        print(entry)

if __name__ == "__main__":
    main()

Then

 python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst

results in the same md5sum as the old and new methods. Timing is 35.4 sec.

For marisa-trie C++ installation, note:

$ git clone https://github.com/s-yata/marisa-trie.git
$ cd marisa-trie
$ sudo apt-get install autoconf
$ sudo apt-get install libtool
$ autoreconf -i
$ ./configure --enable-native-code

dkoslicki commented 4 years ago

And the C++ implementation results in the following memory usage, measured via:

python ../../CMash/LocalTest2.py | /usr/bin/time ~/Desktop/marisa-trie/tools/marisa-build > cmash_db_n5000_k60_new.tst & sleep 1;  psrecord $(pidof marisa-build) --interval 1 --plot marisa.png

[Memory usage plot: marisa-build]

So that's a touch better memory usage. Will need to experiment with the C++ CLI options.

dkoslicki commented 4 years ago

Problem: MakeTSTNew is still rather memory intensive when creating large databases (~200K genomes) with many hashes (-n 5000). The C++ marisa-trie seems to overcome some of this due to its ability to stream input in, but it may not be able to overcome the fact that the TST needs to be held completely in memory while it is being constructed (since each additional key needs access to the whole TST to know where it should be placed). May need to carefully balance how things are streamed into the TST, so the pipe doesn't need to hold a ton of data while the TST slowly constructs itself (a rough sketch of this is at the end of this comment).

Alternatives: Search for a different TST option.

Note: this is mainly for very large databases with lots of sketches, so not a massive issue for applications like Metalign atm.
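
A rough sketch of the streaming idea above (not CMash code): write keys into marisa-build's stdin one line at a time, so the Python side never buffers more than the OS pipe capacity. Here keys is any iterable of k-mer strings (e.g. the generator used in LocalTest2.py), and marisa-build is assumed to be on the PATH:

import subprocess

def stream_keys_to_marisa(keys, out_path, marisa_build="marisa-build"):
    # Pipe one key per line into marisa-build, writing the trie to out_path.
    # This bounds memory on the Python side; marisa-build itself still has to
    # hold the whole trie in RAM while constructing it.
    with open(out_path, "wb") as out_fh:
        proc = subprocess.Popen([marisa_build], stdin=subprocess.PIPE, stdout=out_fh)
        for key in keys:
            proc.stdin.write(key.encode() + b"\n")
        proc.stdin.close()
        proc.wait()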

dkoslicki commented 4 years ago

Modularization is complete and appears to be working, so merging to master for now as it does give a speedup.

@dkoslicki NOTE: make sure that nothing funky is happening with the file name order in the switch from pool.map to pool.imap.

Edit: according to the docs, imap should return results in the same order, but I might consider a chunksize larger than 1 (maybe length of the iterable / core count?).
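
For reference, the chunksize idea would look roughly like the sketch below; process_file and file_names are hypothetical stand-ins, not the actual names in Make.py:

import multiprocessing

def run_in_order(process_file, file_names):
    cores = multiprocessing.cpu_count()
    # imap preserves the input order; a chunksize > 1 cuts down on IPC overhead
    chunksize = max(1, len(file_names) // cores)
    with multiprocessing.Pool(cores) as pool:
        return list(pool.imap(process_file, file_names, chunksize=chunksize))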