andersen-lab / bjorn

GNU General Public License v3.0

Hashtables for caching and analysis #46

Open mindoftea opened 3 years ago

mindoftea commented 3 years ago

We can use GDBM to keep persistent hashtables of sequence-level data, which would help with performance as well as analysis tasks.

Required tables:
id2dat: accession_id -> {metadata} + {mutations} + {lineage}
key2id: ++ -> [sequence_ids]

We need id2dat to implement caching, so that bjorn can skip sequences it has already seen, and key2id is important for growth-rate analysis.
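A rough sketch of how the two required tables could be populated, not a final design. It uses Python's generic `dbm` module, which picks GDBM (`dbm.gnu`) where it is installed and falls back to another backend otherwise. The function name `add_sequence`, the JSON encoding of values, the accession IDs, the record fields, and the opaque key string `"example-key"` are all assumptions made for illustration; the actual key composition is whatever the table definition above ends up being.

```python
import dbm
import json
import os
import tempfile


def add_sequence(id2dat, key2id, accession_id, record, key):
    """Store one sequence's record and index its ID under a grouping key."""
    # id2dat: accession_id -> JSON blob with metadata, mutations, lineage
    id2dat[accession_id.encode()] = json.dumps(record).encode()

    # key2id: key -> JSON list of sequence IDs that share that key
    raw = key2id.get(key.encode())
    ids = json.loads(raw) if raw else []
    if accession_id not in ids:
        ids.append(accession_id)
        key2id[key.encode()] = json.dumps(ids).encode()


def demo():
    """Build both tables in a temporary directory and read them back."""
    with tempfile.TemporaryDirectory() as tmp:
        with dbm.open(os.path.join(tmp, "id2dat"), "c") as id2dat, \
             dbm.open(os.path.join(tmp, "key2id"), "c") as key2id:
            # Placeholder records; real ones would come from the pipeline.
            for seq_id in ("SEQ_1", "SEQ_2"):
                record = {
                    "metadata": {"note": "placeholder"},
                    "mutations": ["A23403G"],
                    "lineage": "B.1.1.7",
                }
                add_sequence(id2dat, key2id, seq_id, record, "example-key")
            return (
                json.loads(id2dat[b"SEQ_1"]),
                json.loads(key2id[b"example-key"]),
            )
```

Storing the ID list as a single value keeps lookups to one read per key, at the cost of rewriting the whole list on every append.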

Optional tables:
id2seq: accession_id -> [covz compressed sequence data]
mut2id: mut_code -> [sequence_ids]

id2seq might be useful for testing new features and performing more in-depth sequence-level analysis, and mut2id can be used for keeping nucleotide statistics. The longest rows in mut2id will likely reach several MB, so it should be stored in a sorted file instead of GDBM.
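One possible shape for the sorted-file variant of mut2id, as a sketch only. It assumes one tab-separated line per mutation code, sorted by code, with a small in-memory list of byte offsets so a lookup can seek straight to the right row. The file layout, the function names `write_sorted` and `lookup`, and the sample mutation codes and IDs are hypothetical.

```python
import bisect
import os
import tempfile


def write_sorted(path, mut2id):
    """Write 'mut_code<TAB>id,id,...' lines sorted by mut_code.

    Returns parallel lists of codes and the byte offset of each row.
    """
    codes, offsets = [], []
    with open(path, "wb") as fh:
        for code in sorted(mut2id):
            codes.append(code)
            offsets.append(fh.tell())
            line = code + "\t" + ",".join(mut2id[code]) + "\n"
            fh.write(line.encode())
    return codes, offsets


def lookup(path, codes, offsets, code):
    """Binary-search the code list, then read only the matching row."""
    i = bisect.bisect_left(codes, code)
    if i == len(codes) or codes[i] != code:
        return []
    with open(path, "rb") as fh:
        fh.seek(offsets[i])
        line = fh.readline().decode().rstrip("\n")
    _, ids = line.split("\t", 1)
    return ids.split(",") if ids else []


def demo():
    """Write a tiny table to a temporary file and query it."""
    table = {
        "C241T": ["SEQ_1"],
        "A23403G": ["SEQ_1", "SEQ_2"],
    }
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mut2id.tsv")
        codes, offsets = write_sorted(path, table)
        return lookup(path, codes, offsets, "A23403G"), lookup(path, codes, offsets, "G1T")
```

Only the requested row is read from disk, so a multi-MB row costs nothing unless it is actually queried.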

Once this is done, caching can be implemented as an extension of the filtering step.
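To make the idea concrete, here is a minimal sketch of caching as a filter, assuming the filtering step sees `(accession_id, sequence)` pairs and that the id2dat table exposes a mapping-style `.get()` with byte keys (as Python `dbm` handles do). The function name `filter_unseen` and the sample data are hypothetical, and a plain dict stands in for the table.

```python
def filter_unseen(records, id2dat):
    """Yield only the records whose accession ID is not yet in id2dat."""
    for accession_id, sequence in records:
        if id2dat.get(accession_id.encode()) is None:
            yield accession_id, sequence


def demo():
    """Run the filter against a dict standing in for the id2dat table."""
    seen = {b"SEQ_1": b"{}"}  # SEQ_1 was processed in an earlier run
    incoming = [("SEQ_1", "ACGT"), ("SEQ_2", "ACGA")]
    return list(filter_unseen(incoming, seen))
```

Sequences that pass the filter would then be processed as usual and written back to id2dat, so the next run skips them.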