Hashtables for caching and analysis

We can use GDBM to keep persistent hashtables of sequence-level data, which should help with both performance and analysis tasks.

Required tables:

- id2dat: accession_id -> {metadata} + {mutations} + {lineage}
- key2id: ++ -> [sequence_ids]

We need id2dat to implement caching, so that bjorn can skip sequences it has already seen; key2id is important for growth-rate analysis.
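As a rough sketch of how the two required tables could be populated, the snippet below uses Python's standard `dbm` module, which opens a GDBM file when that backend is installed and falls back to another backend otherwise (`dbm.gnu` would pin it to GDBM). Everything else is an assumption made for illustration: values are stored as JSON, `add_sequence` is a hypothetical helper rather than part of bjorn, the accession IDs are made up, and the three key parts are placeholders because the components of the key2id key are not spelled out above.

```python
import dbm
import json
import os
import tempfile


def add_sequence(id2dat, key2id, accession_id, data, key):
    """Store one record in id2dat and index its ID under `key` in key2id."""
    id2dat[accession_id] = json.dumps(data)
    # key2id maps one composite key to a JSON list of sequence IDs.
    ids = json.loads(key2id[key]) if key in key2id else []
    if accession_id not in ids:
        ids.append(accession_id)
        key2id[key] = json.dumps(ids)


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        with dbm.open(os.path.join(tmp, "id2dat"), "c") as id2dat, \
             dbm.open(os.path.join(tmp, "key2id"), "c") as key2id:
            # Placeholder key parts; the real components are not given here.
            key = "+".join(["part_a", "part_b", "part_c"])
            add_sequence(id2dat, key2id, "SEQ_0001",
                         {"metadata": {}, "mutations": [], "lineage": "X"}, key)
            add_sequence(id2dat, key2id, "SEQ_0002",
                         {"metadata": {}, "mutations": [], "lineage": "X"}, key)
            return json.loads(key2id[key]), json.loads(id2dat["SEQ_0001"])
```

Calling `demo()` returns the ID list stored under the composite key together with the decoded record for the first sequence.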

Optional tables:

- id2seq: accession_id -> [covz compressed sequence data]
- mut2id: mut_code -> [sequence_ids]

id2seq might be useful for testing new features and for more in-depth sequence-level analysis, and mut2id can be used for keeping nucleotide statistics. mut2id's longest rows will likely be several MB, so it should be stored in a sorted file instead of GDBM.
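One way to read "a sorted file" for mut2id is a tab-separated text file with one row per mutation code, sorted by code, so that a row can be found by binary search over byte offsets without loading the file. The sketch below assumes that layout; the file format, the function names, and the sample codes and IDs are all illustrative and not taken from bjorn.

```python
import os
import tempfile


def write_sorted_table(path, mut2id):
    """Write rows of 'mut_code<TAB>id1,id2,...' sorted by byte order of the code."""
    with open(path, "wb") as f:
        for code in sorted(mut2id, key=lambda c: c.encode()):
            f.write(f"{code}\t{','.join(mut2id[code])}\n".encode())


def _line_after(f, pos):
    """Return the line at offset 0, or the first line starting after `pos`."""
    f.seek(pos)
    if pos > 0:
        f.readline()  # skip the line that contains `pos`
    return f.readline()


def lookup(path, mut_code):
    """Binary-search the sorted file; return the ID list, or [] if absent."""
    target = mut_code.encode()
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            line = _line_after(f, mid)
            if not line or line.split(b"\t", 1)[0] >= target:
                hi = mid
            else:
                lo = mid + 1
        line = _line_after(f, lo)
    if not line:
        return []
    code, ids = line.rstrip(b"\n").split(b"\t", 1)
    if code != target or not ids:
        return []
    return ids.decode().split(",")


def demo():
    table = {"C241T": ["S1"], "A23403G": ["S1", "S2"], "G28881A": ["S3"]}
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "mut2id.tsv")
        write_sorted_table(path, table)
        return {code: lookup(path, code) for code in list(table) + ["T1"]}
```

Each lookup reads only a handful of lines, so a row of several MB costs nothing unless it is the one being requested.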

Once this is done, caching can be implemented as an extension of the filtering step.
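A minimal sketch of that extension, assuming the filtering step sees each incoming sequence as a dict with an `accession_id` field (a hypothetical record shape) and that id2dat is opened through Python's `dbm` module: sequences whose ID is already a key in id2dat are dropped before any further processing. `filter_unseen` is an illustrative name, not an existing bjorn function.

```python
import dbm
import os
import tempfile


def filter_unseen(records, id2dat):
    """Yield only the records whose accession ID is not yet cached in id2dat."""
    for rec in records:
        if rec["accession_id"] not in id2dat:
            yield rec


def demo():
    incoming = [{"accession_id": "SEQ_0001"}, {"accession_id": "SEQ_0002"}]
    with tempfile.TemporaryDirectory() as tmp:
        with dbm.open(os.path.join(tmp, "id2dat"), "c") as id2dat:
            id2dat["SEQ_0001"] = "{}"  # pretend this one was processed earlier
            return [rec["accession_id"] for rec in filter_unseen(incoming, id2dat)]
```

Writing a generator keeps the check lazy, so it can sit in front of the existing filters without holding the whole input in memory.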
