We can use GDBM to keep hash tables of sequence-level data, which should help with both performance and analysis tasks.
Required tables:
- id2dat: accession_id -> {metadata} + {mutations} + {lineage}
- key2id:++ -> [sequence_ids]
We need id2dat to implement caching, so that bjorn can skip sequences it has already seen. key2id is important for growth-rate analysis.
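As a rough illustration of the caching idea, the sketch below stores one JSON record per accession_id and skips any sequence whose ID is already present. It uses Python's generic `dbm` module so it runs anywhere; `dbm.gnu` could be swapped in where GDBM is installed. The function name `filter_unseen`, the record layout, and the sample accession IDs are assumptions made for this example, not bjorn's actual code.

```python
import dbm
import json
import os
import tempfile


def filter_unseen(records, db):
    """Yield records whose accession_id is not yet in id2dat, caching each one.

    `records` is an iterable of dicts with the (assumed) keys
    accession_id, metadata, mutations and lineage.
    """
    for rec in records:
        key = rec["accession_id"].encode("utf-8")
        if key in db:
            continue  # already cached by an earlier run, so skip it
        value = {
            "metadata": rec["metadata"],
            "mutations": rec["mutations"],
            "lineage": rec["lineage"],
        }
        db[key] = json.dumps(value).encode("utf-8")
        yield rec


# Hypothetical sample data, for illustration only.
batch = [
    {"accession_id": "EPI_1", "metadata": {"country": "X"},
     "mutations": ["S:D614G"], "lineage": "B.1"},
    {"accession_id": "EPI_2", "metadata": {"country": "Y"},
     "mutations": [], "lineage": "A"},
]

db_path = os.path.join(tempfile.mkdtemp(), "id2dat")
with dbm.open(db_path, "c") as id2dat:
    first_pass = [r["accession_id"] for r in filter_unseen(batch, id2dat)]
    second_pass = [r["accession_id"] for r in filter_unseen(batch, id2dat)]
    cached_lineage = json.loads(id2dat[b"EPI_1"])["lineage"]
```

On the first pass both records are new and get cached; on the second pass the same batch yields nothing, which is the behaviour the filtering step would rely on.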
Optional tables:
- id2seq: accession_id -> [covz compressed sequence data]
- mut2id: mut_code -> [sequence_ids]
id2seq might be useful for testing new features and for more in-depth sequence-level analysis, and mut2id can be used for keeping nucleotide statistics. mut2id's longest rows will likely reach several MB, so it should be stored in a sorted file instead of GDBM.
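One possible shape for the sorted mut2id file is a tab-separated text file with one row per mut_code, sorted by key, so that a lookup can stop as soon as it passes the place where the key would be. The sketch below is only an assumption about the layout; the helper names `write_mut2id` and `lookup_mut`, the file format, and the sample mutation codes and IDs are invented for illustration.

```python
import os
import tempfile


def write_mut2id(path, mut2id):
    """Write mut_code -> [sequence_ids] as TSV rows, sorted by mut_code."""
    with open(path, "w", encoding="utf-8") as fh:
        for mut_code in sorted(mut2id):
            fh.write(mut_code + "\t" + ",".join(mut2id[mut_code]) + "\n")


def lookup_mut(path, mut_code):
    """Return the sequence IDs for mut_code, or [] if it is absent.

    Rows are sorted, so the scan stops at the first key greater than
    the one requested instead of reading the whole file.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            key, _, ids = line.rstrip("\n").partition("\t")
            if key == mut_code:
                return ids.split(",") if ids else []
            if key > mut_code:
                break
    return []


# Hypothetical sample data, for illustration only.
table = {
    "S:N501Y": ["EPI_2"],
    "S:D614G": ["EPI_1", "EPI_3"],
    "ORF1a:T265I": ["EPI_3"],
}
mut_path = os.path.join(tempfile.mkdtemp(), "mut2id.tsv")
write_mut2id(mut_path, table)
hits = lookup_mut(mut_path, "S:D614G")
misses = lookup_mut(mut_path, "S:A222V")
```

The linear scan keeps the example short. For a large file, a small in-memory index of byte offsets with a binary search over it would avoid reading rows that are several MB long.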
Once this is done, caching can be implemented as an extension of the filtering step.