The way we perform alignments could be much more efficient. We discard every read that contains an insertion or deletion, so we are already assuming a priori that each retained read aligns to the reference without gaps. As a result, there should be no need to perform a global alignment with Biopython -- we can compare the reads to the reference directly, anchoring each read to the appropriate end of the reference. Reads with more than a given number of mismatches can then be discarded.
Doing this would allow us to (1) avoid the O(n^2) memory that a global alignment needs for a reference of length n, (2) ordinally encode bases as integers from the start, which saves memory, and (3) take advantage of NumPy vectorization to perform alignment QC and counting.
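A minimal sketch of what this could look like, not the current implementation. It assumes all reads in a batch have the same length, are no longer than the reference, and that reverse reads have already been reverse-complemented. The names `encode` and `filter_and_count` and the `anchor` parameter are hypothetical.

```python
import numpy as np

# Hypothetical base alphabet; column order of the returned count matrix.
ALPHABET = np.frombuffer(b"ACGT", dtype=np.uint8)


def encode(seq):
    """Ordinally encode a sequence as one uint8 (ASCII code) per base."""
    return np.frombuffer(seq.encode("ascii"), dtype=np.uint8)


def filter_and_count(reads, reference, max_mismatches, anchor="left"):
    """Ungapped comparison of equal-length reads against the reference.

    anchor="left" pins reads to the start of the reference,
    anchor="right" pins them to its end.
    Returns per-read mismatch counts and a (read_len, 4) matrix of
    A/C/G/T counts over the reads that pass the mismatch cutoff.
    """
    ref = encode(reference)
    read_len = len(reads[0])
    # One row per read, one column per position.
    encoded = np.vstack([encode(r) for r in reads])
    window = ref[:read_len] if anchor == "left" else ref[-read_len:]
    # Broadcasting compares every read to the window in one operation.
    mismatches = (encoded != window).sum(axis=1)
    kept = encoded[mismatches <= max_mismatches]
    # Compare each kept base to the alphabet, then sum over reads.
    counts = (kept[:, :, None] == ALPHABET).sum(axis=0)
    return mismatches, counts
```

Memory here scales with the number of reads times the read length, one byte per base, rather than with the square of the reference length.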
We may also want to experiment with exactly when new processes are spawned for data analysis. Ideally, we send as little data as possible to each spawned process and return only what we need to comprehensively analyze all wells. Reorganizing the code to minimize the data transferred to and from workers should also reduce memory bloat.
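One way this might be arranged, as a sketch under stated assumptions: each worker receives only a file path and a few parameters, reads its own data, and returns a small count matrix. The one-read-per-line file format, the left-anchored comparison, and the names `summarize_well` and `analyze_wells` are all hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np

# Hypothetical base alphabet; column order of the returned count matrix.
ALPHABET = np.frombuffer(b"ACGT", dtype=np.uint8)


def summarize_well(task):
    """Worker: takes (path, reference, max_mismatches), returns (path, counts).

    Assumes a plain-text file with one read per line (an illustrative
    format). Only the path goes in and only a (len(reference), 4) count
    matrix comes back, so no read data crosses the process boundary.
    """
    path, reference, max_mismatches = task
    ref = np.frombuffer(reference.encode("ascii"), dtype=np.uint8)
    counts = np.zeros((ref.size, ALPHABET.size), dtype=np.int64)
    with open(path) as handle:
        for line in handle:
            read = np.frombuffer(line.strip().encode("ascii"), dtype=np.uint8)
            if read.size == 0 or read.size > ref.size:
                continue  # skip blank lines and over-long reads
            if np.count_nonzero(read != ref[:read.size]) > max_mismatches:
                continue  # fails the mismatch cutoff
            counts[:read.size] += read[:, None] == ALPHABET
    return path, counts


def analyze_wells(tasks, n_workers=2):
    """Run one task per well and collect the small per-well summaries."""
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        return dict(pool.map(summarize_well, tasks))
```

The parent then holds one small matrix per well instead of the reads themselves. On platforms that spawn rather than fork, `analyze_wells` should be called from under an `if __name__ == "__main__":` guard.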