fhalab / evSeq

Computational tools for extremely low-cost, massively parallel amplicon-based sequencing of every variant in protein mutant libraries.
https://fhalab.github.io/evSeq/

Refactor Alignment and Data Processing #34

Open brucejwittmann opened 2 years ago

brucejwittmann commented 2 years ago

The way we perform alignments could be much more efficient. We toss all reads with an insertion or deletion, so we are assuming a priori that every retained read aligns to the reference without gaps. As a result, there should be no need to perform a global alignment with Biopython: we can compare the reads to the reference directly, aligning the tail ends of the reads to the appropriate ends of the reference. Reads with more than a given number of mismatches can then be discarded.

Doing this would allow us to (1) avoid the O(n^2) memory requirement for aligning to a reference of length n, (2) ordinally encode characters from the beginning, thus saving on memory, and (3) take advantage of vectorization with numpy to perform alignment QC and counting.

We may also want to play around with when exactly new processes are spawned for data analysis. Ideally, we want to send as little data as possible to the spawned processes, then return only what we need to comprehensively analyze all wells. Reorganizing code to maximize this transfer/memory efficiency should also reduce memory bloat.
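One possible shape for this, sketched under assumptions: each worker receives only a single well's encoded reads (a compact uint8 array) and sends back only a small per-position count table. `summarize_well` and `summarize_plate` are hypothetical names, and the dict-of-arrays input is an assumed layout, not how evSeq currently organizes its data.

```python
from concurrent.futures import ProcessPoolExecutor

import numpy as np


def summarize_well(task):
    """Worker: takes one well's encoded reads, returns only a small count table."""
    well_id, reads, n_symbols = task
    # Count each symbol at each position; result has shape (read_len, n_symbols).
    counts = np.stack([(reads == s).sum(axis=0) for s in range(n_symbols)], axis=1)
    return well_id, counts


def summarize_plate(wells, n_symbols=5, max_workers=None):
    """Fan wells out to worker processes and collect {well_id: counts}.

    wells: dict mapping a well ID to a 2D uint8 array (n_reads, read_len)
    """
    tasks = [(well_id, reads, n_symbols) for well_id, reads in wells.items()]
    with ProcessPoolExecutor(max_workers=max_workers) as pool:
        # Only the per-well arrays go out and only the count tables come back.
        return dict(pool.map(summarize_well, tasks, chunksize=8))
```

`summarize_plate` should be called from under an `if __name__ == "__main__":` guard so that it also works on platforms that spawn rather than fork worker processes. Whether per-well tasks are the right granularity is something we would need to measure.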