PoonLab / OpenRDP

An open-source re-implementation of the RDP4 recombination detection program
GNU General Public License v3.0

Process and validate command line arguments #5

Closed kwade4 closed 3 years ago

kwade4 commented 3 years ago

The main pre-processing steps (common to all sequences) include:

In the original source code, the sequences are divided into chunks of 4 nucleotides and then converted to integers.

Upon installation of the program, the pairwise Hamming distance is calculated for all 625 possible combinations of 5 characters (ATGC-), and these values are written to a file. During execution, the values are loaded into a lookup table and queried to calculate the Hamming distance.
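For reference, here is a rough sketch of what a lookup-table approach could look like. It is not taken from the RDP4 source. It assumes the 625 entries correspond to the 5^4 = 625 possible 4-character chunks over ATGC-, that each chunk is converted to an integer by reading it as a base-5 number, and that a gap counts as a mismatch against a nucleotide. The names `chunk_code`, `build_table`, and `hamming_lookup` are made up for illustration, and the table is kept in memory rather than written to a file.

```python
from itertools import product

ALPHABET = "ATGC-"   # the 5 characters named in the issue
CHUNK = 4            # assumed chunk length (4 nucleotides per chunk)


def chunk_code(chunk):
    """Map a 4-character chunk to an integer in range(625), read as base 5."""
    code = 0
    for ch in chunk:
        code = code * len(ALPHABET) + ALPHABET.index(ch)
    return code


def build_table():
    """Precompute the Hamming distance between every pair of chunks.

    product() yields chunks in the same order as chunk_code() numbers them,
    so table[i][j] is the distance between chunk i and chunk j.
    """
    chunks = ["".join(p) for p in product(ALPHABET, repeat=CHUNK)]
    return [
        [sum(a != b for a, b in zip(c1, c2)) for c2 in chunks]
        for c1 in chunks
    ]


def hamming_lookup(seq1, seq2, table):
    """Sum the cached per-chunk distances over two aligned sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    if len(seq1) % CHUNK:
        raise ValueError("sequence length must be a multiple of the chunk size")
    total = 0
    for i in range(0, len(seq1), CHUNK):
        row = chunk_code(seq1[i:i + CHUNK])
        col = chunk_code(seq2[i:i + CHUNK])
        total += table[row][col]
    return total
```

Under these assumptions the table is built once (for example with `table = build_table()`) and every later distance calculation costs one lookup per chunk instead of one comparison per character.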

kwade4 commented 3 years ago

Rather than caching the distances, I opted to encode the sequences as bit arrays and calculate the Hamming distance by counting the 1s in bitseq1 XOR bitseq2. Performance seems fine for now; if it becomes an issue, I'll look into the caching approach.
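A minimal sketch of the XOR idea, not the actual OpenRDP code: it assumes a one-hot encoding with 5 bits per character (one bit each for A, T, G, C and the gap), packed into a plain Python int. The `ONE_HOT` table and the function names are hypothetical. With one-hot codes every mismatched position leaves exactly two 1s in the XOR result, so the popcount is halved; a different encoding would need a different correction.

```python
# Hypothetical one-hot codes: exactly one bit set per character.
ONE_HOT = {
    "A": 0b00001,
    "T": 0b00010,
    "G": 0b00100,
    "C": 0b01000,
    "-": 0b10000,
}
BITS_PER_CHAR = 5


def encode(seq):
    """Pack a sequence into a single int, 5 bits per character."""
    bits = 0
    for ch in seq.upper():
        bits = (bits << BITS_PER_CHAR) | ONE_HOT[ch]
    return bits


def hamming(seq1, seq2):
    """Hamming distance via XOR and popcount on one-hot encoded sequences."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    diff = encode(seq1) ^ encode(seq2)
    # Matching positions XOR to zero; each mismatch leaves two set bits.
    return bin(diff).count("1") // 2
```

For example, `hamming("ATGC", "ATGA")` compares four positions and differs only at the last one. In this sketch a gap is treated as just another character, so a gap against a nucleotide counts as a mismatch.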