Closed kwade4 closed 3 years ago
Rather than caching the distances, I opted to encode the sequences as bit arrays and calculate the Hamming distance by counting the 1s in bitseq1 XOR bitseq2. Performance seems fine for now; if it becomes an issue, I'll look into the caching approach.
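For anyone following along, here is a minimal sketch of the XOR idea. The exact bit encoding used in the project is not stated in this thread, so the one-hot code below (one bit per symbol of ATGC-) and the names `encode` and `hamming` are assumptions for illustration only. With a one-hot code, each mismatching position leaves exactly two set bits in the XOR, so the count of 1s is halved.

```python
# Hypothetical one-hot code: one bit per symbol of the alphabet "ATGC-".
# This is an assumed encoding, not necessarily the one used in the project.
ALPHABET = "ATGC-"
WIDTH = len(ALPHABET)
BITS = {ch: 1 << i for i, ch in enumerate(ALPHABET)}


def encode(seq: str) -> int:
    """Pack a sequence into one Python int, WIDTH bits per position."""
    bits = 0
    for ch in seq.upper():
        bits = (bits << WIDTH) | BITS[ch]
    return bits


def hamming(seq1: str, seq2: str) -> int:
    """Hamming distance via XOR and a count of set bits."""
    if len(seq1) != len(seq2):
        raise ValueError("sequences must have equal length")
    diff = encode(seq1) ^ encode(seq2)
    # A mismatch sets two bits in the XOR (one from each sequence),
    # a match sets none, so halve the popcount.
    return bin(diff).count("1") // 2
```

In practice the sequences would be encoded once up front and only the XOR and bit count repeated per pair.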
The main pre-processing steps (common to all sequences) include:
In the original source code, the sequences are divided into chunks of 4 nucleotides, and each chunk is then converted to an integer.
Upon installation of the program, the pairwise Hamming distance is calculated for all 625 possible 4-character chunks over the 5-character alphabet (ATGC-), and these values are written to a file. During execution of the program, the values are loaded into a lookup table, which is queried to calculate the Hamming distance.
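A rough sketch of that lookup-table approach, for comparison. The original code's chunk-to-integer scheme and file format are not described here, so the base-5 encoding, the in-memory table (instead of a file written at install time), and the names `chunk_to_int`, `build_table` and `hamming_lookup` are all assumptions.

```python
from itertools import product

ALPHABET = "ATGC-"
CHUNK = 4  # chunk length described above; 5 ** 4 == 625 possible chunks


def chunk_to_int(chunk: str) -> int:
    """Read a chunk as a base-5 number (assumed encoding), giving 0..624."""
    value = 0
    for ch in chunk:
        value = value * len(ALPHABET) + ALPHABET.index(ch)
    return value


def build_table() -> list:
    """Precompute the Hamming distance for every pair of 4-character chunks."""
    chunks = ["".join(p) for p in product(ALPHABET, repeat=CHUNK)]
    n = len(chunks)
    table = [[0] * n for _ in range(n)]
    for a in chunks:
        row = table[chunk_to_int(a)]
        for b in chunks:
            row[chunk_to_int(b)] = sum(x != y for x, y in zip(a, b))
    return table


TABLE = build_table()  # built in memory here rather than read from a file


def hamming_lookup(seq1: str, seq2: str) -> int:
    """Sum the precomputed per-chunk distances over two aligned sequences."""
    if len(seq1) != len(seq2) or len(seq1) % CHUNK:
        raise ValueError("need equal lengths that are a multiple of 4")
    total = 0
    for k in range(0, len(seq1), CHUNK):
        i = chunk_to_int(seq1[k:k + CHUNK])
        j = chunk_to_int(seq2[k:k + CHUNK])
        total += TABLE[i][j]
    return total
```

The table has 625 x 625 entries, which is small enough to hold in memory; sequences whose length is not a multiple of 4 would need padding, which this sketch does not handle.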