dstreett / Super-Deduper

An application to remove PCR duplicates from high throughput sequencing runs.

Add RC into develop. #33

Open dstreett opened 8 years ago

dstreett commented 8 years ago

Need to reverse complement (RC) the ID region.

In master I handled this by taking the greater of the two IDs and using that as the unique value, rather than having two keys for each node.
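To illustrate the "greatest ID" idea, here is a minimal sketch, not the code in master. It assumes the ID is a plain string of A/C/G/T and uses lexicographic string comparison as a stand-in for however master actually compares IDs. The names complementBase, reverseComplement and canonicalId are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <string>

// Hypothetical helper: complement of one base; anything unexpected becomes 'N'.
char complementBase(char b) {
    switch (b) {
        case 'A': return 'T';
        case 'T': return 'A';
        case 'C': return 'G';
        case 'G': return 'C';
        default:  return 'N';
    }
}

// Hypothetical helper: reverse the sequence, then complement every base.
std::string reverseComplement(const std::string &seq) {
    std::string rc(seq.rbegin(), seq.rend());
    std::transform(rc.begin(), rc.end(), rc.begin(), complementBase);
    return rc;
}

// One key per node: the greater of the ID and its reverse complement.
// (Assumption: "greater" means lexicographic order on the string.)
std::string canonicalId(const std::string &id) {
    assert(!id.empty());
    return std::max(id, reverseComplement(id));
}
```

Because an ID and its reverse complement map to the same key, a read and its RC land on the same node without storing both keys.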

bioSandMan commented 8 years ago

In binarySearch.cpp on the NewBin branch, in the function BinarySearchTree::FlipBitsChars(), we moved the check for RC and the code that acts on it. We’re no longer building the reverse complement of the string. This is the code that was changed:

    if (seq_2 == NULL) {
        seq = (char *)malloc(sizeof(char) * charLength + 1);
        sprintf(seq, "%.*s", charLength, seq_1 + start);
    }
    else {
        seq = (char *)malloc(sizeof(char) * charLength * 2 + 1);
        sprintf(seq, "%.*s%.*s", charLength, seq_1 + start, charLength, seq_2 + start);
    }

to

    if (seq_2 == NULL) {
        seq = (char *)malloc(sizeof(char) * charLength + 1);
        sprintf(seq, "%.*s", charLength, seq_1 + start);
    }
    else if (RC) {
        // use the reverse complement to generate the ID:
        // simply add the sequences in reverse order
        seq = (char *)malloc(sizeof(char) * charLength * 2 + 1);
        sprintf(seq, "%.*s%.*s", charLength, seq_2 + start, charLength, seq_1 + start);
    }
    else {
        seq = (char *)malloc(sizeof(char) * charLength * 2 + 1);
        sprintf(seq, "%.*s%.*s", charLength, seq_1 + start, charLength, seq_2 + start);
    }

With this change, you should find that you no longer need the two RC functions, RC_BP and RC_Read.
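The same branching can be sketched with std::string in place of malloc and sprintf. This is only an illustration of the swap-the-read-order idea under stated assumptions, not the code in NewBin: buildId and its parameters are hypothetical names, and a null seq2 stands in for the single-end case.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for the ID construction in FlipBitsChars().
// seq2 == nullptr means single-end data; rc selects the swapped read order.
std::string buildId(const std::string &seq1, const std::string *seq2,
                    std::size_t start, std::size_t charLength, bool rc) {
    assert(start + charLength <= seq1.size());
    const std::string part1 = seq1.substr(start, charLength);
    if (seq2 == nullptr) {
        return part1;  // single-end: the ID comes from read 1 only
    }
    assert(start + charLength <= seq2->size());
    const std::string part2 = seq2->substr(start, charLength);
    // RC case: concatenate the two ID regions in reverse order, no complementing
    return rc ? part2 + part1 : part1 + part2;
}
```

In this sketch, a pair given as (R1, R2) with rc false produces the same ID as the pair given as (R2, R1) with rc true, which is the property the change relies on.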

I compiled and tested this code against a small subset of reads (attached) with reverse complements. With the code change above, NewBin now finds the reverse complements that Master finds.

diff_R1.txt diff_R2.txt