DMU-lilab / pTrimmer

Used to trim off the primer sequence from multiplex amplicon sequencing
GNU General Public License v3.0

Support for IUPAC ambiguity codes other than N? #20

Open peterjc opened 2 years ago

peterjc commented 2 years ago

Does pTrimmer support IUPAC ambiguity codes in the input primer sequences?

Looking at the code and examples, nothing appears to handle primers that use other ambiguity codes, such as M meaning A or C. This surprises me, as ambiguous primers are quite common in my experience.

https://en.wikipedia.org/wiki/Nucleic_acid_notation

Quoting dynamic.c you have:

            int match = (s1[i-1] == s2[j-1])
                        || ((degenerate & ALLOW_WILDCARD_SEQ1) && (s1[i-1] == 'N'))
                        || ((degenerate & ALLOW_WILDCARD_SEQ2) && (s2[j-1] == 'N'));
            int cost_diag = tmp_entry.cost + (match ? MATCH_COST : MISMATCH_COST);
            int cost_deletion = column[i].cost + DELETION_COST;
            int cost_insertion = column[i-1].cost + INSERTION_COST;

i.e. you consider wildcard matching only for the ambiguous base N (meaning A, C, G or T).
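For illustration, here is a minimal sketch of how the equality test could be generalised to the full IUPAC alphabet. It is not pTrimmer code: the helper names `iupac_mask` and `iupac_match` are hypothetical, and the sketch does not model the `degenerate` flags (`ALLOW_WILDCARD_SEQ1`, `ALLOW_WILDCARD_SEQ2`) from the snippet above. Each code is mapped to a 4-bit set of the concrete bases it can stand for, and two symbols are treated as matching when those sets overlap.

```c
#include <ctype.h>

/* Hypothetical helper: map an IUPAC nucleotide code to a 4-bit set,
 * using A=1, C=2, G=4, T/U=8. Unknown characters map to 0. */
static unsigned iupac_mask(char base)
{
    switch (toupper((unsigned char)base)) {
    case 'A': return 1;
    case 'C': return 2;
    case 'G': return 4;
    case 'T': case 'U': return 8;
    case 'R': return 1 | 4;      /* A or G */
    case 'Y': return 2 | 8;      /* C or T */
    case 'S': return 2 | 4;      /* C or G */
    case 'W': return 1 | 8;      /* A or T */
    case 'K': return 4 | 8;      /* G or T */
    case 'M': return 1 | 2;      /* A or C */
    case 'B': return 2 | 4 | 8;  /* C, G or T (not A) */
    case 'D': return 1 | 4 | 8;  /* A, G or T (not C) */
    case 'H': return 1 | 2 | 8;  /* A, C or T (not G) */
    case 'V': return 1 | 2 | 4;  /* A, C or G (not T) */
    case 'N': return 1 | 2 | 4 | 8;
    default:  return 0;
    }
}

/* Hypothetical helper: two symbols match if they share at least one
 * concrete base. Assumption: ambiguity is allowed on both sides, so an
 * N in either sequence matches any valid base. */
static int iupac_match(char primer_base, char read_base)
{
    return (iupac_mask(primer_base) & iupac_mask(read_base)) != 0;
}
```

With something like this, the `match` expression would call `iupac_match(s1[i-1], s2[j-1])` instead of comparing characters directly. Whether ambiguity should be honoured in both sequences or only in the primer is a design choice the existing flags already hint at.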

There is also some special treatment of 'N' in query.c, but I'm not quite sure what that is doing.

This may be coping with an N in the FASTQ files, which is the only ambiguity base reported by Illumina.

XLZH commented 10 months ago

I do not currently plan to add support for IUPAC codes, as it may require a substantial amount of refactoring. Thank you for the suggestion!

best, xiaolong zhang