KatyBrown / CIAlign

MIT License

Generating a Consensus Sequence like Geneious with Threshold #35

Open: bigben446 opened 3 years ago

bigben446 commented 3 years ago

https://assets.geneious.com/manual/2020.1/static/GeneiousManualse43.html

Threshold settings

The Threshold determines which base is called in the consensus, and can be set to a percentage, or by using the quality scores on the reads. IUPAC ambiguity codes (such as R for an A or G nucleotide) are counted as fractional support for each nucleotide in the ambiguity set (A and G, in this case), thus two rows with R are counted the same as one row with A and one row with G. When more than one nucleotide is necessary to reach the desired threshold, this is represented by the best-fit ambiguity symbol in the consensus; for protein sequences, this will always be an X.

For example, assume a column contains 6 A’s, 3 G’s and 1 T. If the consensus threshold is set to 60% or below, then the consensus will be A. If the consensus threshold is set to between 60% and 90%, then the consensus will be R. If the consensus threshold is set to over 90%, then the consensus will be D.

In the case of ties, either all or none of the involved residues will be selected. For example, if the above case instead had 6 A’s, 2 G’s and 2 T’s, then for a consensus threshold of 60% or below, an A will be called. Above a threshold of 60%, a D will be called.
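
To make the request concrete, here is a minimal Python sketch of the percentage-threshold rule as I read it from the manual text quoted above. It is not Geneious's code and not part of CIAlign: the function names `consensus_base` and `consensus` are hypothetical, it handles nucleotides only, it does not cover the quality-score mode, and it assumes that gaps and any character outside the IUPAC table are simply skipped (the quoted text does not say how those are treated).

```python
from fractions import Fraction

# IUPAC nucleotide codes and the bases each one stands for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}
# Reverse lookup: a set of bases -> its best-fit ambiguity symbol.
SET_TO_CODE = {frozenset(bases): code for code, bases in IUPAC.items()}


def consensus_base(column, threshold_percent):
    """Call one consensus symbol for a single alignment column (hypothetical helper)."""
    support = {}
    for symbol in column.upper():
        bases = IUPAC.get(symbol)
        if bases is None:
            continue  # assumption: gaps and unknown characters are ignored
        for base in bases:
            # An ambiguity code gives equal fractional support to each base in its set.
            support[base] = support.get(base, Fraction(0)) + Fraction(1, len(bases))

    total = sum(support.values())
    if total == 0:
        return "-"  # assumption: a column with no nucleotides is reported as a gap

    needed = Fraction(threshold_percent) * total / 100
    chosen, covered = set(), Fraction(0)
    # Add bases from most to least supported; tied bases are added together.
    for count in sorted(set(support.values()), reverse=True):
        tied = [base for base, c in support.items() if c == count]
        chosen.update(tied)
        covered += count * len(tied)
        if covered >= needed:
            break
    return SET_TO_CODE[frozenset(chosen)]


def consensus(sequences, threshold_percent):
    """Consensus of equal-length aligned sequences, column by column (hypothetical helper)."""
    return "".join(
        consensus_base("".join(col), threshold_percent) for col in zip(*sequences)
    )


print(consensus_base("AAAAAAGGGT", 60))  # 6 A, 3 G, 1 T at 60%
print(consensus_base("AAAAAAGGGT", 75))  # same column at 75%
print(consensus_base("AAAAAAGGTT", 75))  # tie between G and T
```

Exact fractions are used instead of floats so that a column with exactly 60% A's still meets a 60% threshold, matching the "60% or below" wording in the manual.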

KatyBrown commented 3 years ago

Thanks very much for your suggestion - we're planning to introduce other types of consensus in a future release, and adding this type of threshold-based approach would be a good idea.