berman-lab / ymap

YMAP - Yeast Mapping Analysis Pipeline : An online pipeline for the analysis of yeast genomic datasets.
MIT License

Bowtie2 settings #12

Open darrenabbey opened 9 years ago

darrenabbey commented 9 years ago

Currently bowtie2 is run with the --very-sensitive flag.

This produces the best-fitting alignment for every read, which is ideal for CNV analysis. The setting is not necessarily ideal for SNP analysis, because it yields what are likely to be spurious SNP data.

It might be advisable to run two alignments: one with the flag for CNV analysis and one without it for SNP analysis.
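
The two-pass idea could be sketched as follows. This is only an illustration: the index name, the file names, and the helper `bowtie2_command` are hypothetical, and the commands are built but not executed. When no preset is given, bowtie2 uses its default `--sensitive` preset.

```python
def bowtie2_command(index, reads, out_sam, very_sensitive):
    """Build a bowtie2 command line for single-end reads (hypothetical helper)."""
    cmd = ["bowtie2"]
    if very_sensitive:
        # Preset used for the CNV pass.
        cmd.append("--very-sensitive")
    # -x: index basename, -U: unpaired reads, -S: SAM output file.
    cmd += ["-x", index, "-U", reads, "-S", out_sam]
    return cmd

# Placeholder file names, assumed for illustration only.
cnv_cmd = bowtie2_command("ref_index", "reads.fastq", "cnv.sam", very_sensitive=True)
snp_cmd = bowtie2_command("ref_index", "reads.fastq", "snp.sam", very_sensitive=False)

print(" ".join(cnv_cmd))
print(" ".join(snp_cmd))
```

Each list could be passed to `subprocess.run` to perform the actual alignment.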

vladimirg commented 8 years ago

@darrenabbey, why would this generate spurious SNP calls if the alignments are the most correct?

darrenabbey commented 8 years ago

The sequencing technologies that produce whole-genome sequence datasets have an intrinsic error rate in base identification. This form of error would only rarely alter the positions of reads, so it would not impact CNV analysis. What the error will do is introduce spurious base heterogeneity that does not correspond to what is actually happening in the genome. The YMAP algorithms may then interpret this heterogeneity as SNPs.

If you removed the --very-sensitive flag, bowtie2 would discard reads which have poorer quality scores. For CNV analyses, this would result in real information loss and apparent variations in copy number that do not correspond to what is happening in the genome. For SNP analysis, discarding these poorer reads would help to filter out sequencing error and so would result in a higher signal-to-noise ratio in the SNP ratio data that YMAP uses for figure construction.

vladimirg commented 8 years ago

If I understand correctly, alignments are given a mapping quality (MQ) score, which we can filter on before doing the SNP calls (BTW, I don't think YMAP takes MQ scores into account today at all - or did I just miss it?). However, if the errors are (a) rare, (b) random and (c) affect only single nucleotides, then shouldn't YMAP filter them out simply by virtue of requiring a read depth of 30 and a minimum allele presence of 25%?
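
To put a rough number on this argument: if errors were independent between reads, the number of erroneous reads at a site would be approximately binomial. The sketch below uses the depth of 30 and the 25% threshold mentioned above; the per-base error rate of 0.01 is an illustrative assumption, not a figure from this thread, and `prob_false_allele` is a hypothetical helper, not YMAP code.

```python
from math import ceil, comb

def prob_false_allele(depth, error_rate, min_fraction):
    """Probability that at least min_fraction of `depth` reads carry an
    error at one site, assuming independent per-read errors (binomial)."""
    k_min = ceil(min_fraction * depth)  # smallest read count meeting the threshold
    return sum(
        comb(depth, k) * error_rate**k * (1 - error_rate) ** (depth - k)
        for k in range(k_min, depth + 1)
    )

# error_rate=0.01 is an assumed, illustrative value.
p = prob_false_allele(depth=30, error_rate=0.01, min_fraction=0.25)
print(f"{p:.3e}")
```

Under the independence assumption this probability is very small. The model says nothing about errors that recur in particular sequence contexts.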

darrenabbey commented 8 years ago

YMAP doesn't use quality scores. The percentage ratio cutoffs depend on the inferred copy numbers for the region.

The errors aren't random in the mathematical sense. Instead, they tend to recur in similar sequences that are difficult for the sequencing tech because of physical/chemical constraints. I haven't done a precise characterization of this, but some hints have led me to this intuition. Independent analyses by Joshua Baller and me showed that raw sequence data often appeared to show 1:2 ratios of SNPs in regions that were very low in SNP data, implying a triploid copy number even when the overwhelming evidence from CNV analysis said the region was diploid. We never worked out a detailed theoretical argument for why this would happen.
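
For reference, the allele ratios one would expect at a truly heterozygous site for a given copy number can be enumerated as below. This is a minimal sketch of the arithmetic, not YMAP's implementation; `expected_het_fractions` is a hypothetical name.

```python
from fractions import Fraction

def expected_het_fractions(copy_number):
    """Possible minor-allele fractions at a heterozygous site when
    `copy_number` copies of the chromosome are present."""
    return [Fraction(k, copy_number) for k in range(1, copy_number // 2 + 1)]

for cn in (2, 3, 4):
    print(cn, [str(f) for f in expected_het_fractions(cn)])
```

A 1:2 ratio corresponds to a minor-allele fraction of 1/3, which is possible with three copies but not with two.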