Open Dingersrun opened 3 years ago
0.1 is fairly reasonable. Please note that with a stricter threshold you end up filtering out sites because of sequencing artifacts and the like, which appear at random with longer reads.
Thanks for the reply! My other question: is the fraction of non-G reads on the non-C strand computed against the total read count at this base, or against the read count on the non-C strand only?
It's the count on the non-C strand, since it's easier to assess whether there's a variant using it.
I wanted to filter out all possible SNPs, both homozygous and heterozygous. My understanding is that without a SNP no non-G reads should be expected, so I set this to 0, but then almost all the CpG sites were excluded (22 million out of 23 million). When I set it to 0.1, most of the CpG sites are retained. Is this filtering really that harsh? Do you have any suggestions for filtering the SNPs? Also, is --maxVariantFrac the fraction of non-G reads on the strand opposite the C relative to the total coverage at the given base, or relative to the coverage of the opposite strand only? For instance, with 10 reads from the C strand and 10 reads from the non-C strand, 3 of which are non-G, is the variant fraction 0.3 or 0.15? Thanks a lot :)
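For anyone reading along, here is a small arithmetic sketch of the example above. It only illustrates the answer given in this thread (the denominator is the coverage on the non-C strand); it is not the tool's actual code. The function name `variant_fraction` is made up for illustration, and treating a site as excluded when the fraction is strictly greater than the threshold is an assumption.

```python
def variant_fraction(non_g_reads: int, opposite_strand_depth: int) -> float:
    """Fraction of non-G reads among reads on the strand opposite the C.

    Hypothetical helper for illustration only; the denominator follows the
    reply in this thread (coverage on the non-C strand, not total coverage).
    """
    if opposite_strand_depth <= 0:
        raise ValueError("opposite_strand_depth must be positive")
    return non_g_reads / opposite_strand_depth


# Numbers from the question: 10 reads on the C strand, 10 on the non-C
# strand, 3 of the non-C strand reads are non-G.
c_strand_depth = 10
opposite_strand_depth = 10
non_g_reads = 3

frac_opposite = variant_fraction(non_g_reads, opposite_strand_depth)
frac_total = non_g_reads / (c_strand_depth + opposite_strand_depth)

print(f"against non-C strand coverage: {frac_opposite:.2f}")
print(f"against total coverage:        {frac_total:.2f}")

# Why a threshold of 0 is so strict: a single mismatching read (for example
# a sequencing error) already gives a fraction above 0.
# Assumption: a site is excluded when its fraction is strictly greater than
# the threshold.
one_error = variant_fraction(1, 30)
print(f"1 non-G read in 30: {one_error:.3f}",
      "| excluded at 0:", one_error > 0.0,
      "| excluded at 0.1:", one_error > 0.1)
```

So with the denominator described in the reply, the example site has a variant fraction of 0.3 rather than 0.15, and with a threshold of 0 one stray non-G read on the opposite strand is enough to push a site over the limit.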