dpryan79 / MethylDackel

A (mostly) universal methylation extractor for BS-seq experiments.
MIT License

maxVariantFrac #123

Open Dingersrun opened 3 years ago

Dingersrun commented 3 years ago

I wanted to filter out all possible SNPs, both homozygous and heterozygous. My understanding is that without a SNP, no non-G bases should be expected on the opposite strand, so I set --maxVariantFrac to 0. Almost all of the CpG sites were then excluded (22 million out of 23 million). When I set it to 0.1, most of the CpG sites are retained. Is this filtering really that harsh? Do you have any suggestions for filtering SNPs?

Also, is --maxVariantFrac the fraction of non-G bases on the strand opposite the C relative to the total coverage at that base, or relative to the coverage of the opposite strand only? For instance, with 10 reads from the C strand and 10 reads from the non-C strand, 3 of which are non-G, is the variant fraction 0.3 or 0.15? Thanks a lot :)

dpryan79 commented 3 years ago

0.1 is fairly reasonable. Please note that with a value of 0 you also end up filtering out any site affected by sequencing artifacts and the like, which will randomly appear with longer reads.

Dingersrun commented 3 years ago

Thanks for the reply! My other question: is the fraction of non-G reads on the non-C strand computed against the total read count at this base, or against the read count on the non-C strand only?

dpryan79 commented 3 years ago

It's the count on the non-C strand, since it's easier to assess whether there's a variant using it.
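
Applied to the example from the question (10 reads on the non-C strand, 3 of them non-G), that definition gives 3/10 = 0.3 rather than 3/20 = 0.15. A minimal sketch of the arithmetic follows. The function names are hypothetical and this is not MethylDackel's actual code; in particular, treating a site as retained when the fraction is less than or equal to the threshold is an assumption about the exact comparison.

```python
def variant_fraction(non_g_opposite: int, opposite_depth: int) -> float:
    """Fraction of non-G reads among the reads on the non-C (opposite) strand.

    Hypothetical helper: only the opposite-strand depth is the denominator,
    as described in the reply above. Reads on the C strand are ignored.
    """
    if opposite_depth == 0:
        # No opposite-strand coverage, so there is no evidence of a variant.
        return 0.0
    return non_g_opposite / opposite_depth


def passes_filter(non_g_opposite: int, opposite_depth: int,
                  max_variant_frac: float) -> bool:
    """Assumed rule: keep the site if the fraction does not exceed the threshold."""
    return variant_fraction(non_g_opposite, opposite_depth) <= max_variant_frac


# Example from the thread: 10 C-strand reads (unused), 10 opposite-strand reads,
# 3 of which are non-G.
frac = variant_fraction(non_g_opposite=3, opposite_depth=10)
print(frac)                        # 0.3
print(passes_filter(3, 10, 0.1))   # False: 0.3 is above 0.1
print(passes_filter(0, 10, 0.0))   # True: no non-G reads at all
```

Under this sketch a threshold of 0 rejects a site as soon as a single non-G read appears on the opposite strand, which is why sequencing errors alone can remove sites at that setting.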