FelixKrueger / SNPsplit

Allele-specific alignment sorting
http://felixkrueger.github.io/SNPsplit/
GNU General Public License v3.0
52 stars 20 forks source link

Keeping reads at CG SNPs using Bismark #17

Closed jrmerritt6 closed 6 years ago

jrmerritt6 commented 6 years ago

I'm working on a bisulfite sequencing project with heterozygotes that have SNPs in CG sites. These SNPs in CpG sites are of primary interest to the research question. Some, but not all, of these SNPs are C/T. However, my bedgraph data never have any methylation calls at SNPs in CG sites when using the bismark2bedGraph function, which is because I was aligning reads to an N-masked genome.

I have a well-defined list of 100% fixed SNPs for genome1 and genome2. I considered aligning the bam files to genome1 then using SNPsplit (as discussed under "Utilization of SNP positions and allele assignment of bisulfite reads" in the User Guide), but then there is a mapping bias toward genome1.

Is there a way to keep methylation calls, while avoiding mapping bias, in the SNPs?

My current workflow is to align reads to an N-masked reference genome using Bismark, then the bam files are used as input for SNPsplit to assign reads to haplotype, and I then extract methylation:

SNPsplit --snp_file <SNP.file.gz> <inputfiles>
bismark_methylation_extractor <val_1_bismark_bt2_pe.genome1.bam>
bismark2bedGraph <CpG_OT_R1_val_1_bismark_bt2_pe.genome1.txt> -o <R1>
FelixKrueger commented 6 years ago

The reason for entirely removing C/T SNP positions in a bisulfite experiment is obviously that you cannot trust the result at all. Let's assume genome 1 would normally be a C, but due to its methylation state you might see it as either a C (methylated) or T (unmethylated) after bisulfite treatment. Genome 2 would always have a T at that position in any case. While a C in a read would have to be a (methylated) genome 1 read, there is no way to discriminate between a genome 2 or unmethylated genome 1 read, unless the read contains additional informative SNPs at the same time. In other words, the methylation state might not be a real methylation state but could be confounding signal coming simply from the underlying genomic sequence (that just looks like non-methylation).

If you still wanted to include reads covering those C/T positions (I am not sure you should be wanting that...) and I am not mistaken, you should be able to pre-filter your list of SNPs to exclude C/T SNPs, and then run the N-masking and indexing again. This way you wouldn't N-mask and thus include all C/T positions, and therefore also get 'methylation calls'. You should not observe any mapping bias in this case because Cs are in-silico C-to-T converted to Ts for the mapping procedure anyway. Since those positions were not N-masked they would also not participate in allele (or haplotype) assignment, but please be aware that the 'methylation' you observe at these positions might be confounded by genetics, i.e. the T SNP.

Do I make sense?

jrmerritt6 commented 6 years ago

Yes, I do see your point about the confound of looking at methylation at C/T SNPs. The good thing is that I'm studying F1 hybrids, and we have a very good list of non-CG fixed SNPs to assign reads to haplotype, so dissecting out methylation vs. SNP is possible with my particular system. The approach you suggested, excluding C/T SNPs, worked for me. Thank you!