FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
386 stars 101 forks source link

filter_non_conversion for NOMe-seq? #469

Closed TimaLagunov closed 2 years ago

TimaLagunov commented 2 years ago

Hi, Felix!

Thanks for the awesome tool! I'm using it for NOMe-seq experiment data analysis and I want to reduce the amount of mistakes. In IGV I see that for some reads the bisulfite conversion doesn't work (all C aren't converted regardless of the context). So I want to use _filter_nonconversion to filter reads with >3 methylated C in non-CG AND non-GC contexts. But your tool provides only filtering non-CG context. Is there any solution for this problem?

*Maybe it's already in plans for next Bismark version? Or maybe there is an easy way to add methylation call for GC contexts to bam file?

FelixKrueger commented 2 years ago

Hi @TimaLagunov

As it stands, filter_non_conversion only looks at the methylation calls as such, and does not re-examine the genomic context itself (something that is done by coverage2cytosine later on when --nome-seq is specifified, even though at this point this is done per position and not per-read).

I would probably argue thought that adding this feature would be a major piece of work; this is because to determine the genomic context you would kind of need to re-implement most of the code of the methylation extractor (keeping track of insertions/deletions and so on) as well as coverage2cytosine (for the NOMe-seq filtering).

If you really wanted to go down that route I would probably rather do the following:

1) take the (deduplicated) BAM files and proceed with the methylation extraction. Save the non-CG files. 2) proceed with coverage file generation (--CX) 3) proceed with coverage2cytosine --nome

Step 3 produces a coverage file with all reads in GpC context.

You could now write a script that stores these GpC positions in a data structure (there can probably be several hundred million I would imagine...), and then re-process the non-CG cytosine output files from Step 1. You could then keep track of the number of methylated calls in non-CG AND non-GpC context, and decide to filter these reads out if and when the number exceeds a certain percentage cytosines in the total read (preferred), or an absolute of calls (e.g. 3, more crude but could also do a similar job). I would imagine this would probably take a decent amount of RAM, but it can be done.

I don't think anyone at our institute has ever looked at this (or requested it), so for the time being I would say we haven't been actively working on including it as a new feature...