filter_non_conversion for NOMe-seq?

Hi @TimaLagunov

As it stands, filter_non_conversion only looks at the methylation calls themselves and does not re-examine the genomic context (coverage2cytosine does this later on when --nome-seq is specified, although at that point it is done per position and not per read).

I would argue, though, that adding this feature would be a major piece of work: to determine the genomic context you would essentially need to re-implement most of the code of the methylation extractor (keeping track of insertions/deletions and so on) as well as of coverage2cytosine (for the NOMe-seq filtering).

If you really wanted to go down that route I would probably rather do the following:

1) Take the (deduplicated) BAM files and proceed with the methylation extraction. Save the non-CG files.
2) Proceed with coverage file generation (--CX).
3) Proceed with coverage2cytosine --nome.

Step 3 produces a coverage file with all covered positions in GpC context.

You could now write a script that stores these GpC positions in a data structure (I would imagine there can be several hundred million of them...), and then re-processes the non-CG cytosine output files from Step 1. While doing so you could keep track, per read, of the number of methylated calls that are in non-CG AND non-GpC context, and decide to filter a read out if and when that number exceeds a certain percentage of the cytosines in the read (preferred), or an absolute number of calls (e.g. 3; more crude, but it could do a similar job). I would imagine this would take a decent amount of RAM, but it can be done.
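A minimal sketch of such a script, in Python, could look like the following. It is only an illustration under a few assumptions, not something that ships with Bismark: the function names and the `max_fraction`, `max_methylated` and `min_calls` parameters are made up for this example; the coverage file is assumed to be tab-separated with chromosome and position in the first two columns; and the extractor output is assumed to be tab-separated as read ID, methylation state (`+`/`-`), chromosome, position and call letter, with a one-line header. The fraction here is computed over the non-CG, non-GpC calls of a read, which is one possible reading of "percentage of cytosines in the read".

```python
from collections import defaultdict


def load_gpc_positions(cov_lines):
    """Collect (chromosome, position) pairs from coverage-style lines."""
    positions = set()
    for line in cov_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        positions.add((fields[0], int(fields[1])))
    return positions


def count_calls_per_read(call_lines, gpc_positions):
    """Return {read_id: [methylated, total]} for calls outside GpC context.

    The input is expected to be non-CG extractor output only, so every
    call that is kept here is both non-CG and non-GpC.
    """
    counts = defaultdict(lambda: [0, 0])
    for line in call_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skips the untabbed header line
        read_id, state, chrom, pos, _call = fields[:5]
        if (chrom, int(pos)) in gpc_positions:
            continue  # GpC context: methylation is expected in NOMe-seq
        counts[read_id][1] += 1
        if state == "+":
            counts[read_id][0] += 1
    return counts


def reads_to_filter(counts, max_fraction=0.2, max_methylated=None, min_calls=5):
    """Pick read IDs that look non-converted.

    If max_methylated is given, use the absolute threshold (cruder);
    otherwise use the fraction of methylated calls, ignoring reads with
    fewer than min_calls informative positions.
    """
    flagged = set()
    for read_id, (meth, total) in counts.items():
        if max_methylated is not None:
            if meth >= max_methylated:
                flagged.add(read_id)
        elif total >= min_calls and meth / total > max_fraction:
            flagged.add(read_id)
    return flagged
```

In practice you would pass open file handles (for example from `gzip.open(path, "rt")`) instead of lists, merge the counts from all non-CG output files, and then use the flagged read IDs to drop those reads from the BAM file or from the extractor output. A plain Python set of tuples is convenient but memory hungry for hundreds of millions of positions, so a per-chromosome bit array or similar compact structure may be needed at that scale.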

I don't think anyone at our institute has ever looked at this (or requested it), so for the time being I would say we haven't been actively working on including it as a new feature...

FelixKrueger / Bismark

filter_non_conversion for NOMe-seq? #469