Closed TimaLagunov closed 2 years ago
Hi @TimaLagunov
As it stands, filter_non_conversion
only looks at the methylation calls as such, and does not re-examine the genomic context itself (something that is done by coverage2cytosine
later on when --nome-seq
is specifified, even though at this point this is done per position and not per-read).
I would probably argue thought that adding this feature would be a major piece of work; this is because to determine the genomic context you would kind of need to re-implement most of the code of the methylation extractor (keeping track of insertions/deletions and so on) as well as coverage2cytosine (for the NOMe-seq filtering).
If you really wanted to go down that route I would probably rather do the following:
1) take the (deduplicated) BAM files and proceed with the methylation extraction. Save the non-CG files.
2) proceed with coverage file generation (--CX
)
3) proceed with coverage2cytosine --nome
Step 3 produces a coverage file with all reads in GpC context
.
You could now write a script that stores these GpC positions in a data structure (there can probably be several hundred million I would imagine...), and then re-process the non-CG cytosine output files from Step 1. You could then keep track of the number of methylated calls in non-CG AND non-GpC context, and decide to filter these reads out if and when the number exceeds a certain percentage cytosines in the total read (preferred), or an absolute of calls (e.g. 3, more crude but could also do a similar job). I would imagine this would probably take a decent amount of RAM, but it can be done.
I don't think anyone at our institute has ever looked at this (or requested it), so for the time being I would say we haven't been actively working on including it as a new feature...
Hi, Felix!
Thanks for the awesome tool! I'm using it for NOMe-seq experiment data analysis and I want to reduce the amount of mistakes. In IGV I see that for some reads the bisulfite conversion doesn't work (all C aren't converted regardless of the context). So I want to use _filter_nonconversion to filter reads with >3 methylated C in non-CG AND non-GC contexts. But your tool provides only filtering non-CG context. Is there any solution for this problem?
*Maybe it's already in plans for next Bismark version? Or maybe there is an easy way to add methylation call for GC contexts to bam file?