FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
391 stars 101 forks source link

Removal of unspecific reads when using enrichment protocols #211

Closed SteffanChristiansen closed 5 years ago

SteffanChristiansen commented 6 years ago

Dear Felix,

I am using Methyl-Seq Target enrichment system (Agilent) to prepare my samples for bisulfite sequencing, which allows me to investigate specific regions. The kit uses positive probes to capture negative stranded DNA fragments. Currently, I am testing Bismark for mapping and methylation calling and have the following problem.

Approximately half of the reads from my experiments are unspecific and originates either from regions not covered by probes or alternatively from the positive strand in regions covered by probes. When I use other mapping programs the stats is more or less the same.

After the mapping procedure, I remove unspecific reads outside the probe covered regions by using samtools: samtools intersect –a input.bam –b probe_covered_regions.bed > output.bam

while unspecific reads from the positive strand are removed by following commands: samtools view -b -f 83 input.bam > 83_flagged_reads.bam samtools view -b -f 163 input.bam > 163_flagged_reads.bam samtools merge 83_flagged_reads.bam 163_flagged_reads.bam output.bam

When I try to run methylation extraction on the output from samtools intersect or from the flag sorted output, I get the same error after running (however, with different reads in error message):

/mnt/NGS01/biss/opt/Bismark_v0.19.0/bismark_methylation_extractor bismark_methylation_extraction/output.bam -p --output bismark_methylation_extraction/

FATAL ERROR: The IDs of Read 1 (M03048:173:000000000-BK74P:1:2109:12864:26815_1:N:0:1) and Read 2 (M03048:173:000000000-BK74P:1:2109:12878:17490_1:N:0:1) are not the same. This might be the result of sorting the BAM files by chromosomal position or merging several files with Samtools sort, and this is not compatible with correct methylation extraction. Please use an unsorted file instead or sort the file by name using the command 'samtools sort -n'. Paired-end files may be merged properly (without risking this error) using either 'samtools merge -n' or 'samtools cat'

It seems like the samtools commands interfere with the ability to perform the methylation calling which have been described in other posts as well. I have tried to run samtools sort –n, but it does solve the issue.

Do you know any tools compatible with Bismark for the sorting procedure or for the selection of reads from only one strand?

Thank you in advance, Steffan Christiansen

FelixKrueger commented 6 years ago

Hi Steffan,

I suppose the problem in your case are scenarios like this:

                                   ======================================= Region of interest
111111111111111111111111111>>>
                                     <<<2222222222222222222222222222222222

Where one read overlaps the region you wanted to enrich for (here Read 2, but the other end doesn't (Read 1). For paired-end data, Bismark requires both R1 and R2 for the methylation extraction (so that read overlap extraction works well), but if some reads are removed it will prevent you from running it, and die. You do have several options here though:

1) The easiest option from a Bismark point of view is certainly to run the methylation extraction, and coverage file conversion, straight on the BAM file as it is produced by Bismark. I would then post filter on the level of methylation calls rather on the read level.

2) Another easy option would be to use SeqMonk for this. You could import Bismark coverage files as they are, and add your capture probe regions as annotation track. You could then either focus on analysing only data in these regions, or even physically remove the data from other regions of the genome (by doing Data > Import > Visible Datastore, and then ticking the option Excluding reads outside "Capture Probes").

3) A third option might require some scripting: If you wanted to follow the Samtools region approach, you should be able to read the BAM file with regions you want to keep, and store the Read IDs in a data structure. You could then re-read the original Bismark file and either keep or discard both reads of a pair depending on whether they overlap with the regions of interest. In that way you should end up only with valid read pairs.

4) Another option that might work would be to take your filtered, name sorted BAM file, and write a quick script that will go through it and remove all reads where only one read of a pair is present. This should equally yield a BAm file with only valid read pairs.

5) And finally you should be able to treat the filtered (no longer valid) paired-end files as single-end files. (flag -s). This has two disadvantages compared to paired-end extraction though: 1) the overlap detection between R1 and R2 will no longer work, so overlapping portions of read pair will receive some extra methylation calls. And 2), the strand identity will look different, i.e. you will then also see CTOT and CTOB alignments from directional data etc. The methylation calls will be correct (in fact the coverage files should look identical), but the strand origin will look a bit weird.

If I had a choice I would probably pick one of the options 1-4. I hope this helps. Best, Felix

SteffanChristiansen commented 6 years ago

Thanks a lot for your detailed suggestions! Since my scripting skills are a bit limited, solution 1) and 2) may fit best for me.

Are SeqMonk capable of removing/selecting one of the strands to remove unspecific reads originating from the same strand as the probes or will it require some scripting before the methylation calling?

Best, Steffan

FelixKrueger commented 6 years ago

Hi Steffan,

SeqMonk can also import the initial CpG_... context files that are being produced by the methylation extractor, and since these have the strand in their file names (OT/CTOT come from the top strand, OB/CTOB come from the bottom strand) you can then treat or filter them individually. Once you have done what you want to do you merge all strands into a data group, where they will then look exactly like if you started from a .cov(.gz) file in the first place.

Here is a link to the SeqMonk documentation, or the training course. Once you have familiarised yourself with SeqMonk you might want to take a look at visualisation/quntification options specific for methylation data. Hope this helps.

SteffanChristiansen commented 5 years ago

Dear Felix

Just a brief follow-up:

I performed a test of a small region where no read pairs were split using the samtools intersect approach. Using an unsorted file, I am able to run the methylation extraction without any issues, so it seems like you were right about the scenario.

Additionally, I tried 2) and 5) and they works as you predicted.

Thanks a lot for your help.

Best, Steffan