FelixKrueger / Bismark

A tool to map bisulfite converted sequence reads and determine cytosine methylation states
http://felixkrueger.github.io/Bismark/
GNU General Public License v3.0
381 stars 101 forks source link

Inconsistent bismark bam output between different --multicore setting #483

Closed yangdingyangding closed 2 years ago

yangdingyangding commented 2 years ago

Hello,

Thank you for your developing bismark for the community. We, however, observed inconsistent bismark bam output between different --multicore setting. These bams would lead to ~0.2% difference in the bismark_methylation_extractor output. Could you please take a look on this? Thanks in advance!

We have used bismark v0.23.0 with bowtie2 2.4.5.

FelixKrueger commented 2 years ago

Hi there, could you please make available a minimal test case, and provide all the commands involved in your workflow. Cheers, Felix

yangdingyangding commented 2 years ago

Hi there, could you please make available a minimal test case, and provide all the commands involved in your workflow. Cheers, Felix

Hi,

We tested the pipeline (bismark v0.23.1, samtools v1.11 and bowtie2 v2.4.5; the hg38.fa genome is indexed by bismark_genome_preparation --parallel 4 --verbose --bowtie2 --path_to_aligner ${bowtie2Dir} genome_dir ) on the publicly avaliable dataset SRR12545849 with the following script run.bismark.sh , which accepts a single parameter as --multicore:

#!/bin/sh

## $1 is the no. of threads                                                                                                                              mkdir -p output.$1 temp.$1 log.$1 methylation.call.$1

bismark --multicore $1 \
        --non_directional --unmapped \
        --nucleotide_coverage --gzip \
        --fastq --bowtie2 \
        --output_dir output.$1 \
        --temp_dir temp.$1 \
        genome_dir \
        -1 test.rrbs/SRR12545849/SRR12545849_1.fastq.gz -2 test.rrbs/SRR12545849/SRR12545849_2.fastq.gz \
        log.$1

samtools sort output.$1/SRR12545849_1_bismark_bt2_pe.bam -o output.$1/SRR12545849_1_bismark_bt2_pe.sorted.bam -@ 5

picard MarkDuplicates -I output.$1/SRR12545849_1_bismark_bt2_pe.sorted.bam -O output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.bam -M output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.metrics -REMOVE_DUPLICATES true

samtools sort -n output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.bam -o output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.bam -@ 5

bismark_methylation_extractor -p --multicore $1 --comprehensive \
                              --bedGraph --counts --buffer_size 50% --report --cytosine_report \
                              --gzip \
                              --genome_folder genome_dir \
                              -o methylation.call.$1 \
                              output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.bam

We tested cases with --multicore ranging from 1 to 5:

bash run.bismark.sh 1
bash run.bismark.sh 2
bash run.bismark.sh 3
bash run.bismark.sh 4
bash run.bismark.sh 5

We then computed the inconsistency of each final output file between two different --multicore settings, defined as (the count of non-overlapping lines between the two) / (line count of all non-redundant lines between the two) * 100. Below is an example code for comparing CHG output files between different multicore settings:

import numpy as np
import pandas as pd

context_names = ["id", "strand", "chr", "count", "tag"]
CHG_df_list = []

for i in range(1,6):
    CHG_df_list.append(pd.read_csv("methylation.call." + str(i) + "/CHG_context_SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.txt", delimiter="\t", skipro\ws=1, names=context_names))

CHG_diff_matrix = np.zeros((5, 5), dtype=float)
for i in range(5):
    for j in range(i + 1, 5):
        CHG_outer = pd.merge(CHG_df_list[i], CHG_df_list[j], on=context_names, how='outer')
        CHG_inner = pd.merge(CHG_df_list[i], CHG_df_list[j], on=context_names, how='inner')
        CHG_diff_matrix[i][j] = (CHG_outer.shape[0] - CHG_inner.shape[0]) / CHG_outer.shape[0] * 100

diff_colnames = ['1', '2', '3', '4', '5']
CHG_diff_df = pd.DataFrame(CHG_diff_matrix, columns=diff_colnames)

CHG_diff_df.to_csv("diff/CHG.diff.matrix.csv", header=True, index=False)

Below are the inconsistency table for each output file:

CHG_context (CHG_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore |   2   |   3   |   4   |   5   |
|:---------:|:-----:|:-----:|:-----:|:-----:|
|     1     | 0.790 | 0.956 | 1.168 | 1.154 |
|     2     |       | 1.246 | 0.706 | 1.340 |
|     3     |       |       | 1.360 | 1.352 |
|     4     |       |       |       | 1.386 |

CHH_context (CHH_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore |   2   |   3   |   4   |   5   |
|:---------:|:-----:|:-----:|:-----:|:-----:|
|     1     | 0.758 | 0.924 | 1.150 | 1.098 |
|     2     |       | 1.224 | 0.702 | 1.246 |
|     3     |       |       | 1.370 | 1.324 |
|     4     |       |       |       | 1.344 |

CpG_context (CpG_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore |   2   |   3   |   4   |   5   |
|:---------:|:-----:|:-----:|:-----:|:-----:|
|     1     | 0.774 | 0.838 | 1.036 | 1.016 |
|     2     |       | 1.104 | 0.620 | 1.244 |
|     3     |       |       | 1.194 | 1.202 |
|     4     |       |       |       | 1.258 |

bedGraph (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.bedGraph.gz)
| Multicore |   2   |   3   |   4   |    5   |
|:---------:|:-----:|:-----:|:-----:|:------:|
|     1     | 0.004 | 0.006 | 0.004 |  0.002 |
|     2     |       | 0.002 | 0.004 |  0.006 |
|     3     |       |       | 0.002 |  0.008 |
|     4     |       |       |       |  0.006 |

cov (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.bismark.cov.gz)
| Multicore |   2   |   3   |   4   |    5   |
|:---------:|:-----:|:-----:|:-----:|:------:|
|     1     | 0.004 | 0.006 | 0.004 |  0.002 |
|     2     |       | 0.002 | 0.004 |  0.006 |
|     3     |       |       | 0.002 |  0.008 |
|     4     |       |       |       |  0.006 |

M-bias (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.M-bias.txt)
| Multicore |   2  |   3  |   4  |   5  |
|:---------:|:----:|:----:|:----:|:----:|
|     1     | 15.2 | 30.6 | 20.0 | 22.6 |
|     2     |      | 28.6 | 11.5 | 13.8 |
|     3     |      |      | 31.8 | 32.6 |
|     4     |      |      |      | 19.5 |
FelixKrueger commented 2 years ago

Thanks for providing an example. I have now downloaded the file in question and did a few tests myself.

As a preliminary summary I would suggest that the Bismark mapping is perfectly consistent between different runs, the differences are introduced by introducing several sort and Picard steps along the way.

Here are a few more details:

A typical alignment command would be

bismark --multicore $1 --genome /genome_dir/ -1 SRR12545849_GSM4753395_B3_scRRBS_1_val_1.fq.gz -2 SRR12545849_GSM4753395_B3_scRRBS_2_val_2.fq.gz

I have then trialled --parallel 1, --parallel 2 and --parallel 3 (the same as multicore). The only thing that changes is the execution time, as expected:

diff parallel 1 vs parallel 3:

diff SRR12545849_GSM4753395_B3_scRRBS_parallel_1_bismark_bt2_PE_report.txt SRR12545849_GSM4753395_B3_scRRBS_parallel_3_bismark_bt2_PE_report.txt
42c42
< Bismark completed in 0d 0h 6m 43s
---
> Bismark completed in 0d 0h 2m 37s
42c42

diff parallel 2 vs parallel 3:

diff SRR12545849_GSM4753395_B3_scRRBS_parallel_2_bismark_bt2_PE_report.txt SRR12545849_GSM4753395_B3_scRRBS_parallel_3_bismark_bt2_PE_report.txt
< Bismark completed in 0d 0h 3m 12s
---
> Bismark completed in 0d 0h 2m 37s

So whatever caused the discrepancies must be introduced at subsequent steps you carried out using third party programs.

deduplicate_bismark SRR12545849_GSM4753395_B3_scRRBS_parallel_1_GRCh38_bismark_bt2_pe.bam

Again, I would probably advise not to do this even though it is mentioned for the GEO entry.

TLDR: Reads should be trimmed before mapping. The sort-by-position, mark duplicates with picard, re-sort by seqname workflow is likely to introduce errors on several levels, apart from the fact that --non_directional mapping and deduplication are not advised for this dataset in the first place. Attached is a MultiQC report of the file in questions for the 3 different mapping modes.

I hope this helps.

multiqc_report.html.zip

yangdingyangding commented 2 years ago

Thank you for your detailed answer and analysis!

We noticed that you have already provided a protocol at the bismark homepage (https://www.epigenesys.eu/en/protocols/bio-informatics/483-quality-control-trimming-and-alignment-of-bisulfite-seq-data-prot-57), where you have mentioned not removing duplicates for RRBS samples (but did not mention using deduplicate_bismark instead of third-party tools). Is there any additional resource we could refer to for better design of the analysis pipeline?

FelixKrueger commented 2 years ago

I don't think there is a specific resource, but I here is a guide to RRBS data we have written a while ago: RRBS Guide

This is the pipeline we are using internally for RRBS data (taken from this Nextflow pipeline)

SYNOPSIS:
    This workflow runs a reduced-representation Bisulfite-seq (RRBS) processing pipeline on FastQ files. This includes QC,
    quality-/adapter- and RRBS-trimming, contamination QC (post-trimming), alignments to a genome using Bismark,
    NO deduplication, methylation extraction, coverage file generation, and finally generates an aggregate MultiQC report.
    The workflow is suitable for any RRBS sequencing experiment using MspI. Here is a graphical representation of the workflow:

    --- FastQC
    --- Trim Galore
        |
        --- FastQ Screen
        --- FastQC
        --- Bismark 
            |
            --- Bismark methylation extraction             
    --- bismark2report*
    --- bismark2summary*
    --- MultiQC*

And finally, here are some slides and exercises we used for a methylation data analysis training course. I hope this will help?

yangdingyangding commented 2 years ago

Great, we'll try these and let you know if we have any further problems. Thanks again for your help!