Inconsistent bismark bam output between different --multicore setting

yangdingyangding commented 2 years ago

Hello,

Thank you for your developing bismark for the community. We, however, observed inconsistent bismark bam output between different --multicore setting. These bams would lead to ~0.2% difference in the bismark_methylation_extractor output. Could you please take a look on this? Thanks in advance!

We have used bismark v0.23.0 with bowtie2 2.4.5.

FelixKrueger commented 2 years ago

Hi there, could you please make available a minimal test case, and provide all the commands involved in your workflow. Cheers, Felix

yangdingyangding commented 2 years ago

Hi there, could you please make available a minimal test case, and provide all the commands involved in your workflow. Cheers, Felix

Hi,

We tested the pipeline (bismark v0.23.1, samtools v1.11 and bowtie2 v2.4.5; the hg38.fa genome is indexed by bismark_genome_preparation --parallel 4 --verbose --bowtie2 --path_to_aligner ${bowtie2Dir} genome_dir ) on the publicly avaliable dataset SRR12545849 with the following script run.bismark.sh , which accepts a single parameter as --multicore:

#!/bin/sh

## $1 is the no. of threads                                                                                                                              mkdir -p output.$1 temp.$1 log.$1 methylation.call.$1

bismark --multicore $1 \
        --non_directional --unmapped \
        --nucleotide_coverage --gzip \
        --fastq --bowtie2 \
        --output_dir output.$1 \
        --temp_dir temp.$1 \
        genome_dir \
        -1 test.rrbs/SRR12545849/SRR12545849_1.fastq.gz -2 test.rrbs/SRR12545849/SRR12545849_2.fastq.gz \
        log.$1

samtools sort output.$1/SRR12545849_1_bismark_bt2_pe.bam -o output.$1/SRR12545849_1_bismark_bt2_pe.sorted.bam -@ 5

picard MarkDuplicates -I output.$1/SRR12545849_1_bismark_bt2_pe.sorted.bam -O output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.bam -M output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.metrics -REMOVE_DUPLICATES true

samtools sort -n output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.bam -o output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.bam -@ 5

bismark_methylation_extractor -p --multicore $1 --comprehensive \
                              --bedGraph --counts --buffer_size 50% --report --cytosine_report \
                              --gzip \
                              --genome_folder genome_dir \
                              -o methylation.call.$1 \
                              output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.bam

We tested cases with --multicore ranging from 1 to 5:

bash run.bismark.sh 1
bash run.bismark.sh 2
bash run.bismark.sh 3
bash run.bismark.sh 4
bash run.bismark.sh 5

We then computed the inconsistency of each final output file between two different --multicore settings, defined as (the count of non-overlapping lines between the two) / (line count of all non-redundant lines between the two) * 100. Below is an example code for comparing CHG output files between different multicore settings:

import numpy as np
import pandas as pd

context_names = ["id", "strand", "chr", "count", "tag"]
CHG_df_list = []

for i in range(1,6):
    CHG_df_list.append(pd.read_csv("methylation.call." + str(i) + "/CHG_context_SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.txt", delimiter="\t", skipro\ws=1, names=context_names))

CHG_diff_matrix = np.zeros((5, 5), dtype=float)
for i in range(5):
    for j in range(i + 1, 5):
        CHG_outer = pd.merge(CHG_df_list[i], CHG_df_list[j], on=context_names, how='outer')
        CHG_inner = pd.merge(CHG_df_list[i], CHG_df_list[j], on=context_names, how='inner')
        CHG_diff_matrix[i][j] = (CHG_outer.shape[0] - CHG_inner.shape[0]) / CHG_outer.shape[0] * 100

diff_colnames = ['1', '2', '3', '4', '5']
CHG_diff_df = pd.DataFrame(CHG_diff_matrix, columns=diff_colnames)

CHG_diff_df.to_csv("diff/CHG.diff.matrix.csv", header=True, index=False)

Below are the inconsistency table for each output file:

CHG_context (CHG_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore |   2   |   3   |   4   |   5   |
|:---------:|:-----:|:-----:|:-----:|:-----:|
|     1     | 0.790 | 0.956 | 1.168 | 1.154 |
|     2     |       | 1.246 | 0.706 | 1.340 |
|     3     |       |       | 1.360 | 1.352 |
|     4     |       |       |       | 1.386 |

CHH_context (CHH_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore |   2   |   3   |   4   |   5   |
|:---------:|:-----:|:-----:|:-----:|:-----:|
|     1     | 0.758 | 0.924 | 1.150 | 1.098 |
|     2     |       | 1.224 | 0.702 | 1.246 |
|     3     |       |       | 1.370 | 1.324 |
|     4     |       |       |       | 1.344 |

CpG_context (CpG_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore |   2   |   3   |   4   |   5   |
|:---------:|:-----:|:-----:|:-----:|:-----:|
|     1     | 0.774 | 0.838 | 1.036 | 1.016 |
|     2     |       | 1.104 | 0.620 | 1.244 |
|     3     |       |       | 1.194 | 1.202 |
|     4     |       |       |       | 1.258 |

bedGraph (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.bedGraph.gz)
| Multicore |   2   |   3   |   4   |    5   |
|:---------:|:-----:|:-----:|:-----:|:------:|
|     1     | 0.004 | 0.006 | 0.004 |  0.002 |
|     2     |       | 0.002 | 0.004 |  0.006 |
|     3     |       |       | 0.002 |  0.008 |
|     4     |       |       |       |  0.006 |

cov (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.bismark.cov.gz)
| Multicore |   2   |   3   |   4   |    5   |
|:---------:|:-----:|:-----:|:-----:|:------:|
|     1     | 0.004 | 0.006 | 0.004 |  0.002 |
|     2     |       | 0.002 | 0.004 |  0.006 |
|     3     |       |       | 0.002 |  0.008 |
|     4     |       |       |       |  0.006 |

M-bias (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.M-bias.txt)
| Multicore |   2  |   3  |   4  |   5  |
|:---------:|:----:|:----:|:----:|:----:|
|     1     | 15.2 | 30.6 | 20.0 | 22.6 |
|     2     |      | 28.6 | 11.5 | 13.8 |
|     3     |      |      | 31.8 | 32.6 |
|     4     |      |      |      | 19.5 |

FelixKrueger commented 2 years ago

Thanks for providing an example. I have now downloaded the file in question and did a few tests myself.

As a preliminary summary I would suggest that the Bismark mapping is perfectly consistent between different runs, the differences are introduced by introducing several sort and Picard steps along the way.

Here are a few more details:

The dataset you linked is a scRRBS sample, so you should not align the data as is, but trim it with Trim Galore (probably with --rrbs first)
the datasets is directional, so you must not use --non_directional

A typical alignment command would be

bismark --multicore $1 --genome /genome_dir/ -1 SRR12545849_GSM4753395_B3_scRRBS_1_val_1.fq.gz -2 SRR12545849_GSM4753395_B3_scRRBS_2_val_2.fq.gz

I have then trialled --parallel 1, --parallel 2 and --parallel 3 (the same as multicore). The only thing that changes is the execution time, as expected:

diff parallel 1 vs parallel 3:

diff SRR12545849_GSM4753395_B3_scRRBS_parallel_1_bismark_bt2_PE_report.txt SRR12545849_GSM4753395_B3_scRRBS_parallel_3_bismark_bt2_PE_report.txt
42c42
< Bismark completed in 0d 0h 6m 43s
---
> Bismark completed in 0d 0h 2m 37s
42c42

diff parallel 2 vs parallel 3:

diff SRR12545849_GSM4753395_B3_scRRBS_parallel_2_bismark_bt2_PE_report.txt SRR12545849_GSM4753395_B3_scRRBS_parallel_3_bismark_bt2_PE_report.txt
< Bismark completed in 0d 0h 3m 12s
---
> Bismark completed in 0d 0h 2m 37s

So whatever caused the discrepancies must be introduced at subsequent steps you carried out using third party programs.

As this data is RRBS, you should probably not carry out deduplication at all. If you still wanted to do it, you could just use Bismark's deduplicator, like so:

deduplicate_bismark SRR12545849_GSM4753395_B3_scRRBS_parallel_1_GRCh38_bismark_bt2_pe.bam

Again, I would probably advise not to do this even though it is mentioned for the GEO entry.

And finally, samtools sort -n by position does not accurately sort reads that are mapped in a non-directional manner back together.

TLDR: Reads should be trimmed before mapping. The sort-by-position, mark duplicates with picard, re-sort by seqname workflow is likely to introduce errors on several levels, apart from the fact that --non_directional mapping and deduplication are not advised for this dataset in the first place. Attached is a MultiQC report of the file in questions for the 3 different mapping modes.

I hope this helps.

multiqc_report.html.zip

yangdingyangding commented 2 years ago

Thank you for your detailed answer and analysis!

We noticed that you have already provided a protocol at the bismark homepage (https://www.epigenesys.eu/en/protocols/bio-informatics/483-quality-control-trimming-and-alignment-of-bisulfite-seq-data-prot-57), where you have mentioned not removing duplicates for RRBS samples (but did not mention using deduplicate_bismark instead of third-party tools). Is there any additional resource we could refer to for better design of the analysis pipeline?

FelixKrueger commented 2 years ago

I don't think there is a specific resource, but I here is a guide to RRBS data we have written a while ago: RRBS Guide

This is the pipeline we are using internally for RRBS data (taken from this Nextflow pipeline)

SYNOPSIS:
    This workflow runs a reduced-representation Bisulfite-seq (RRBS) processing pipeline on FastQ files. This includes QC,
    quality-/adapter- and RRBS-trimming, contamination QC (post-trimming), alignments to a genome using Bismark,
    NO deduplication, methylation extraction, coverage file generation, and finally generates an aggregate MultiQC report.
    The workflow is suitable for any RRBS sequencing experiment using MspI. Here is a graphical representation of the workflow:

    --- FastQC
    --- Trim Galore
        |
        --- FastQ Screen
        --- FastQC
        --- Bismark 
            |
            --- Bismark methylation extraction             
    --- bismark2report*
    --- bismark2summary*
    --- MultiQC*

And finally, here are some slides and exercises we used for a methylation data analysis training course. I hope this will help?

yangdingyangding commented 2 years ago

Great, we'll try these and let you know if we have any further problems. Thanks again for your help!

FelixKrueger / Bismark

Inconsistent bismark bam output between different --multicore setting #483