Closed yangdingyangding closed 2 years ago
Hi there, could you please make available a minimal test case, and provide all the commands involved in your workflow. Cheers, Felix
Hi there, could you please make available a minimal test case, and provide all the commands involved in your workflow. Cheers, Felix
Hi,
We tested the pipeline (bismark v0.23.1, samtools v1.11 and bowtie2 v2.4.5; the hg38.fa
genome is indexed by bismark_genome_preparation --parallel 4 --verbose --bowtie2 --path_to_aligner ${bowtie2Dir} genome_dir
) on the publicly avaliable dataset SRR12545849 with the following script run.bismark.sh
, which accepts a single parameter as --multicore
:
#!/bin/sh
## $1 is the no. of threads mkdir -p output.$1 temp.$1 log.$1 methylation.call.$1
bismark --multicore $1 \
--non_directional --unmapped \
--nucleotide_coverage --gzip \
--fastq --bowtie2 \
--output_dir output.$1 \
--temp_dir temp.$1 \
genome_dir \
-1 test.rrbs/SRR12545849/SRR12545849_1.fastq.gz -2 test.rrbs/SRR12545849/SRR12545849_2.fastq.gz \
log.$1
samtools sort output.$1/SRR12545849_1_bismark_bt2_pe.bam -o output.$1/SRR12545849_1_bismark_bt2_pe.sorted.bam -@ 5
picard MarkDuplicates -I output.$1/SRR12545849_1_bismark_bt2_pe.sorted.bam -O output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.bam -M output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.metrics -REMOVE_DUPLICATES true
samtools sort -n output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.bam -o output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.bam -@ 5
bismark_methylation_extractor -p --multicore $1 --comprehensive \
--bedGraph --counts --buffer_size 50% --report --cytosine_report \
--gzip \
--genome_folder genome_dir \
-o methylation.call.$1 \
output.$1/SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.bam
We tested cases with --multicore
ranging from 1 to 5:
bash run.bismark.sh 1
bash run.bismark.sh 2
bash run.bismark.sh 3
bash run.bismark.sh 4
bash run.bismark.sh 5
We then computed the inconsistency of each final output file between two different --multicore
settings, defined as (the count of non-overlapping lines between the two) / (line count of all non-redundant lines between the two) * 100. Below is an example code for comparing CHG output files between different multicore settings:
import numpy as np
import pandas as pd
context_names = ["id", "strand", "chr", "count", "tag"]
CHG_df_list = []
for i in range(1,6):
CHG_df_list.append(pd.read_csv("methylation.call." + str(i) + "/CHG_context_SRR12545849_1_bismark_bt2_pe.rmdup.unsorted.txt", delimiter="\t", skipro\ws=1, names=context_names))
CHG_diff_matrix = np.zeros((5, 5), dtype=float)
for i in range(5):
for j in range(i + 1, 5):
CHG_outer = pd.merge(CHG_df_list[i], CHG_df_list[j], on=context_names, how='outer')
CHG_inner = pd.merge(CHG_df_list[i], CHG_df_list[j], on=context_names, how='inner')
CHG_diff_matrix[i][j] = (CHG_outer.shape[0] - CHG_inner.shape[0]) / CHG_outer.shape[0] * 100
diff_colnames = ['1', '2', '3', '4', '5']
CHG_diff_df = pd.DataFrame(CHG_diff_matrix, columns=diff_colnames)
CHG_diff_df.to_csv("diff/CHG.diff.matrix.csv", header=True, index=False)
Below are the inconsistency table for each output file:
CHG_context (CHG_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore | 2 | 3 | 4 | 5 |
|:---------:|:-----:|:-----:|:-----:|:-----:|
| 1 | 0.790 | 0.956 | 1.168 | 1.154 |
| 2 | | 1.246 | 0.706 | 1.340 |
| 3 | | | 1.360 | 1.352 |
| 4 | | | | 1.386 |
CHH_context (CHH_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore | 2 | 3 | 4 | 5 |
|:---------:|:-----:|:-----:|:-----:|:-----:|
| 1 | 0.758 | 0.924 | 1.150 | 1.098 |
| 2 | | 1.224 | 0.702 | 1.246 |
| 3 | | | 1.370 | 1.324 |
| 4 | | | | 1.344 |
CpG_context (CpG_context_SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.txt.gz)
| Multicore | 2 | 3 | 4 | 5 |
|:---------:|:-----:|:-----:|:-----:|:-----:|
| 1 | 0.774 | 0.838 | 1.036 | 1.016 |
| 2 | | 1.104 | 0.620 | 1.244 |
| 3 | | | 1.194 | 1.202 |
| 4 | | | | 1.258 |
bedGraph (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.bedGraph.gz)
| Multicore | 2 | 3 | 4 | 5 |
|:---------:|:-----:|:-----:|:-----:|:------:|
| 1 | 0.004 | 0.006 | 0.004 | 0.002 |
| 2 | | 0.002 | 0.004 | 0.006 |
| 3 | | | 0.002 | 0.008 |
| 4 | | | | 0.006 |
cov (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.bismark.cov.gz)
| Multicore | 2 | 3 | 4 | 5 |
|:---------:|:-----:|:-----:|:-----:|:------:|
| 1 | 0.004 | 0.006 | 0.004 | 0.002 |
| 2 | | 0.002 | 0.004 | 0.006 |
| 3 | | | 0.002 | 0.008 |
| 4 | | | | 0.006 |
M-bias (SRR12545849_$i_bismark_bt2_pe.rmdup.unsorted.M-bias.txt)
| Multicore | 2 | 3 | 4 | 5 |
|:---------:|:----:|:----:|:----:|:----:|
| 1 | 15.2 | 30.6 | 20.0 | 22.6 |
| 2 | | 28.6 | 11.5 | 13.8 |
| 3 | | | 31.8 | 32.6 |
| 4 | | | | 19.5 |
Thanks for providing an example. I have now downloaded the file in question and did a few tests myself.
As a preliminary summary I would suggest that the Bismark mapping is perfectly consistent between different runs, the differences are introduced by introducing several sort and Picard steps along the way.
Here are a few more details:
--rrbs
first)--non_directional
A typical alignment command would be
bismark --multicore $1 --genome /genome_dir/ -1 SRR12545849_GSM4753395_B3_scRRBS_1_val_1.fq.gz -2 SRR12545849_GSM4753395_B3_scRRBS_2_val_2.fq.gz
I have then trialled --parallel 1
, --parallel 2
and --parallel 3
(the same as multicore). The only thing that changes is the execution time, as expected:
diff parallel 1 vs parallel 3:
diff SRR12545849_GSM4753395_B3_scRRBS_parallel_1_bismark_bt2_PE_report.txt SRR12545849_GSM4753395_B3_scRRBS_parallel_3_bismark_bt2_PE_report.txt
42c42
< Bismark completed in 0d 0h 6m 43s
---
> Bismark completed in 0d 0h 2m 37s
42c42
diff parallel 2 vs parallel 3:
diff SRR12545849_GSM4753395_B3_scRRBS_parallel_2_bismark_bt2_PE_report.txt SRR12545849_GSM4753395_B3_scRRBS_parallel_3_bismark_bt2_PE_report.txt
< Bismark completed in 0d 0h 3m 12s
---
> Bismark completed in 0d 0h 2m 37s
So whatever caused the discrepancies must be introduced at subsequent steps you carried out using third party programs.
deduplicate_bismark SRR12545849_GSM4753395_B3_scRRBS_parallel_1_GRCh38_bismark_bt2_pe.bam
Again, I would probably advise not to do this even though it is mentioned for the GEO entry.
samtools sort -n
by position does not accurately sort reads that are mapped in a non-directional manner back together. TLDR: Reads should be trimmed before mapping. The sort-by-position, mark duplicates with picard, re-sort by seqname workflow is likely to introduce errors on several levels, apart from the fact that --non_directional
mapping and deduplication are not advised for this dataset in the first place. Attached is a MultiQC report of the file in questions for the 3 different mapping modes.
I hope this helps.
Thank you for your detailed answer and analysis!
We noticed that you have already provided a protocol at the bismark homepage (https://www.epigenesys.eu/en/protocols/bio-informatics/483-quality-control-trimming-and-alignment-of-bisulfite-seq-data-prot-57), where you have mentioned not removing duplicates for RRBS samples (but did not mention using deduplicate_bismark instead of third-party tools). Is there any additional resource we could refer to for better design of the analysis pipeline?
I don't think there is a specific resource, but I here is a guide to RRBS data we have written a while ago: RRBS Guide
This is the pipeline we are using internally for RRBS data (taken from this Nextflow pipeline)
SYNOPSIS:
This workflow runs a reduced-representation Bisulfite-seq (RRBS) processing pipeline on FastQ files. This includes QC,
quality-/adapter- and RRBS-trimming, contamination QC (post-trimming), alignments to a genome using Bismark,
NO deduplication, methylation extraction, coverage file generation, and finally generates an aggregate MultiQC report.
The workflow is suitable for any RRBS sequencing experiment using MspI. Here is a graphical representation of the workflow:
--- FastQC
--- Trim Galore
|
--- FastQ Screen
--- FastQC
--- Bismark
|
--- Bismark methylation extraction
--- bismark2report*
--- bismark2summary*
--- MultiQC*
And finally, here are some slides and exercises we used for a methylation data analysis training course. I hope this will help?
Great, we'll try these and let you know if we have any further problems. Thanks again for your help!
Hello,
Thank you for your developing bismark for the community. We, however, observed inconsistent bismark bam output between different --multicore setting. These bams would lead to ~0.2% difference in the bismark_methylation_extractor output. Could you please take a look on this? Thanks in advance!
We have used bismark v0.23.0 with bowtie2 2.4.5.