MethylScore is a pipeline to call differentially methylated regions between samples obtained from Whole Genome Bisulfite Sequencing (WGBS).
MethylScore starts from bam files containing alignments of bisulfite-sequenced reads to a reference genome, produced by bisulfite read alignment tools such as bismark and bwa_meth. Alternatively, bedGraph files with pre-tabulated methylation information, as produced by MethylDackel extract or bismark_methylation_extractor, can be supplied.
If genomic alignments are supplied as an input, mapped reads from technical replicates are first merged and coordinate-sorted using samtools sort, and the mappings for each sample are de-duplicated using picard MarkDuplicates. For concurrent processing, the alignments are then split by chromosome, and, for each sample and chromosome, the numbers of methylated/unmethylated reads per position (pileup information) are retrieved using MethylDackel extract.
The pileup information from all analysed samples is summarised in a so-called genome matrix that is generated per sample and chromosome in parallel. The global genome matrix serves as input for the detection of methylated regions (MRs) in each sample. MRs are determined by a two-state Hidden Markov Model (HMM)-based method that learns separate methylation level distributions for an unmethylated and a methylated state from whole-genome data. Finally, to identify significant regional differences in methylation between samples, MethylScore clusters samples by methylation level and statistically tests the groupwise methylation distributions for significant differences.
All software dependencies required by MethylScore are provided in a Docker container. The only requirements to run MethylScore are Nextflow and a supported container engine (Singularity, Docker, Charliecloud or Podman).
MethylScore requires a samplesheet. It serves to create a mapping between sample identifiers and corresponding file locations of input files and should consist of two columns separated by tabs (column headers are not required):
sampleID | path |
---|---|
S1 | /path/to/S1A.{bam,bedGraph} |
S2 | /path/to/S2A.{bam,bedGraph} |
S2 | /path/to/S2B.{bam,bedGraph} |
Samples sharing the same sampleID will be treated as technical replicates and merged prior to further processing.
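As a pre-flight check, it can help to see which files will be merged as technical replicates before launching the pipeline. The following is a minimal sketch and not part of MethylScore; the function name `group_replicates` is hypothetical, and it assumes a header-less, tab-separated sheet as described above (a header row, if present, would be treated as a sample).

```python
import csv
import io
from collections import defaultdict


def group_replicates(handle):
    """Group file paths by sampleID from a two-column, tab-separated sheet."""
    groups = defaultdict(list)
    for row in csv.reader(handle, delimiter="\t"):
        if not row:
            continue  # skip blank lines
        if len(row) != 2:
            raise ValueError(f"expected 2 tab-separated columns, got {len(row)}: {row}")
        sample_id, path = row
        groups[sample_id].append(path)
    return dict(groups)


# In-memory stand-in for a samplesheet file with the rows from the table above.
sheet = io.StringIO(
    "S1\t/path/to/S1A.bam\n"
    "S2\t/path/to/S2A.bam\n"
    "S2\t/path/to/S2B.bam\n"
)
replicates = group_replicates(sheet)
for sample_id, paths in replicates.items():
    print(sample_id, len(paths), "file(s)")
```

Here S2 is listed with two files, so those two inputs would be treated as technical replicates of the same sample.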
To start the pipeline (using docker in this case), at least --SAMPLE_SHEET and --GENOME (in fasta format, the same one the reads were mapped against) have to be provided.
# If genomic alignments in bam format are provided
nextflow run Computomics/MethylScore --SAMPLE_SHEET=/path/to/samplesheet.tsv --GENOME=/path/to/reference_genome.fa -profile docker
# If bedGraph input is provided
nextflow run Computomics/MethylScore --BEDGRAPH --SAMPLE_SHEET=/path/to/samplesheet.tsv --GENOME=/path/to/reference_genome.fa -profile docker
The pipeline will create an output directory structure that looks like the following:
├── 01mappings
│   ├── S1
│   │   ├── S1.cov.avg
│   │   ├── S1.cov_stats.tsv
│   │   └── S1.read_stats.tsv
│   └── S2
│       ├── S2.cov.avg
│       ├── S2.cov_stats.tsv
│       └── S2.read_stats.tsv
├── 02consensus
│   └── mbias
│       ├── S1.Chr1_OB.svg
│       ├── S1.Chr1_OT.svg
│       ├── S2.Chr1_OB.svg
│       └── S2.Chr1_OT.svg
├── 03matrix
│   ├── genome_matrix.tsv.gz
│   └── genome_matrix.tsv.gz.tbi
├── 04MRs
│   ├── hmm_parameters
│   │   ├── S1.hmm_params
│   │   └── S2.hmm_params
│   ├── S1.MRs.bed
│   ├── S2.MRs.bed
│   └── stats
│       ├── S1.MR_stats.tsv
│       └── S2.MR_stats.tsv
├── 05DMRs
│   └── all
│       ├── DMRs.CG.bed
│       ├── DMRs.CHG.bed
│       └── DMRs.CHH.bed
├── MethylScore_graph.png
├── MethylScore_report.html
└── MethylScore_trace.txt
The 01mappings directory contains alignment statistics for each sample.
Sorted and de-duplicated bam files are optionally stored in this directory (if run with --REMOVE_INTMED_FILES false).
The 02consensus directory contains mbias plots showing methylation with respect to position along the sequencing reads. These plots should be used to (re-)assess read trimming settings as needed.
Single-cytosine pileup information for each sample is optionally stored in this directory (if run with --REMOVE_INTMED_FILES false).
The 03matrix directory contains the merged whole genome matrix across all samples as a bgzip-compressed file, along with the corresponding tabix index.
The per-chromosome genome matrices are optionally stored in this directory (if run with --REMOVE_INTMED_FILES false).
The 04MRs directory contains genomic coordinates of methylated regions (MRs) as they were segmented by the Hidden Markov Model, along with associated region-based metrics. The parameters obtained from training the model on each sample are stored and can be used to reduce the computational burden in subsequent pipeline runs.
The coordinates are stored in bed format, with the following columns:
column 1: chromosome ID
column 2: (1-based) start position
column 3: (1-based) end position, half-open (i.e. this position is not part of the region)
column 4: Number of covered cytosines in MR
column 5: Mean read depth in MR
column 6: 5th percentile of read depth in MR
column 7: Mean methylation rate of cytosines within MR
column 8: SampleID
Example:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 |
---|---|---|---|---|---|---|---|
Chr1 | 597 | 651 | 23 | 11 | 12 | 12 | S1 |
Chr1 | 763 | 956 | 52 | 5 | 9 | 51 | S1 |
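For downstream scripting, the eight columns listed above can be mapped onto named fields. This is only a sketch: the field names and the `parse_mr_line` helper are hypothetical, and it assumes the MR files are tab-separated without a header line.

```python
from typing import NamedTuple


class MethylatedRegion(NamedTuple):
    """One row of an MR bed file; field names are illustrative only."""
    chrom: str
    start: int
    end: int
    n_cytosines: int
    mean_depth: float
    depth_p5: float
    mean_methylation: float
    sample_id: str


def parse_mr_line(line):
    """Split one tab-separated MR row into typed fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, got {len(fields)}")
    return MethylatedRegion(
        fields[0],
        int(fields[1]),
        int(fields[2]),
        int(fields[3]),
        float(fields[4]),
        float(fields[5]),
        float(fields[6]),
        fields[7],
    )


# First row of the example table, written out as a tab-separated line.
region = parse_mr_line("Chr1\t597\t651\t23\t11\t12\t12\tS1")
print(region.chrom, region.end - region.start, region.sample_id)
```

Because the end position is described as half-open, `end - start` gives the region length directly.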
The 05DMRs directory contains genomic coordinates of differentially methylated regions (DMRs) that were determined to differ significantly between sample clusters, after candidate region selection from MRs followed by statistical testing.
column 1: chromosome ID
column 2: (1-based) start position
column 3: (1-based) end position, half-open (i.e. this position is not part of the region)
column 4: Length in bp
column 5: Cluster-String, one symbol per sample:
"1", "2", "3", ... = cluster ID
"." = sample is not covered at all positions within region
"-" = sample is not sufficiently covered at all positions within region (at least DMR_MIN_C positions with a minimum read depth of DMR_MIN_COV-fold are required)

Example:
X1 | X2 | X3 | X4 | X5 | X6 | X7 | X8 | X9 | X10 | X11 | X12 |
---|---|---|---|---|---|---|---|---|---|---|---|
Chr1 | 42255 | 42409 | 155 | 11.21-211 | 1:17,20,0,12 | 2:75,80,0,28 | 1:S1,S3,... | 2:S2,S6,... | #:30,14,3,13 | CG,CHH | CHH |
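The cluster string in column 5 (for example 11.21-211) can be decoded into one status per sample. The sketch below assumes that digits denote cluster IDs, that "." marks a sample without coverage and "-" a sample with insufficient coverage, and that each cluster ID is a single character. The sample order is not encoded in the string, so the caller has to supply it; the sample names A to I used here are placeholders, and `decode_cluster_string` is a hypothetical helper.

```python
def decode_cluster_string(cluster_string, sample_ids):
    """Map each sample to its cluster ID or to a coverage status label."""
    if len(cluster_string) != len(sample_ids):
        raise ValueError("cluster string needs exactly one symbol per sample")
    status = {}
    for sample, symbol in zip(sample_ids, cluster_string):
        if symbol == ".":
            status[sample] = "not covered"
        elif symbol == "-":
            status[sample] = "insufficient coverage"
        elif symbol.isdigit():
            status[sample] = int(symbol)  # cluster ID
        else:
            raise ValueError(f"unexpected symbol {symbol!r}")
    return status


# Placeholder sample names; the real order depends on the pipeline run.
samples = list("ABCDEFGHI")
decoded = decode_cluster_string("11.21-211", samples)
for sample in samples:
    print(sample, decoded[sample])
```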
Pipeline parameters can either be configured using a parameter file, or individual parameters can be passed to the pipeline on the command line.
For example, methylated regions can be visualized as IGV tracks by passing the --IGV parameter via the command line:
nextflow run Computomics/MethylScore --SAMPLE_SHEET=samplesheet.tsv --GENOME=genome.fa --IGV -profile podman
Alternatively, the repository contains a template example_config.yaml, which can be edited and used to pass custom parameters to the pipeline in a more reproducible manner using the -params-file flag.
nextflow run Computomics/MethylScore --SAMPLE_SHEET=samplesheet.tsv --GENOME=genome.fa -params-file=/path/to/config.yaml -profile podman
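Nextflow accepts parameter files in JSON as well as YAML, so a parameter file can also be generated from a script. The sketch below only renders the file content; the parameter names are taken from this document, the values are placeholders, and `render_params` is a hypothetical helper rather than anything shipped with MethylScore.

```python
import json

# Parameter names as used in this document; values are placeholders.
params = {
    "SAMPLE_SHEET": "/path/to/samplesheet.tsv",
    "GENOME": "/path/to/reference_genome.fa",
    "IGV": True,
    "REMOVE_INTMED_FILES": False,
}


def render_params(values):
    """Serialise parameters as JSON text suitable for a params file."""
    return json.dumps(values, indent=2, sort_keys=True) + "\n"


print(render_params(params))
```

The printed text can be saved to a file and passed to the pipeline with the -params-file flag in place of the YAML template.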
required
required
default: false
default: false
default: true
default: false
default: true
default: './results'
default: true
default: false
default: true
default: 30
default: 0,0,0,0
default: 0,0,0,0
default: 1
default: 20
default: 30
default: 100
DESERT_SIZE bp without covered cytosines, break segment and rather start separate HMM path. This prevents extending MRs over stretches of missing data.
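The effect of DESERT_SIZE can be illustrated by splitting a sorted list of covered cytosine positions wherever the gap between neighbours is too large. This is a sketch under the assumption that a gap strictly greater than DESERT_SIZE triggers the break; the exact comparison used by the pipeline is not stated here, and `split_at_deserts` is a hypothetical helper.

```python
def split_at_deserts(positions, desert_size):
    """Split sorted positions into segments separated by gaps > desert_size."""
    segments = []
    current = []
    for pos in positions:
        # Start a new segment when the distance to the previous covered
        # cytosine exceeds the allowed desert size.
        if current and pos - current[-1] > desert_size:
            segments.append(current)
            current = []
        current.append(pos)
    if current:
        segments.append(current)
    return segments


covered = [10, 40, 90, 250, 260]
print(split_at_deserts(covered, desert_size=100))
```

With these positions, the 160 bp gap between 90 and 250 separates the data into two segments, each of which would be handled as its own HMM path.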
default: 20
default: 30
MERGE_DIST bp close to each other.
default: 10
default: false
default: 0
SLIDING_WINDOW_SIZE along segmented regions to break down candidate regions to test.
default: 0
If SLIDING_WINDOW_SIZE and SLIDING_WINDOW_STEP are set to values greater than 0, a sliding window of size SLIDING_WINDOW_SIZE is applied along selected regions to break down candidate regions to test, using a step size of SLIDING_WINDOW_STEP.
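The interplay of SLIDING_WINDOW_SIZE and SLIDING_WINDOW_STEP can be sketched as follows. The sketch assumes half-open coordinates and truncates the last window at the region end; how the pipeline treats region boundaries is an assumption here, and `sliding_windows` is a hypothetical helper.

```python
def sliding_windows(start, end, size, step):
    """Yield (start, end) windows covering the half-open region [start, end)."""
    if size <= 0 or step <= 0:
        # Windows only apply when both values are greater than 0;
        # otherwise the region is passed through unchanged.
        yield (start, end)
        return
    pos = start
    while pos < end:
        yield (pos, min(pos + size, end))
        if pos + size >= end:
            break  # the region end has been reached
        pos += step


windows = list(sliding_windows(100, 400, size=200, step=100))
print(windows)
```

A 300 bp region with a window size of 200 and a step of 100 yields two overlapping candidate windows.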
default: false
default: 500
default: true
default: CG,CHG,CHH
default: 20
CLUSTER_MIN_METH_DIFF_CG, CLUSTER_MIN_METH_DIFF_CHG and CLUSTER_MIN_METH_DIFF_CHH only apply when DMRS_PER_CONTEXT is set to true.
For each candidate region, cluster centers are searched to minimize the within-group variance. The value of k is iteratively incremented starting from k = 2, until the pairwise comparison of all cluster centers results in a methylation difference of less than CLUSTER_MIN_METH_DIFF.
This effectively discards likely irrelevant DMRs with only a few percentage points methylation difference.
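The iterative choice of k described above can be sketched with a plain one-dimensional k-means. This is not the pipeline's implementation: the clustering routine, its deterministic initialisation, and the decision to keep the centers from the last k that passed the threshold are assumptions made for illustration, and the threshold of 20 is used only as an example value.

```python
from itertools import combinations


def kmeans_1d(values, k, iterations=50):
    """Lloyd's k-means on scalars with deterministic, evenly spread starts."""
    ordered = sorted(values)
    n = len(ordered)
    # Pick initial centers evenly across the sorted values.
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centers[i]))
            groups[nearest].append(v)
        # Keep the previous center if a cluster ends up empty.
        updated = [sum(g) / len(g) if g else c for g, c in zip(groups, centers)]
        if updated == centers:
            break
        centers = updated
    return sorted(centers)


def choose_centers(values, min_diff):
    """Increase k from 2 until some pair of centers differs by < min_diff."""
    accepted = None
    for k in range(2, len(set(values)) + 1):
        centers = kmeans_1d(values, k)
        if any(abs(a - b) < min_diff for a, b in combinations(centers, 2)):
            break
        accepted = centers
    return accepted  # None means no grouping passed the threshold


# Hypothetical per-sample methylation levels (percent) in one candidate region.
levels = [5, 8, 10, 70, 75, 80]
print(choose_centers(levels, min_diff=20))
```

In this toy input, two clusters are well separated, while a third center would sit only a few percentage points from its neighbour, so the search stops and the two-cluster solution is kept.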
default: 20
CLUSTER_MIN_METH_CG, CLUSTER_MIN_METH_CHG and CLUSTER_MIN_METH_CHH only apply when DMRS_PER_CONTEXT is set to true.
default: 3
default: 5
DMR_MIN_COV that are required within (candidate) DMRs.
default: 0.05
default: 3