MethyLasso

A segmentation approach to analyze DNA methylation patterns and identify differentially methylation regions from whole-genome datasets.

MethyLasso models DNA methylation data using a nonparametric regression framework known as a Generalized Additive Model. It relies on the fused lasso method to segment the genome by estimating regions in which methylation is constant to study DNA methylation patterns in a single condition or capture dynamic changes of DNA methylation levels across conditions.
MethyLasso identifies low-methylated regions (LMRs), unmethylated regions (UMRs), DNA methylation valleys (DMVs) and partially methylated domains (PMDs) in a single condition as well as differentially methylated regions (DMRs) between two conditions.

By Delphine Balaramane, Yannick G Spill, Michaël Weber & Anaïs F Bardet.

Publication: https://www.biorxiv.org/content/10.1101/2023.07.27.550791v1

1. Installation

Dependencies:

Requires R 3.6 or greater
Requires R packages: Rcpp, RcppEigen, methods,scales, Matrix, matrixStats data.table, ggplot2, foreach, doParallel, stringr
Requires anaconda (if you use conda installation)

mkdir MethyLasso
unzip methylasso-main.zip
R CMD INSTALL --preclean methylasso-main

Conda installation:

unzip methylasso-main.zip
conda env create -f methylasso-main/conda_env.yml
conda activate MethyLasso
R CMD INSTALL --preclean methylasso-main

2. Usage

Using data from two different conditions:
Rscript MethyLasso.R --n1 [name1] --c1 [condition1_replicate1,condition1_replicate2,..] --n2 [name2] --c2 [condition2_replicate1,condition2_replicate2,..] [options]

Using data from only one condition (not calling DMRs):
Rscript MethyLasso.R --n1 [name1] --c1 [condition1_replicate1,condition1_replicate2,..] [options]

3. Arguments

argument	description
--n1	Name of condition 1 (used in output file names)
--c1	Input file(s) for condition 1 (separated by comma for replicates)
--n2	Name of condition 2 (used in output file names and corresponding to reference condition for DMR identification)
--c2	Input file(s) for condition 2 (separated by comma for replicates)

4. Input format

By Default, Methylasso takes a tab-delimited file containing for each CpG:
chr / start / end / percent_methylation / count_methylated / count_unmethylated This file can be generated within the Bismark pipeline with the command Coverage2Cytosine with option --mergeCpG.

Alternatively, files with different formats can be provided by specifying the columns. For each CpG, their genomic coordinates should be in the first three columns (chr / start / end) and columns containing the counts of methylated and unmethylated Cs OR total count and methylation percentage should be specified.

argument	description
--mC	Column number containing count of methylated Cs
--uC	Column number containing count of unmethylated Cs

5. Options

argument	description
-c	Minimum coverage per CpG (default 5)
-n	Minimum number of CpGs in LMRs, UMRS, DMVs and DMRs (default 4)
-m	Maximum mean methylation for PMDs (default 0.7)
-d	Minimum methylation difference for DMRs [0-1] (default 0.1)
-p	P-value threshold for DMRs (default 0.05)
-q	Q-value (FDR) threshold for DMRs (to be used instead of p-value threshold)
-r	Replicate coverage score for CpGs in DMRs (default 0.7)
-s	Skip LMR, UMR, DMV and PMD identification (default FALSE)
-t	Number of threads to use (default 1)
-o	Output directory (default current directory)
-f	Output figures in pdf format (default TRUE)
--quiet	Do not print processing information (default FALSE)
--version	Print version
--help	Print help

6. Additional options for advances configurations

Argument	Description
--umrlmr_lambda2	Segmentation value for UMR/LMR identification (default 25)
--tol_val	Tolerance value for optimization (default 0.01)
--pmd_lambda2	Segmentation value for PMD identification (default 1000)
--pmd_std_threshold	Standard deviation threshold for PMD (default 0.15)
--umr_std_threshold	Standard deviation threshold for UMR (default 0.1)
--umr_max_beta	Maximum methylation for UMR (default 0.1)
--lmr_max_beta	Maximum methylation for LMR (default 0.5)
--valley_max_beta	Maximum methylation for valleys (default 0.1)
--pmd_valley_min_width	Minimum width for PMD valleys (default 5000)
--max_distance	Maximum distance for merging segments (default 500)
--min_width	Minimum width for segments (default 30)
--umr_large_width	Large width threshold for UMRs (default 500)
--umr_large_density	Density threshold for large UMRs (default 0.03)
--dmr_lambda2	Segmentation value for DMR identification (default 25)
--min.overlap.pc	Minimum overlap percentage for DMRs (default 10)
--flank.width	Minimum width of flanking segments (default 300)
--flank.dist.max	Maximum distance of UMR/LMR to flanking segment (default 5000)
--flank.num.cpgs	Consider as small UMR/LMRs those which have num.cpgs smaller than this threshold (default 10)
--flank.beta.min	For small UMR/LMRs, require that flanking segments have mean methylation above this threshold (default 0.5)
--split.pmds	Whether to split PMDs which contain UMRs (default TRUE)
--merge.pmds	Whether to merge adjacent PMDs into a single region (default TRUE)
--drop	Whether to drop verbose columns and unannotated segments (default TRUE)

7. Output files and plots

Two histograms are generated for each replicate in each condition in a pdf file name_replicate_plots.pdf containing: a distribution of the sequencing depth and distribution of the methylation of CpGs obtained after applying the minimum coverage per CpG (option -m).

For the LMR/UMR/DMV/PMD part, two tab-delimited files will be generated for each condition name_lmr_umr_dmv.tsv and name_pmd.tsv containing: chr / start / end / number of CpGs / methylation / standard deviation / category

After PMDs identification, a smooth scatter plot containing the standart deviation and mean methylation of each PMDs is generated in a pdf file for each condition name1_pmds_plot.pdf and name2_pmds_plot.pdf.

For the DMR part, one tab-delimited file will be generated name1_vs_name2_dmrs.tsv containing: chr / start / end / number of CpGs condition 1 / number of CpGs condition 2 / replicate coverage score / mean methylation 1 / mean methylation 2 / methylation difference / p-value / q-value / annotation If the first part is not run (option -s), the last annotation column will not be added.

After DMRs identification, a smooth scatter plot containing the mean methylation 1 and mean methylation 2 of each DMRs is generated in a pdf file name1_vs_name2_dmrs_plot.pdf.

8. Example

An example of command line to run MethyLasso is provided using files from Hansen et al. 2011 comparing healthy and colon cancer conditions with 3 replicates for each (healthy_1, healthy_2, healthy_3, cancer_1, cancer_2, cancer_3). The input files contain for each CpG , their genomic coordinates in the first three columns (chr / start / end), the total Count of Cs in columns four (--cov 4) and the percentage of methylation in columns 5 (--meth 5). DMRs will be identified with a q-value threshold 0.05 (-q 0.05).

Rscript MethyLasso.R --n1 healthy --c1 WGBS_healthy_1_meth.bed.gz,WGBS_healthy_2_meth.bed.gz,WGBS_healthy_3_meth.bed.gz --n2 cancer --c2 WGBS_cancer_1_meth.bed.gz,WGBS_cancer_2_meth.bed.gz,WGBS_cancer_3_meth.bed.gz --cov 4 --meth 5 -q 0.05

Output files obtained by running this command are LMRs/UMRs/DMVs file healthy and cancer , PMDs file for healthy and cancer and DMRs file cancer vs healthy.

Also plots containing information on each input files are generated for healthy condition:
healthy1 healthy2 healthy3

and for cancer condition:
cancer1 cancer2 cancer3

A plot on PMDs for each condition, where the user can see what threshold to take if he wants to recover more methylated PMDs and change the -m option.
pmd_healthy pmd_cancer

And a plot on DMRs methylation in cancer and healthy conditions.

bardetlab / methylasso

readme