deepsignal2 has the same DNN structure with deepsignal-plant, so the pre-trained models of deepsignal2/deepsignal-plant can be used by both tools. Importantly, when using models of these two tools, note that default --seq_len
in deepsignal2 is 17, while in deepsignal-plant is 13.
deepsignal2 applies BiLSTM to detect methylation from Nanopore reads. It is built on Python3 and PyTorch.
tombo resquiggle
and/or deepsignal2 call_mods
. Normally it will work after setting HDF5_PLUGIN_PATH
:
# 1. install hdf5/hdf5-tools (maybe not necessary)
# ubuntu
sudo apt-get install libhdf5-serial-dev hdf5-tools
# centos
sudo yum install hdf5-devel
wget https://github.com/nanoporetech/vbz_compression/releases/download/v1.0.1/ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz tar zxvf ont-vbz-hdf-plugin-1.0.1-Linux-x86_64.tar.gz export HDF5_PLUGIN_PATH=/abslolute/path/to/ont-vbz-hdf-plugin-1.0.1-Linux/usr/local/hdf5/lib/plugin
References: [deepsignal-plant issue #8](https://github.com/PengNi/deepsignal-plant/issues/8), [tombo issue #254](https://github.com/nanoporetech/tombo/issues/254), and [vbz_compression issue #5](https://github.com/nanoporetech/vbz_compression/issues/5).
## Contents
- [Installation](#Installation)
- [Trained models](#Trained-models)
- [Quick start](#Quick-start)
- [Usage](#Usage)
## Installation
deepsignal2 is built on [Python3](https://www.python.org/) and [PyTorch](https://pytorch.org/). [tombo](https://github.com/nanoporetech/tombo) is required to re-squiggle the raw signals from nanopore reads before running deepsignal2.
- Prerequisites:\
[Python3.*](https://www.python.org/) (version>=3.8)\
[tombo](https://github.com/nanoporetech/tombo) (version 1.5.1)
- Direct dependencies: \
[numpy](http://www.numpy.org/) \
[h5py](https://github.com/h5py/h5py) \
[statsmodels](https://github.com/statsmodels/statsmodels/) \
[scikit-learn](https://scikit-learn.org/stable/) \
[PyTorch](https://pytorch.org/) (version >=1.2.0, <=1.11.0)
- Non-direct dependencies: \
[scipy](https://scipy.org/) \
[pandas](https://pandas.pydata.org/)
#### 1. Create an environment
We highly recommend using a virtual environment for the installation of deepsignal2 and its dependencies. A virtual environment can be created and (de)activated as follows by using [conda](https://conda.io/docs/):
```bash
# create
conda create -n deepsignal2env python=3.8
# activate
conda activate deepsignal2env
# deactivate
conda deactivate
The virtual environment can also be created by using virtualenv.
After creating and activating the environment, download deepsignal2 (latest version) from github:
git clone https://github.com/PengNi/deepsignal2.git
cd deepsignal2
python setup.py install
or install deepsignal2 using pip:
pip install deepsignal2
PyTorch can be automatically installed during the installation of deepsignal2. However, if the version of PyTorch installed is not appropriate for your OS, an appropriate version should be re-installed in the same environment as the instructions:
# install using conda
conda install pytorch==1.11.0 cudatoolkit=10.2 -c pytorch
# or install using pip
pip install torch==1.11.0
tombo is required to be installed in the same environment:
# install using conda
conda install -c bioconda ont-tombo
# or install using pip
pip install ont-tombo
The models we trained can be downloaded from google drive.
Currently, we have trained the following models:
To call modifications, the raw fast5 files should be basecalled (Guppy>=3.6.1) and then be re-squiggled by tombo. At last, modifications of specified motifs can be called by deepsignal. Belows are commands to call 5mC in CG contexts:
# 1. guppy basecall
guppy_basecaller -i fast5s/ -r -s fast5s_guppy --config dna_r9.4.1_450bps_hac_prom.cfg
# 2. tombo resquiggle
cat fast5s_guppy/*.fastq > fast5s_guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s/ --fastq-filenames fast5s_guppy.fastq --sequencing-summary-filenames fast5s_guppy/sequencing_summary.txt --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10
tombo resquiggle fast5s/ /path/to/genome/reference.fa --processes 10 --corrected-group RawGenomeCorrected_000 --basecall-group Basecall_1D_000 --overwrite
# 3. deepsignal2 call_mods
# CG
CUDA_VISIBLE_DEVICES=0 deepsignal2 call_mods --input_path fast5s/ --model_path model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt --result_file fast5s.CG.call_mods.tsv --corrected_group RawGenomeCorrected_000 --motifs CG --nproc 30 --nproc_gpu 6
python /path/to/deepsignal2/scripts/call_modification_frequency.py --input_path fast5s.CG.call_mods.tsv --result_file fast5s.CG.call_mods.frequency.tsv
Before run deepsignal, the raw reads should be basecalled (Guppy>=3.6.1) and then be processed by the re-squiggle module of tombo.
Note:
multi_to_single_fast5 -i $multi_read_fast5_dir -s $single_read_fast5_dir -t 30 --recursive
For example:
# 1. run multi_to_single_fast5 if needed
multi_to_single_fast5 -i $multi_read_fast5_dir -s $single_read_fast5_dir -t 30 --recursive
# 2. basecall, fast5s/ is the $single_read_fast5_dir
guppy_basecaller -i fast5s/ -r -s fast5s_guppy --config dna_r9.4.1_450bps_hac_prom.cfg
# 3. proprecess fast5 if basecall results are saved in fastq format
cat fast5s_guppy/*.fastq > fast5s_guppy.fastq
tombo preprocess annotate_raw_with_fastqs --fast5-basedir fast5s/ --fastq-filenames fast5s_guppy.fastq --sequencing-summary-filenames fast5s_guppy/sequencing_summary.txt --basecall-group Basecall_1D_000 --basecall-subgroup BaseCalled_template --overwrite --processes 10
# 4. resquiggle, cmd: tombo resquiggle $fast5_dir $reference_fa
tombo resquiggle fast5s/ /path/to/genome/reference.fa --processes 10 --corrected-group RawGenomeCorrected_000 --basecall-group Basecall_1D_000 --overwrite
Features of targeted sites can be extracted for training or testing.
For the example data (deepsignal2 extracts 17-mer-seq and 1716-signal features of each CpG motif in reads by default. Note that the value of --corrected_group must be the same as that of --corrected-group* in tombo.):
deepsignal2 extract -i fast5s -o fast5s.CG.features.tsv --corrected_group RawGenomeCorrected_000 --nproc 30 --motifs CG
The extracted_features file is a tab-delimited text file in the following format:
To call modifications, either the extracted-feature file or the raw fast5 files (recommended) can be used as input.
GPU/Multi-GPU support: Use CUDA_VISIBLE_DEVICES=${cuda_number} ccsmeth call_mods [options]
to call modifications with specified GPUs (e.g., CUDA_VISIBLE_DEVICES=0
or CUDA_VISIBLE_DEVICES=0,1
).
For example:
# call 5mCpGs for instance
# extracted-feature file as input, use CPU
CUDA_VISIBLE_DEVICES=-1 deepsignal2 call_mods --input_path fast5s.CG.features.tsv --model_path model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt --result_file fast5s.CG.call_mods.tsv --nproc 30
# extracted-feature file as input, use GPU
CUDA_VISIBLE_DEVICES=0 deepsignal2 call_mods --input_path fast5s.CG.features.tsv --model_path model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt --result_file fast5s.CG.call_mods.tsv --nproc 30 --nproc_gpu 6
# fast5 files as input, use CPU
CUDA_VISIBLE_DEVICES=-1 deepsignal2 call_mods --input_path fast5s/ --model_path model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt --result_file fast5s.CG.call_mods.tsv --corrected_group RawGenomeCorrected_000 --motifs CG --nproc 30
# fast5 files as input, use GPU
CUDA_VISIBLE_DEVICES=0 deepsignal2 call_mods --input_path fast5s/ --model_path model.dp2.CG.R9.4_1D.human_hx1.bn17_sn16.both_bilstm.b17_s16_epoch4.ckpt --result_file fast5s.CG.call_mods.tsv --corrected_group RawGenomeCorrected_000 --motifs CG --nproc 30 --nproc_gpu 6
The modification_call file is a tab-delimited text file in the following format:
A modification-frequency file can be generated by the script scripts/call_modification_frequency.py with the call_mods file as input:
# call 5mCpGs for instance
# output in tsv format
python /path/to/deepsignal2/scripts/call_modification_frequency.py --input_path fast5s.CG.call_mods.tsv --result_file fast5s.CG.call_mods.frequency.tsv
# output in bedMethyl format
python /path/to/deepsignal2/scripts/call_modification_frequency.py --input_path fast5s.CG.call_mods.tsv --result_file fast5s.CG.call_mods.frequency.bed --bed
The modification_frequency file can be either saved in bedMethyl format (by setting --bed
as above), or saved as a tab-delimited text file in the following format by default:
A new model can be trained as follows:
# need to split training samples to two independent datasets for training and validating
# please use deepsignal2 train -h/--help for more details
deepsignal2 train --train_file /path/to/train/file --valid_file /path/to/valid/file --model_dir /dir/to/save/the/new/model
Copyright (C) 2020 Jianxin Wang, Feng Luo, Peng Ni
This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with this program. If not, see https://www.gnu.org/licenses/.
Jianxin Wang, Peng Ni, School of Computer Science and Engineering, Central South University, Changsha 410083, China
Feng Luo, School of Computing, Clemson University, Clemson, SC 29634, USA