DInoPORE is a computational method for detecting adenosine-to-inosine (A-to-I) editing sites from direct-RNA sequencing data. For each editing site it identifies, DInoPORE also estimates the editing rate.
This GitHub repository contains the scripts to run DInoPORE. Users who wish to train their own models using DInoPORE's architecture can also find the scripts for training (see training path section below).
A sample dataset illustrating how the pipeline runs, and the typical runtime for a small sample, can be found in DInoPORE's Code Ocean capsule (DOI: 10.24433/CO.2180901.v1). This dataset contains 53,283 direct-RNA-seq reads from Xenopus laevis stage 9, replicate 1. R Markdown files reproducing key results from the DInoPORE manuscript are also available in the Code Ocean capsule under /data/Rmarkdown.
Listed below are the computational environment and software used. Usage documentation can be found in the last section of this readme.
=====================================================================================
- Guppy_basecaller 3.2.4 (please modify S1.Basecall_map_nanopolish.sh to point to it)
- Third-party software included in this repository
- conda environment (created with the command below)
- R packages (R version 4.1)
DInoPORE has been tested on CentOS Linux 7 and Ubuntu 20.04.
=====================================================================================
conda create -n myenv python=3.8.5 h5py=2.10.0 nanopolish=0.11.1 pillow=8.3.1 pyyaml=5.4.1 requests=2.26.0 samtools=1.9 scipy=1.7.1
conda activate myenv
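To check that the environment resolved to the pinned versions, a quick sanity check (output formatting varies between tools):

conda activate myenv
samtools --version | head -n 1     # expect samtools 1.9
nanopolish --version               # expect nanopolish version 0.11.1
python -c "import h5py, scipy; print(h5py.__version__, scipy.__version__)"   # expect 2.10.0 1.7.1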
=====================================================================================
bash mainscript1.sh -e <path/to/exptdir> -r <path/to/ref.fa> -n <numcore> -g <groupname> -c <path/to/classref>
bash mainscript2.sh -e <path/to/exptdir> -r <path/to/ref.fa> -n <numcore> -g <groupname> -c <path/to/classref>
Required arguments
-e Full path to the directory containing the "fast5" folder. The user must also have write permission for this directory's parent directory.
-r Full path to the reference genome FASTA
-n Number of threads available to DInoPORE
-g User-defined group name marking runs that belong to the same group. This affects the aggregation step when aggregating across multiple experiment runs.
-c Class and edit-rate reference containing ground-truth information about each coordinate's class and edit rate (see code/misc/Example_classref.tsv for the format)
Optional argument
-d [y/n] Delete base-called fastq files? Default value is n.
E.g. bash mainscript1.sh -e /data/xen_s9_r1_50k -r /data/reference/xenLae2.fa -n 15 -g xen50k -c /data/xen_s9_r1_50k/groundtruth_class_regression.tsv
NOTE: The main scripts expect to find a "fast5" directory within exptdir:
path/to/exptdir
└── fast5
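For the sample dataset, that layout can be created as follows (a sketch; the source location of your raw .fast5 files is an assumption):

mkdir -p /data/xen_s9_r1_50k/fast5
cp /path/to/raw_reads/*.fast5 /data/xen_s9_r1_50k/fast5/   # place raw reads in the fast5 folder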
=====================================================================================
Script(s):
S1.Basecall_map_nanopolish.sh (input: $exptdir $ref $numcore)
Output:
${exptdir}/out_fastq_bam/${expt}.combined.fastq
${exptdir}/out_fastq_bam/${expt}.combined.gm2.sam
${exptdir}/out_fastq_bam/${expt}.sorted.bam
${exptdir}/out_nanopolish/$expt.gm2.nanopolish_eventAlignOut.txt
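The step scripts appear to take their inputs positionally, as the (input: ...) annotations suggest; under that assumption, step 1 alone could be run as follows (paths reused from the example above):

bash S1.Basecall_map_nanopolish.sh /data/xen_s9_r1_50k /data/reference/xenLae2.fa 15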
Script(s):
S2.Process_bam_nnpl.sh (input: $exptdir $ref $numcore)
└── s2.Sam2tsv_processtsv.sh
└── s2.Combine_raw_nnpl.sh
└── s2.Combine_raw_nanopolish.R
Output:
${exptdir}/raw_features/$expt.tsv.txt
${exptdir}/raw_features/$expt.gm2.nanopolish_eventAlignOut_combined.txt
Script(s):
S3.Generate_raw_features.sh (input: $exptdir $numcore $agggrp)
└── s3.Generate_raw_feature_table.R
Output:
$(dirname $exptdir)/aggregate_reads/${expt}.tsv_nnpl_inAE.txt_grp${agggrp}
Script(s):
S4.Aggregate_reads_into_pos.sh (input: $exptdir $numcore $agggrp)
└── s4.Aggregating_reads_pos.R
Output:
$(dirname $exptdir)/matrix_CNN/$agggrp.Agg.morefts.10bin.inML.txt
NOTE: This step aggregates all files in aggregate_reads that were processed with the same user-defined group name (-g) and generates one file per group in matrix_CNN for downstream processing.
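For example, two replicate runs given the same group name end up in a single per-group matrix (a sketch; the replicate directories and classref paths here are hypothetical):

bash mainscript1.sh -e /data/xen_s9_r1 -r /data/reference/xenLae2.fa -n 15 -g xen_s9 -c /data/xen_s9_r1/classref.tsv
bash mainscript1.sh -e /data/xen_s9_r2 -r /data/reference/xenLae2.fa -n 15 -g xen_s9 -c /data/xen_s9_r2/classref.tsv
# after the aggregation step, both runs contribute to /data/matrix_CNN/xen_s9.Agg.morefts.10bin.inML.txt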
We provide three trained models for testing. Users can use these models to detect inosine and quantify editing rates in their own data.
NOTE: Models 1 and 2 were trained on H9 and Xenopus laevis data. Model 3 was trained on editing sites in H9, Xenopus laevis, HCT116, Mus musculus (mouse), and synthetic RNAs.
Script(s):
S5.Transform_dim.sh (input: $exptdir $numcore $agggrp $classref)
└── s5.Preprocess_data_matrix_inputCNN.R
Output:
$(dirname $exptdir)/matrix_CNN/$agggrp.morefts.input_CNN_regression_modgen.RData
Script(s):
S6.Predict.sh (input: $exptdir $agggrp $numcore $classref)
└── s6.Predict_test_data.R
Output:
$(dirname $exptdir)/matrix_CNN/$agggrp.output_prediction_CNN_class0.txt
$(dirname $exptdir)/matrix_CNN/$agggrp.output_prediction_CNN_class1.txt
$(dirname $exptdir)/matrix_CNN/$agggrp.output_prediction_CNN_class2.txt
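Assuming the same positional convention as above, steps 5 and 6 on the sample data would be invoked as follows (note that S6's argument order, $exptdir $agggrp $numcore $classref, differs from S5's):

bash S5.Transform_dim.sh /data/xen_s9_r1_50k 15 xen50k /data/xen_s9_r1_50k/groundtruth_class_regression.tsv
bash S6.Predict.sh /data/xen_s9_r1_50k xen50k 15 /data/xen_s9_r1_50k/groundtruth_class_regression.tsv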
Users can train models on their own data using our models' architecture. Training time varies with dataset size, number of epochs, batch size, and GPU type.
For our data, model 3 was trained on 265,631 positions with 893,865 matrices on an NVIDIA GeForce RTX 3080 GPU with batch size = 1024; one epoch took 88 seconds, and training ran for 900 epochs.
NOTE 1: Users should make sure the training data set is balanced across the three classes, especially class 0 (unmodified) and class 1 (inosine or other modifications). The training set should contain at least 50,000 matrices for models 1 and 2, and at least 500,000 matrices for model 3.
NOTE 2: The ground truth for the training path must have 5 columns: contig, position, strand (p for positive, n for negative), edit class (0, 1, or 2), and rate (editing rate 0-1 for class 1; -1 for classes 0 and 2). For an example, see Dinopore/code/misc/Example_ground_truth_training.txt or the sketch after these notes.
NOTE 3: Users should train for at least 500 epochs for models 1 and 2, and at least 900 epochs for model 3.
NOTE 4: The training scripts use a GPU. If users wish to train on CPU instead, please remove the following lines of code from the respective scripts: s6a.Training_classification_model_3class.R (lines 58-59), s6b.Training_classification_model_2class.R (lines 58-59), s6c.Training_regression_model.R (lines 74-75).
NOTE 5: Training scripts should be run in the same directory as the input file (usually matrix_CNN). The output file is generated in the same directory.
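For illustration, a hypothetical ground-truth file in the NOTE 2 format (coordinates and rates are invented; check Example_ground_truth_training.txt for whether a header row is expected):

contig	position	strand	edit	rate
chr1	1024	p	1	0.35
chr1	2048	n	0	-1
chr2	4096	p	2	-1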
Script(s):
Rscript s5.Train_preprocess_data_matrix_inputCNN_train_val.R -t $numcore -i $input -o $output -c $classref
where
input=${groupName}.Agg.morefts.10bin.inML.txt
output=${groupName}
classref="groundtruth_training_file" (see NOTE 2 above)
Output:
${groupName}.validation_matrix.rds (used as $vali in steps 6a-c)
${groupName}.training_matrix.rds (used as $train in steps 6a-c)
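For example, for group xen50k with 15 threads (the ground-truth file name here is an assumption):

Rscript s5.Train_preprocess_data_matrix_inputCNN_train_val.R -t 15 -i xen50k.Agg.morefts.10bin.inML.txt -o xen50k -c groundtruth_training.txt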
Script(s):
Rscript s6a.Training_classification_model_3class.R -v $vali -t $train -o $groupName -e $epoch -b $batch -s $seed
where
vali=${groupName}.validation_matrix.rds
train=${groupName}.training_matrix.rds
Output:
${groupName}.best_pos5_mix_3class_resnet.h5
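For example, training for the minimum 500 epochs (NOTE 3) with the batch size from our benchmark above; the seed value is an arbitrary choice:

Rscript s6a.Training_classification_model_3class.R -v xen50k.validation_matrix.rds -t xen50k.training_matrix.rds -o xen50k -e 500 -b 1024 -s 42

s6b and s6c below take the same arguments.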
Script(s):
Rscript s6b.Training_classification_model_2class.R -v $vali -t $train -o $groupName -e $epoch -b $batch -s $seed
Output:
${groupName}.best_pos5_mix_3c_1vs1_resnet.h5
Script(s):
Rscript s6c.Training_regression_model.R -v $vali -t $train -o $groupName -e $epoch -b $batch -s $seed
Output:
${groupName}.best_regression_morefts_16384_1024_asin06.h5
Script(s):
Rscript s7.Predict_test_data_using_trained_models.R -i $input -t $numcore -m $model1 -M $model2 -r $model3
where
input=${groupName}.validation_matrix.rds
model1=${groupName}.best_pos5_mix_3class_resnet.h5
model2=${groupName}.best_pos5_mix_3c_1vs1_resnet.h5
model3=${groupName}.best_regression_morefts_16384_1024_asin06.h5
Output:
${groupName}.output_prediction_CNN_class0.txt
${groupName}.output_prediction_CNN_class1.txt
${groupName}.output_prediction_CNN_class2.txt
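For example, substituting the file names above for groupName=xen50k with 15 threads:

Rscript s7.Predict_test_data_using_trained_models.R -i xen50k.validation_matrix.rds -t 15 \
  -m xen50k.best_pos5_mix_3class_resnet.h5 \
  -M xen50k.best_pos5_mix_3c_1vs1_resnet.h5 \
  -r xen50k.best_regression_morefts_16384_1024_asin06.h5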