comprna / reorientexpress

Transcriptome long-read orientation with Deep Learning
MIT License

ReorientExpress

ReorientExpress is a program to create, test and apply models that predict the 5'-to-3' orientation of long reads from cDNA sequencing with Nanopore or PacBio, using deep neural networks, for samples without a genome or a transcriptome reference. For details on the benchmarks and analyses performed with this program, please see our publication: https://www.ncbi.nlm.nih.gov/pubmed/31783882



ReorientExpress is a tool to predict the orientation of cDNA reads from error-prone long-read sequencing technologies. It was developed to orient nanopore long reads from unstranded cDNA libraries without the need for a genome or transcriptome reference, but it is applicable to any set of long reads. ReorientExpress implements two Deep Neural Network (DNN) models, a Multi-Layer Perceptron (MLP) and a Convolutional Neural Network (CNN). As training input it uses a transcriptome annotation from any species, or any other FASTA/FASTQ file of RNA/cDNA sequences for which the orientation is known. Training or testing data can thus be experimental data, annotation data or mapped reads (providing the corresponding PAF file).

ReorientExpress has three main modes, implemented as three options: train, test and predict. In train mode, the input data is randomly split into three subsets, training, validation and test, with relative proportions of 0.75, 0.125 and 0.125, respectively. The training set is used to train the weights of the DNN model, the validation set is used to optimize the weights during the training process, and the test set is never seen during training and is only used at the end to evaluate the accuracy of the model.
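The random split used in train mode can be illustrated with a minimal Python sketch. This is not the ReorientExpress implementation: the function name split_dataset, the fixed seed and the use of the standard random module are assumptions made for illustration only. Only the proportions (0.75, 0.125, 0.125) come from the description above.

```python
import random


def split_dataset(records, fractions=(0.75, 0.125, 0.125), seed=0):
    """Shuffle records and split them into training, validation and test subsets.

    split_dataset is a hypothetical helper, not part of ReorientExpress.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = int(len(shuffled) * fractions[0])
    n_val = int(len(shuffled) * fractions[1])
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train, val, test


train, val, test = split_dataset(range(1000))
print(len(train), len(val), len(test))
```

Because the test subset is taken as the remainder, every record ends up in exactly one subset even when the input size is not a multiple of eight.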


Installation


ReorientExpress has been developed in Python 3.6. It can be cloned and used directly, or installed for easier reuse and dependency management.

Currently, you can use pip to do an automatic installation:

pip3 install reorientexpress

If some dependencies are not correctly downloaded and installed, using the following can fix it:

pip3 install -r requirements.txt
pip3 install reorientexpress

Once the package is installed, ReorientExpress can be used from the command line as any other program.

If you want to ensure you have the latest version, we recommend cloning the repository instead, although you will have to manage the dependencies yourself.


Commands and options


Once the package is installed it can be used as an independent program. ReorientExpress has three main functions, -train, -test and -predict, and one of them must be provided when calling the program.

The MLP model is run with reorientexpress.py and the CNN model with reorientexpress-cnn.py. The options that appear in the usage examples in this document are -data, -source, -annotation, -model, -output, -output_fastq and --v.


Inputs and Outputs


All input sequence files can be in FASTA or FASTQ format, either uncompressed or compressed (in .gz format).

Input sequences can be of three different types, which we call experimental, annotation and mapped.
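As an illustration of the accepted input formats, the following Python sketch reads FASTA or FASTQ files, compressed or not. It is not the parser used by ReorientExpress. The helper names open_maybe_gzip and read_sequences are hypothetical, the format is guessed from the first character of the file, and FASTQ records are assumed to span exactly four lines.

```python
import gzip
import os
import tempfile


def open_maybe_gzip(path):
    """Open a text file, decompressing on the fly if the name ends in .gz."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt")
    return open(path, "r")


def read_sequences(path):
    """Yield (name, sequence) pairs from a FASTA or FASTQ file.

    Assumption: FASTQ records use exactly four lines each.
    """
    with open_maybe_gzip(path) as handle:
        lines = (line.rstrip("\r\n") for line in handle)
        first = next(lines, "")
        if first.startswith(">"):  # FASTA: sequence may span several lines
            name, chunks = first[1:], []
            for line in lines:
                if line.startswith(">"):
                    yield name, "".join(chunks)
                    name, chunks = line[1:], []
                else:
                    chunks.append(line.strip())
            yield name, "".join(chunks)
        elif first.startswith("@"):  # FASTQ: header, sequence, '+', quality
            header = first
            while header:
                sequence = next(lines, "")
                next(lines, "")  # '+' separator line
                next(lines, "")  # quality string
                yield header[1:], sequence
                header = next(lines, "")
        elif first:
            raise ValueError("unrecognised format: expected '>' or '@'")


# Demo with two small made-up files written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "demo.fa.gz")
    with gzip.open(fasta_path, "wt") as out:
        out.write(">tx1\nAAGG\nCCTT\n>tx2\nACGT\n")
    fastq_path = os.path.join(tmp, "demo.fastq")
    with open(fastq_path, "w") as out:
        out.write("@r1 ch=1\nACGU\n+\n!!!!\n@r2\nGG\n+\n##\n")
    fasta_records = list(read_sequences(fasta_path))
    fastq_records = list(read_sequences(fastq_path))

print(fasta_records)
print(fastq_records)
```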

Examples of possible inputs:

Experimental

@0e403438-313b-4497-b1c2-2fd3cc685c1d runid=46930771ed1cff73b50bf5c153000aa904eb5c9c read=100 ch=493 start_time=2017-10-09T18:11:16Z
CCCGGAAAAUGGUGAAGAAAAUUGAAAUCAGCCAGCACGUCCGUUAAGUCACUUGCUUUACCGCGGCAAACCAAGAUGAAGACGAGCUGUGGGAUCUGGCACUA
CUGUGGUUCCAUUGCAUGAACGGGAAGACAGUGGCUGGCGGGUGCCCUGGACGUACAAAUACCACUCCAAUUGUCACGGUAAAGUCCGCCAUCAGAAGACUGAA
GGAGUUGUAGACCAGUAGACGUUCCAUACACAUUGAGACACUACUGGCCUAUAAUAAUUAAAUGGGUUAUUAAUUUAUUUAUGGCUAACAAAUUGUUCCGAGCU
CGUAUUAAACAGAUAUCGAUGUUGUAUUGUUGUAGUAGUAUUGAAGAGCAAAUCCCACCCAUCCUUCCAUCAACAACCUCCCGUUAUUAUACCGUUAUCCCACC
GCCUACCAUCUUCCCAUAAAAUCCAUC
+
$)/*+7B:314:3.,/.6C;4.*'+69-.14:221'%&#"+)'$$%*)'$%&&)*''(+"$&$%)1*.:/0:7522222/--**--*++*/9>/0-&*('%%%)
,+&031=12+(**)#$#$$'&%((-.-4524,,4*+-:.-./(('@7-)5$'%)))3.,)**-),--/*(/0)(%+1.7*+6)+*7:32&'&*,,(/(('.-1/
3.+../)$-/29:66,*-,&.+.8,(#'&&&')1-//.--((%)(111+''&11,2(%&*./,)5..*'*%.0011%$%%#%'-&(-5+,@6>9;'-)5)**%$
#+*,,,15.''%(*)++,,4,---/064'))()($%#%''*-%&'$'##$$)&'+.%+4,(%'*&$/(&''(0(%/',$,.(&)'#,-$$$'-"$$$$&.+%($
"*+$$$$$%$$#0:*'&%&'+#$&$$"

Annotation

>ENSMUST00000193812.1|ENSMUSG00000102693.1|OTTMUSG00000049935.1|OTTMUST00000127109.1|4933401J01Rik-201|4933401J01Rik|1070|TEC|
AAGGAAAGAGGATAACACTTGAAATGTAAATAAAGAAAATACCTAATAAAAATAAATAAA
AACATGCTTTCAAAGGAAATAAAAAGTTGGATTCAAAAATTTAACTTTTGCTCATTTGGT
ATAATCAAGGAAAAGACCTTTGCATATAAAATATATTTTGAATAAAATTCAGTGGAAGAA
TGGAATAGAAATATAAGTTTAATGCTAAGTATAAGTACCAGTAAAAGAATAATAAAAAGA
AATATAAGTTGGGTATACAGTTATTTGCCAGCACAAAGCCTTGGGTATGGTTCTTAGCAC
TAAGGAACCAGCCAAATCACCAACAAACAGAGGCATAAGGTTTTAGTGTTTACTATTTGT
ACTTTTGTGGATCATCTTGCCAGCCTGTAGTGCAACCATCTCTAATCCACCACCATGAAG
GGAACTGTGATAATTCACTGGGCTTTTTCTGTGCAAGATGAAAAAAAGCCAGGTGAGGCT
GATTTATGAGTAAGGGATGTGCATTCCTAACTCAAAAATCTGAAATTTGAAATGCCGCCC

Mapped

Takes a file with the same format as experimental and also a PAF file with the following format:

0M1I3M2D4M3D1M1D10M4I11M1D25M1D6M1D10M1D10M
0e04dd74-26bd-47e3-91bf-0e6e97310067    795 2   410 -   ENST00000584828.5|ENSG0000018406
0.10|OTTHUMG00000132868.4|OTTHUMT00000444515.1|ADAP2-209|ADAP2|907|protein_coding|  907 398 
798 344 432 1   NM:i:88 ms:i:336    AS:i:336    nn:i:0  tp:A:P  cm:i:7  s1:i:82
s2:i:67 dv:f:0.1443 cg:Z:4M1I19M2D15M1I8M4I1M1D6M1I29M1D1M2D13M1D5M2I4M1D21M3I28M2I11M3I8M1I13M2I16M
3D12M1I2M3D5M2I16M2I14M4D12M1I9M4I47M2D1M3D24M2I7M1D25M
0e04dd74-26bd-47e3-91bf-0e6e97310067    795 2   405 -   ENST00000585130.5|ENSG0000018406
0.10|OTTHUMG00000132868.4|OTTHUMT00000444510.1|ADAP2-211|ADAP2|2271|nonsense_mediated_decay|    2271    
1366    1762    340 426 0   NM:i:86 ms:i:334    AS:i:334    nn:i:0  tp:A:S  cm:i:6  
s1:i:67 dv:f:0.1427 cg:Z:19M2D15M1I7M1I3M2I6M1I29M1D1M2D13M1D5M2I4M1D21M3I28M2I11M3I8M1I13M2I16M3D12
M1I2M3D5M2I16M2I14M4D12M1I9M4I47M2D1M3D24M2I7M1D25M
0e04dd74-26bd-47e3-91bf-0e6e97310067    795 2   405 -   ENST00000330889.7|ENSG0000018406
0.10|OTTHUMG00000132868.4|OTTHUMT00000256346.1|ADAP2-201|ADAP2|2934|protein_coding| 2934    1446    
1842    340 426 0   NM:i:86 ms:i:334    AS:i:334    nn:i:0  tp:A:S  cm:i:6  s1:i:67

You can read more about this format in the PAF (Pairwise mApping Format) specification.
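For mapped input, the known orientation of each read comes from the PAF file. In PAF, each record is tab-separated; the first column is the read name and the fifth column is the relative strand ('+' or '-'). The sketch below shows one way to extract a strand label per read. It is an assumption-laden illustration, not the ReorientExpress code: the function name strand_by_read is hypothetical, the example records are made up, and keeping the first alignment seen for each read is a simplification chosen here.

```python
def strand_by_read(paf_lines):
    """Return {read_name: strand} from PAF lines, keeping the first alignment per read.

    strand_by_read is a hypothetical helper, not part of ReorientExpress.
    """
    strands = {}
    for line in paf_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # not a complete PAF record (12 mandatory columns)
        read_name, strand = fields[0], fields[4]
        if strand in ("+", "-"):
            strands.setdefault(read_name, strand)  # ignore later alignments
    return strands


# Illustrative records with made-up values, tab-separated as in a real PAF file.
example = [
    "read_a\t795\t2\t410\t-\ttranscript_1\t907\t398\t798\t344\t432\t1\ttp:A:P",
    "read_a\t795\t2\t405\t-\ttranscript_2\t2271\t1366\t1762\t340\t426\t0\ttp:A:S",
    "read_b\t600\t0\t590\t+\ttranscript_3\t650\t10\t600\t500\t590\t60\ttp:A:P",
]
labels = strand_by_read(example)
print(labels)
```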

Examples of possible outputs:

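Depending on the chosen mode, the output is a trained model file (train), accuracy metrics displayed on the screen (test), or a table of predictions (predict) such as the following: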

Index ForwardSequence Score orientation
0 ATGTTGAATAGTTCAAGAAAATATGCTTGTCGTTCCCTATTCAGACAAGCGAACGTCTCA 0.8915960788726807 0
1 TTGAGGAGTGATAACAAGGAAAGCCCAAGTGCAAGACAACCACTAGATAGGCTACAACTA 0.9746999740600586 1
2 AAGGCCACCATTGCTCTATTGTTGCTAAGTGGTGGGACGTATGCCTATTTATCAAGAAAA 0.9779879450798035 0

Note: orientation '0' represents '+' and orientation '1' represents '-'. The '-' reads are reverse-complemented before being written to the 'ForwardSequence' column, so every sequence in that column is in the predicted forward orientation.
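To make the orientation convention concrete, here is a small Python sketch that parses rows of the prediction table and recovers the sequence as it appeared in the original read. It assumes the columns are whitespace-separated as displayed above; the actual delimiter of the output file may differ. The helper names are hypothetical and not part of ReorientExpress.

```python
# Complement table; U is treated like T, so the result uses the DNA alphabet.
_COMPLEMENT = str.maketrans("ACGTUNacgtun", "TGCAANtgcaan")


def reverse_complement(sequence):
    """Return the reverse complement of a nucleotide sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def parse_prediction_row(row):
    """Split one table row into (index, forward_sequence, score, orientation)."""
    index, forward_sequence, score, orientation = row.split()
    return int(index), forward_sequence, float(score), int(orientation)


def sequence_as_read(forward_sequence, orientation):
    """Undo the re-orientation: rows with orientation 1 were reverse-complemented."""
    if orientation == 1:
        return reverse_complement(forward_sequence)
    return forward_sequence


rows = [
    "0 ATGTTGAATAGTTCAAGAAAATATGCTTGTCGTTCCCTATTCAGACAAGCGAACGTCTCA 0.8915960788726807 0",
    "1 TTGAGGAGTGATAACAAGGAAAGCCCAAGTGCAAGACAACCACTAGATAGGCTACAACTA 0.9746999740600586 1",
]
parsed = [parse_prediction_row(row) for row in rows]
for index, forward, score, orientation in parsed:
    strand = "-" if orientation == 1 else "+"
    print(index, strand, sequence_as_read(forward, orientation)[:20])
```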


Usage example


Note: The commands below are for the MLP model. Similar commands can be used for the CNN model by replacing reorientexpress.py with reorientexpress-cnn.py.

To train a model:

reorientexpress.py -train -data path_to_data -source annotation --v -output my_model

This trains a model with the data stored in path_to_data, which is an annotation file such as a transcriptome, and outputs a file called my_model.model, which can later be used to make predictions. It also prints relevant information.

Example on test_case provided in the repo:

reorientexpress.py -train -data ./test_case/annotation/gencode.vM19.transcripts_50k.fa -source annotation --v -output my_model

or

reorientexpress-cnn.py -train -data ./test_case/annotation/gencode.vM19.transcripts_50k.fa -source annotation --v -output my_model

To make predictions:

reorientexpress.py -predict -data path_to_data -source experimental -model path_to_model -output my_predictions

or

reorientexpress.py -output_fastq -predict -data path_to_data -source experimental -model path_to_model -output my_predictions

This takes the experimental data stored in path_to_data and the model stored in path_to_model and predicts the 5'-to-3' orientation of the reads, i.e. it converts to forward orientation the reads that the model predicts to be reverse-complemented, and writes the results to my_predictions.csv. The output format is the same as shown in the 'Examples of possible outputs' section above.

In the saved_models/ folder we provide a model trained with the human transcriptome annotation and a model trained with the Saccharomyces cerevisiae transcriptome annotation. They can be used directly with the "-model" flag.

Example on test_case provided in the repo:

reorientexpress.py -predict -data ./test_case/experimental/Hopkins_Run1_20171011_1D.pass.dedup_60_unique_50k.fastq -model ./saved_models/Hs_transcriptome_mlp.model -source experimental -output my_predictions

or

reorientexpress-cnn.py -predict -data ./test_case/experimental/Hopkins_Run1_20171011_1D.pass.dedup_60_unique_50k.fastq -model ./saved_models/Hs_transcriptome_CNN.model -source experimental -output my_predictions

or

reorientexpress-cnn.py -output_fastq -predict -data ./test_case/experimental/Hopkins_Run1_20171011_1D.pass.dedup_60_unique_50k.fastq -model ./saved_models/Hs_transcriptome_CNN.model -source experimental -output my_predictions

To test the accuracy of the model:

reorientexpress.py -test -data path_to_data -annotation path_of_paf_file -source mapped -model path_to_model 

Example on test_case provided in the repo:

reorientexpress.py -test -data ./test_case/mapped/Hopkins_Run1_20171011_1D.pass.dedup_60_unique_2000.fastq -annotation ./test_case/mapped/cdna_human_no_secondary_mapq_60_unique_2000.paf -model ./saved_models/Hs_transcriptome_mlp.model -source mapped

or

reorientexpress-cnn.py -test -data ./test_case/mapped/Hopkins_Run1_20171011_1D.pass.dedup_60_unique_2000.fastq -annotation ./test_case/mapped/cdna_human_no_secondary_mapq_60_unique_2000.paf -model ./saved_models/Hs_transcriptome_CNN.model -source mapped

The output accuracy metrics (precision, recall, F1-score, support) will be displayed on the screen.
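For readers unfamiliar with these metrics, the sketch below computes precision, recall, F1-score and support for one orientation class from true and predicted labels. It is a plain-Python illustration of the standard definitions, not the code ReorientExpress uses to produce its report. The function name and the example labels are made up.

```python
def classification_metrics(true_labels, predicted_labels, positive):
    """Return (precision, recall, f1, support) for the class `positive`."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    support = tp + fn  # number of true examples of this class
    return precision, recall, f1, support


# Made-up labels: 0 stands for '+' and 1 for '-', as in the output table.
true_orientation = [0, 0, 1, 1, 1, 0]
predicted_orientation = [0, 1, 1, 1, 0, 0]
for label in (0, 1):
    precision, recall, f1, support = classification_metrics(
        true_orientation, predicted_orientation, positive=label
    )
    print(label, round(precision, 3), round(recall, 3), round(f1, 3), support)
```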


How to cite ReorientExpress


Ruiz-Reche A, Srivastava A, Indi JA, de la Rubia I, Eyras E. ReorientExpress: reference-free orientation of nanopore cDNA reads with deep learning. Genome Biol. 2019 Nov 29;20(1):260. doi: 10.1186/s13059-019-1884-z. https://www.ncbi.nlm.nih.gov/pubmed/31783882