DeepPrime2Sec is developed for deep learning-based prediction of protein secondary structure from the protein primary sequence. It facilitates the evaluation of different features in this task, including one-hot vectors, biophysical features, protein sequence embedding (ProtVec), deep contextualized embedding (known as ELMo), and the Position Specific Scoring Matrix (PSSM).
In addition to the role of features, it allows for the evaluation of various deep learning architectures including the following models/mechanisms and certain combinations: Bidirectional Long Short-Term Memory (BiLSTM), convolutional neural network (CNN), highway connections, attention mechanism, recurrent neural random fields, and gated multi-scale CNN.
Our results suggest that PSSM concatenated with one-hot vectors is the most important feature combination for secondary structure prediction. Using the CNN-BiLSTM network, we achieved an accuracy of 69.9%, and 70.4% using an ensemble of the top-k models, for 8-class protein secondary structure prediction on the CB513 dataset, the most challenging dataset for protein secondary structure prediction.
@article {Asgari705426,
author = {Asgari, Ehsaneddin and Poerner, Nina and McHardy, Alice C. and Mofrad, Mohammad R.K.},
title = {DeepPrime2Sec: Deep Learning for Protein Secondary Structure Prediction from the Primary Sequences},
elocation-id = {705426},
year = {2019},
doi = {10.1101/705426},
publisher = {Cold Spring Harbor Laboratory},
URL = {https://www.biorxiv.org/content/early/2019/07/18/705426},
eprint = {https://www.biorxiv.org/content/early/2019/07/18/705426.full.pdf},
journal = {bioRxiv}
}
Through error analysis on the best performing model, we showed that the misclassification is significantly more common at positions that undergo secondary structure transitions, which is most likely due to the inaccurate assignments of the secondary structure at the boundary regions. Notably, when ignoring amino acids at secondary structure transitions in the evaluation, the accuracy increases to 90.3%. Furthermore, the best performing model mostly mistook similar structures for one another, indicating that the deep learning model inferred high-level information on the secondary structure.
DeepPrime2Sec and the used datasets are available here under the Apache 2 license.
To install the libraries required for running DeepPrime2Sec, use the following command:
pip install -r installations/requirements.txt
Alternatively, you may create a conda environment with the required libraries using the following command:
conda env create --name deepprime2sec --file installations/deepprime2sec.yml
Subsequently, you need to activate the created virtual environment before running:
source activate deepprime2sec
Before running the software, make sure to download the training dataset (which was too large for git) from the following URL, extract it, and copy the extracted files to the dataset directory.
http://deepbio.info/proteomics/datasets/deepprime2sec/train_files.tar.gz
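As a minimal sketch of this step in Python, the helpers below download the archive and extract it using only the standard library. The function names are hypothetical and not part of DeepPrime2Sec, and the target directory name `dataset` is taken from the instruction above. Whether the extracted files end up directly in that directory depends on the layout of the archive, so check the result after extraction.

```python
import tarfile
import urllib.request
from pathlib import Path

TRAIN_FILES_URL = "http://deepbio.info/proteomics/datasets/deepprime2sec/train_files.tar.gz"


def extract_archive(archive_path, target_dir):
    """Extract a .tar.gz archive into target_dir and return the member names."""
    target = Path(target_dir)
    target.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        archive.extractall(target)
    return names


def download_training_files(target_dir="dataset", url=TRAIN_FILES_URL):
    """Download the training archive next to target_dir, then extract it there."""
    archive_path = Path(target_dir).parent / "train_files.tar.gz"
    urllib.request.urlretrieve(url, archive_path)
    return extract_archive(archive_path, target_dir)
```

Calling `download_training_files()` from the repository root would fetch the archive and unpack it into `dataset`.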
To run DeepPrime2Sec, use the command below.
All details of the deep learning model (architecture, hyperparameters, and training parameters) are specified in a YAML config file.
The following sections describe how this file should be created. Examples are also provided in sample_configs/*.yaml.
python deepprime2sec.py --config sample_configs/model_a.yaml
We experiment with five sets of protein features to understand which features are essential for protein secondary structure prediction. Although PSSM was reported as an important feature for secondary structure prediction in 1999 (Jones et al., 1999), it was still unclear whether recently introduced distributed representations can outperform PSSM in this task. For a systematic comparison, the following features are used: one-hot vectors, protein sequence embedding (ProtVec), deep contextualized embedding (ELMo), PSSM, and biophysical features.
To use a combination of features in the software, list the corresponding keywords under the features_to_use key, which is part of the model parameters.
The features included in the config are concatenated to form the input:
model_paramters:
  features_to_use:
    - onehot
    - embedding
    - elmo
    - pssm
    - biophysical
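The concatenation of the selected features can be pictured with a small Python sketch. The helper name and the toy feature dimensions below are made up for illustration and do not reflect the real feature sizes or the internal implementation of DeepPrime2Sec. Each feature provides one vector per residue, and the vectors are joined per position in the order listed in features_to_use.

```python
def concatenate_features(per_feature, features_to_use):
    """Concatenate per-residue feature vectors in the order of features_to_use.

    per_feature maps a feature name to a list with one vector per residue.
    """
    lengths = {len(per_feature[name]) for name in features_to_use}
    if len(lengths) != 1:
        raise ValueError("all features must cover the same number of residues")
    sequence_length = lengths.pop()
    combined = []
    for position in range(sequence_length):
        vector = []
        for name in features_to_use:
            vector.extend(per_feature[name][position])
        combined.append(vector)
    return combined


# Toy data for a 2-residue sequence; the dimensions are hypothetical.
toy_features = {
    "onehot": [[1, 0, 0], [0, 1, 0]],
    "pssm": [[0.2, 0.7], [0.9, 0.1]],
}
model_input = concatenate_features(toy_features, ["onehot", "pssm"])
print(model_input)  # [[1, 0, 0, 0.2, 0.7], [0, 1, 0, 0.9, 0.1]]
```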
The following is an example of the parameters for running the training and storing the results (run_parameters).
run_parameters:
  domain_name: baseline
  setting_name: baseline
  epochs: 100
  test_batch_size: 100
  train_batch_size: 64
  patience: 10
  gpu: 1
domain_name and setting_name
The results of the model are saved to the results directory. The domain_name and setting_name parameters are used as the names of a directory and a sub-directory inside results, where the model weights and results are stored.
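A short Python sketch of the resulting directory layout is given below. It assumes the nesting results/domain/setting described above; the helper name is hypothetical, and the names of the files stored inside the directory are not covered here.

```python
from pathlib import PurePosixPath


def result_directory(run_parameters, root="results"):
    """Build the output directory from the domain and setting names."""
    return (
        PurePosixPath(root)
        / run_parameters["domain_name"]
        / run_parameters["setting_name"]
    )


run_parameters = {"domain_name": "baseline", "setting_name": "baseline"}
print(result_directory(run_parameters))  # results/baseline/baseline
```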
epochs and batch sizes
epochs refers to the number of times to iterate over the training data, and the batch sizes (train_batch_size and test_batch_size) refer to the size of the data split used in each step.
For proper and faster learning, we have already performed bucketing (sorting the training sequences by length), which also minimizes the padding operations.
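The effect of bucketing on padding can be illustrated with a small Python sketch. The function names and toy sequences are illustrative assumptions, not the preprocessing code used by DeepPrime2Sec. Within a batch, every sequence is padded to the length of the longest one, so grouping sequences of similar length reduces the total padding.

```python
def make_batches(sequences, batch_size):
    """Split sequences into consecutive batches without reordering."""
    return [sequences[i:i + batch_size] for i in range(0, len(sequences), batch_size)]


def bucket_batches(sequences, batch_size):
    """Sort sequences by length first, then split into batches."""
    return make_batches(sorted(sequences, key=len), batch_size)


def padding_needed(batches):
    """Count padding positions when each batch is padded to its longest sequence."""
    total = 0
    for batch in batches:
        longest = max(len(seq) for seq in batch)
        total += sum(longest - len(seq) for seq in batch)
    return total


sequences = ["ACDEFGHIKL", "AC", "ACDEFGHIK", "ACD"]
print(padding_needed(make_batches(sequences, 2)))    # 14
print(padding_needed(bucket_batches(sequences, 2)))  # 2
```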
patience
To avoid overfitting, we perform early stopping: if the performance keeps improving on the training set but not on the test set for a few epochs, we stop the training, because this means that the model has specialized to the training data by memorizing it and cannot generalize further to the test set. patience determines how many epochs to wait for an improvement on the test set.
gpu
The ID of the GPU device to use for training/testing the model.
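The role of patience in early stopping can be sketched in a few lines of Python. This is a generic illustration under stated assumptions, not the training loop of DeepPrime2Sec: the function name is hypothetical, the score is assumed to be a test-set metric where higher is better, and only a strict increase counts as an improvement.

```python
def early_stopping_epoch(test_scores, patience):
    """Return (best_epoch, last_epoch_run) for per-epoch test scores.

    Training stops once the score has not improved for `patience` epochs in a row.
    """
    best_score = float("-inf")
    best_epoch = -1
    epochs_without_improvement = 0
    for epoch, score in enumerate(test_scores):
        if score > best_score:
            best_score, best_epoch = score, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return best_epoch, epoch
    return best_epoch, len(test_scores) - 1


scores = [0.60, 0.65, 0.68, 0.67, 0.68, 0.66, 0.69]
print(early_stopping_epoch(scores, patience=3))   # (2, 5)
print(early_stopping_epoch(scores, patience=10))  # (6, 6)
```

With a patience of 3 the run stops at epoch 5 and misses the later improvement at epoch 6, which shows the trade-off between training time and the chance of finding a better model.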
For the details of the CNN + BiLSTM model, please refer to the paper. To select this model, set deep_learning_model: a_cnn_bilstm
convs refers to the convolution window sizes (the following example uses five window sizes: 3, 5, 7, 11, and 21).
filter_size is the size of the convolutional filters.
dense_size is the size of the feed-forward layers used before and after the LSTM.
dropout_rate is the dropout rate.
lstm_size is the hidden size of the bidirectional LSTM.
lr is the learning rate.
features_to_use is already covered in 3.1 Features.
Sample config file
deep_learning_model: a_cnn_bilstm
model_paramters:
  convs:
    - 3
    - 5
    - 7
    - 11
    - 21
  filter_size: 256
  dense_size: 1000
  dropout_rate: 0.5
  lstm_size: 1000
  lr: 0.001
  features_to_use:
    - onehot
    - pssm
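To illustrate how several convolution window sizes can be combined, the following Python sketch applies one fixed averaging filter per window size and concatenates the outputs per residue. It is a simplified illustration under assumptions that the text above does not confirm: zero padding that keeps the output length equal to the input length, and a single hand-set filter per window instead of filter_size learned filters. The function names are hypothetical.

```python
def conv1d_same(signal, kernel):
    """Slide the kernel over the signal (cross-correlation, as in deep learning
    libraries) with zero padding so the output is as long as the input.

    Assumes an odd kernel length, as in the window sizes 3, 5, 7, 11, and 21.
    """
    window = len(kernel)
    pad = (window - 1) // 2
    padded = [0.0] * pad + list(signal) + [0.0] * pad
    return [
        sum(k * x for k, x in zip(kernel, padded[i:i + window]))
        for i in range(len(signal))
    ]


def multi_window_features(signal, convs):
    """Apply one averaging filter per window size and concatenate per position."""
    outputs = [conv1d_same(signal, [1.0 / w] * w) for w in convs]
    return [[out[i] for out in outputs] for i in range(len(signal))]


features = multi_window_features([1.0] * 30, convs=[3, 5, 7, 11, 21])
print(len(features), len(features[0]))  # 30 5
```

Because every window size produces one output per residue, the results can be stacked position by position, and larger windows summarize a wider sequence context around each residue.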
For the details of the CNN + BiLSTM + Highway Connection of PSSM model, please refer to the paper. To select this model, set deep_learning_model: model_b_cnn_bilstm_highway
convs refers to the convolution window sizes (the following example uses five window sizes: 3, 5, 7, 11, and 21).
filter_size is the size of the convolutional filters.
dense_size is the size of the feed-forward layers used before and after the LSTM.
dropout_rate is the dropout rate.
lstm_size is the hidden size of the bidirectional LSTM.
lr is the learning rate.
features_to_use is already covered in 3.1 Features.
use_CRF indicates whether to include a CRF layer at the end.
Sample config file
deep_learning_model: model_b_cnn_bilstm_highway
model_paramters:
  convs:
    - 3
    - 5
    - 7
    - 11
    - 21
  filter_size: 256
  dense_size: 1000
  dropout_rate: 0.5
  lstm_size: 1000
  lr: 0.001
  features_to_use:
    - onehot
    - pssm
  use_CRF: false
For the details of the CNN + BiLSTM + Conditional Random Field Layer model, please refer to the paper. To select this model, set deep_learning_model: model_c_cnn_bilstm_crf
convs refers to the convolution window sizes (the following example uses five window sizes: 3, 5, 7, 11, and 21).
filter_size is the size of the convolutional filters.
dense_size is the size of the feed-forward layers used before and after the LSTM.
dropout_rate is the dropout rate.
lstm_size is the hidden size of the bidirectional LSTM.
lr is the learning rate.
features_to_use is already covered in 3.1 Features.
CRF_input_dim is the input dimension of the CRF layer.
Sample config file
deep_learning_model: model_c_cnn_bilstm_crf
model_paramters:
  convs:
    - 3
    - 5
    - 7
    - 11
    - 21
  filter_size: 256
  dense_size: 1000
  dropout_rate: 0.5
  lstm_size: 1000
  lr: 0.001
  features_to_use:
    - onehot
    - pssm
  CRF_input_dim: 200
For the details of the CNN + BiLSTM + Attention mechanism model, please refer to the paper. To select this model, set deep_learning_model: model_d_cnn_bilstm_attention
attention_type is the attention type, to be selected from additive or multiplicative.
attention_units is the number of attention units.
convs refers to the convolution window sizes (the following example uses five window sizes: 3, 5, 7, 11, and 21).
filter_size is the size of the convolutional filters.
dense_size is the size of the feed-forward layers used before and after the LSTM.
dropout_rate is the dropout rate.
lstm_size is the hidden size of the bidirectional LSTM.
lr is the learning rate.
features_to_use is already covered in 3.1 Features.
use_CRF indicates whether to include a CRF layer at the end.
Sample config file
deep_learning_model: model_d_cnn_bilstm_attention
model_paramters:
  attention_type: additive
  attention_units: 32
  convs:
    - 3
    - 5
    - 7
    - 11
    - 21
  filter_size: 256
  dense_size: 1000
  dropout_rate: 0.5
  lstm_size: 1000
  lr: 0.001
  features_to_use:
    - onehot
    - pssm
  use_CRF: false
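The difference between the two attention_type options can be sketched with the standard textbook scoring functions. Whether DeepPrime2Sec uses exactly these forms is an assumption; the function names are hypothetical, and the number of rows in the additive projection matrices is assumed to play the role of attention_units. Additive attention scores a query-key pair as v · tanh(W_q q + W_k k), while multiplicative attention uses the bilinear form q · (W k).

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def additive_score(query, key, w_q, w_k, v):
    """Additive score: v . tanh(W_q q + W_k k)."""
    hidden = [math.tanh(a + b) for a, b in zip(matvec(w_q, query), matvec(w_k, key))]
    return dot(v, hidden)


def multiplicative_score(query, key, w):
    """Multiplicative score: q . (W k)."""
    return dot(query, matvec(w, key))


def softmax(scores):
    """Turn scores into attention weights that sum to one."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
weights = softmax([multiplicative_score(query, key, identity) for key in keys])
```

With the identity matrix as W, the multiplicative score reduces to a plain dot product, so the key that matches the query receives the larger attention weight.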
For the details of the CNN model, please refer to the paper. To select this model, set deep_learning_model: model_e_cnn
convs refers to the convolution window sizes (the following example uses five window sizes: 3, 5, 7, 11, and 21).
filter_size is the size of the convolutional filters.
dense_size is the size of the feed-forward layers after the concatenation of the convolution results.
dropout_rate is the dropout rate.
lr is the learning rate.
features_to_use is already covered in 3.1 Features.
use_CRF indicates whether to include a CRF layer at the end.
Sample config file
deep_learning_model: model_e_cnn
model_paramters:
  convs:
    - 3
    - 5
    - 7
    - 11
    - 21
  filter_size: 256
  dense_size: 1000
  dropout_rate: 0.5
  lstm_size: 1000
  lr: 0.001
  features_to_use:
    - onehot
    - pssm
  use_CRF: false
For the details of the Multiscale CNN model, please refer to the paper. To select this model, set deep_learning_model: model_f_multiscale_cnn
multiscalecnn_layers is the number of gated multiscale CNNs to stack.
cnn_regularizer is the regularization parameter for the CNN.
convs refers to the convolution window sizes (the following example uses five window sizes: 3, 5, 7, 11, and 21).
filter_size is the size of the convolutional filters.
dense_size is the size of the feed-forward layers after the concatenation of the convolution results.
dropout_rate is the dropout rate.
lr is the learning rate.
features_to_use is already covered in 3.1 Features.
use_CRF indicates whether to include a CRF layer at the end.
Sample config file
deep_learning_model: model_f_multiscale_cnn
model_paramters:
  cnn_regularizer: 5.0e-05
  multiscalecnn_layers: 3
  convs:
    - 3
    - 5
    - 7
    - 11
    - 21
  filter_size: 256
  dense_size: 1000
  dropout_rate: 0.5
  lstm_size: 1000
  lr: 0.001
  features_to_use:
    - onehot
    - pssm
  use_CRF: false
Create your own model using the existing models (model_a to model_f) as templates, and test its performance against the existing methods.
Finally, after training completes, DeepPrime2Sec generates a PDF report at results/$domain/$setting/report.pdf with the following information: