
DISAE

MSA-Regularized Protein Sequence Transformer toward Predicting Genome-Wide Chemical-Protein Interactions: Application to GPCRome Deorphanization


This work is licensed under a Creative Commons Attribution-NonCommercial 4.0 International License.


This repository contains the code to replicate the experiments for fine-tuning a classifier with pretrained ALBERT, as described in the DISAE paper.

----------- INSTRUCTIONS -----------

1. Install Prerequisites

2. Clone this repository
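For example (repository URL inferred from the project page):

    git clone https://github.com/XieResearchGroup/DISAE.git
    cd DISAE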

3. Download Data

All data can be downloaded here and should be placed under this repository's root, i.e., in the same directory as finetuning_train.py.

There will be four subdirectories in the data folder.
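For example, after unpacking (the archive and folder names below are illustrative assumptions, not the exact download contents):

    unzip data.zip -d .
    ls
    # data/  finetuning_train.py  ...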


4. Generate clusters (example commands for each step follow this list):

1. Cluster your protein dataset with `cdhit.sh`. The input is a FASTA file containing all protein sequences in your dataset.
2. Apply multiple sequence alignment to the clusters with Clustal Omega (`clustalo.sh`).
3. Build HMM profiles for the clusters with hmmbuild (`hmmer_build.sh`).
4. Redo the multiple sequence alignment of the HMP clusters against the HMM profiles with HMMER (`hmmer_align.sh`).
5. Construct the corpus (singlets and triplets, for the representative sequence and for all sequences) with `construct_hmp_singlets_and_triplets.py`. This step can take a long time on a single CPU; multiprocessing significantly reduces the computing time.
6. Generate TFRecords from the corpus with `create_tfrecords.sh`.
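The steps above are wrapped by the shell scripts named in the list; the sketch below shows the typical underlying commands, assuming one cluster's sequences are in `cluster_001.fasta` (file names and the identity threshold are illustrative assumptions, not the scripts' exact contents):

    # 1. Cluster all protein sequences at 90% identity with CD-HIT
    cd-hit -i all_proteins.fasta -o clusters90 -c 0.9 -n 5

    # 2. Align the sequences of one cluster with Clustal Omega
    clustalo -i cluster_001.fasta -o cluster_001.aln.fasta --outfmt=fasta

    # 3. Build an HMM profile from the cluster alignment
    hmmbuild cluster_001.hmm cluster_001.aln.fasta

    # 4. Re-align the cluster's sequences against its HMM profile
    hmmalign -o cluster_001.sto cluster_001.hmm cluster_001.fasta

    # 5. Build the singlet/triplet corpus (see the script for its actual arguments)
    python construct_hmp_singlets_and_triplets.py

    # 6. Convert the corpus to TFRecords
    bash create_tfrecords.sh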

5. Run Finetuning

To run the ALBERT model (default: frozen ALBERT transformer):

python finetuning_train.py --protein_embedding_type="albert"

To try other freezing options, change `frozen_list` to select which modules are frozen.

To run the LSTM model:

python finetuning_train.py --protein_embedding_type="lstm"
