GPT protein language models to generate novel protein sequences. Three PLMs have been trained:
LLPS+ GPT : generates sequences highly prone to liquid-liquid phase separation (LLPS); model saved as llps_plus_gpt.pt
LLPS- GPT : generates sequences that can undergo LLPS, but with lesser propensity than LLPS+; model saved as llps_minus_gpt.pt
PDB* GPT : generates sequences that would not undergo LLPS unless very drastic conditions are applied; model saved as no_PS.pt
Google Colab notebooks for generation and classification are available at the following links:
Please set the following environment variable to use the scripts:
bash : export PYTHONPATH=$PYTHONPATH:<path to the lib directory>
csh : setenv PYTHONPATH "$PYTHONPATH:<path to the lib directory>"
zsh : export PYTHONPATH=$PYTHONPATH:<path to the lib directory>
fish : set -x PYTHONPATH $PYTHONPATH <path to the lib directory>
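As a quick sanity check, you can confirm that Python sees the lib directory after setting the variable. This is a sketch assuming bash; /path/to/repo/lib is a placeholder for the actual location of the lib directory:

```shell
# Placeholder path: substitute the real location of this repository's
# lib directory.
export PYTHONPATH="$PYTHONPATH:/path/to/repo/lib"
# Entries from PYTHONPATH are added to sys.path, so this prints True
# once the variable is set correctly.
python3 -c 'import sys; print("/path/to/repo/lib" in sys.path)'
```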
lib is the directory present in this repository; it contains the libraries used to train the PLMs and generate sequences.
The above variable must be set every time a terminal is opened. Hence, a better way is to put the line in the respective shell configuration file:
bash : ~/.bashrc
csh : ~/.cshrc
zsh : ~/.zshrc
fish : ~/.config/fish/config.fish
After adding the environment variable, source the respective configuration file. For example, in bash: source ~/.bashrc
Each script has its own usage instructions; type python <script> -h to see the help message, e.g. `python train_gpt.py -h`.
The generated sequences are saved in the FASTA format, a standard format for storing protein sequence(s). FASTA files are plain text files and generally have the extension .fa or .fasta. Below is an example of a file in FASTA format:
>prot1
RGGAFGGKLVFFSSRGG
>prot 2
MAVCQYPLVVQQK
Line(s) starting with ">" contain sequence identifier(s). An identifier does not have to be unique for each sequence, but the FASTA format requires one. The protein sequence is written below its identifier and before the next identifier. For readability, a long protein sequence can be split across multiple lines. An example with a longer dummy sequence is given below:
>dummy sequence (95 residues)
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>a shorter sequence
VVVVVVVVV
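For processing generated files programmatically, the FASTA format described above can be read with a few lines of Python. This is a minimal sketch; parse_fasta is an illustrative helper, not a function from this repository's lib:

```python
def parse_fasta(text):
    """Parse FASTA-formatted text into (identifier, sequence) pairs,
    joining sequences that span multiple lines."""
    records = []
    identifier, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # New record: flush the previous one, if any.
            if identifier is not None:
                records.append((identifier, "".join(chunks)))
            identifier, chunks = line[1:], []
        else:
            chunks.append(line)
    if identifier is not None:
        records.append((identifier, "".join(chunks)))
    return records

example = """>prot1
RGGAFGGKLVFFSSRGG
>prot 2
MAVCQYPLVVQQK"""
print(parse_fasta(example))
# [('prot1', 'RGGAFGGKLVFFSSRGG'), ('prot 2', 'MAVCQYPLVVQQK')]
```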
We used the code at https://github.com/Infatoshi/fcc-intro-to-llms as a reference for our code.