JMLab-tifrh / idpGPT

GPT models of PLMs trained to generate novel protein sequences when supplied with a prompt
GNU General Public License v3.0

idpGPT

GPT Protein Language Models to generate novel protein sequences.
Three PLMs have been trained:
LLPS+ GPT : generates sequences highly prone to liquid-liquid phase separation (LLPS). Model saved as llps_plus_gpt.pt.
LLPS- GPT : generates sequences that can undergo LLPS, but to a lesser extent than LLPS+ sequences. Model saved as llps_minus_gpt.pt.
PDB* GPT : generates sequences that would not undergo LLPS unless very drastic conditions are applied. Model saved as no_PS.pt.

Google Colab notebooks for generation and classification are available at the following links:

Usage

Please set the following environment variable before using the scripts:
bash : export PYTHONPATH=$PYTHONPATH:<path to the lib directory>
csh : setenv PYTHONPATH "$PYTHONPATH:<path to the lib directory>"
zsh : export PYTHONPATH=$PYTHONPATH:<path to the lib directory>
fish : set -x PYTHONPATH $PYTHONPATH <path to the lib directory>

lib is a directory in this repository. It contains the libraries used to train the PLMs and generate sequences.

The variable above must be set every time a new terminal is opened.
A better approach is to add the line to the configuration file of your shell:
bash : ~/.bashrc
csh : ~/.cshrc
zsh : ~/.zshrc
fish : ~/.config/fish/config.fish

After adding the environment variable, source the respective configuration file.
For example, in bash, run source ~/.bashrc.
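
To confirm that the variable took effect, you can check whether the lib directory appears on Python's import search path. The snippet below is a minimal sketch and not part of this repository: the helper name lib_on_path is hypothetical, and the path you pass in is whatever you used for <path to the lib directory>.

```python
import os
import sys
from typing import Iterable, Optional


def lib_on_path(lib_dir: str, entries: Optional[Iterable[str]] = None) -> bool:
    """Return True if lib_dir is one of the import search path entries.

    entries defaults to sys.path, which Python builds from PYTHONPATH
    (among other sources) at startup. This helper is hypothetical and
    is not shipped with the repository.
    """
    if entries is None:
        entries = sys.path
    # Normalise both sides so trailing slashes and symlinks do not matter.
    target = os.path.realpath(lib_dir)
    return any(os.path.realpath(entry) == target for entry in entries if entry)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Pass the lib directory as the first command-line argument.
    status = "found" if lib_on_path(sys.argv[1]) else "not found"
    print(f"{sys.argv[1]}: {status} on sys.path")
```

If the directory is reported as not found, the usual causes are a terminal that has not sourced the configuration file yet or a typo in the path.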

Each script has its own usage instructions. Type python <script> -h to see the help.
E.g. `python train_gpt.py -h`

The output format when using the sequence generator

The generated sequences are saved in the fasta format, which is a standard format for storing protein sequences.
Fasta files are plain text files and generally have the extension .fa or .fasta. Below is an example of a file in fasta format:

>prot1
RGGAFGGKLVFFSSRGG
>prot 2
MAVCQYPLVVQQK

Each line starting with ">" contains a sequence identifier. Identifiers do not have to be unique, but the fasta format requires every sequence to have one. The protein sequence is written below its identifier and before the next identifier. Long sequences can be split across multiple lines for readability. An example with a longer dummy sequence is given below:

>dummy sequence (95 residues)
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>a shorter sequence
VVVVVVVVV
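
To post-process the generated files, the format described above can be read with a few lines of Python. The following is a minimal sketch using only the standard library. The function names parse_fasta and format_fasta are hypothetical and are not part of this repository's lib, and the line width of 60 is an arbitrary choice.

```python
from typing import Iterable, Iterator, List, Tuple


def parse_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from the lines of a fasta file.

    Sequence lines that follow an identifier are joined together, so
    sequences split across several lines are returned as one string.
    """
    header = None
    chunks: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # emit the previous record
            header = line[1:].strip()
            chunks = []
        elif header is None:
            raise ValueError("sequence data found before the first '>' line")
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)  # emit the last record


def format_fasta(records: Iterable[Tuple[str, str]], width: int = 60) -> str:
    """Return fasta text with each sequence wrapped to `width` characters."""
    out: List[str] = []
    for header, seq in records:
        out.append(f">{header}")
        out.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    example = ">prot1\nRGGAFGGKLVFFSSRGG\n>a shorter sequence\nVVVV\nVVVVV\n"
    for name, seq in parse_fasta(example.splitlines()):
        print(name, len(seq), seq)
```

For a file on disk, pass the open file object directly, for example parse_fasta(open("generated.fasta")), where generated.fasta is a placeholder file name.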

Requirements

Disclaimer

We used the code in https://github.com/Infatoshi/fcc-intro-to-llms as a reference for our own code.