Predicting O-GlcNAcylation Sites Using Cost-sensitive learning on Protein Language Model’s Embeddings
# LM-OGlcNAc-Site

Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction

Webserver

http://kcdukkalab.org/LMOGlcNAcSite/

Authors

Suresh Pokharel (1), Pawel Pratyush (1), Hamid D. Ismail (1), Junfeng Ma (2), Dukka B KC (1)
(1) Department of Computer Science, Michigan Technological University, Houghton, MI, USA
(2) Department of Oncology, Lombardi Comprehensive Cancer Center, Georgetown University Medical Center, Georgetown University, Washington, DC 20057, USA
Corresponding Author: dbkc@mtu.edu

Clone the Repository

If Git is installed on your system, clone the repository by running the following command in your terminal:

git clone git@github.com:KCLabMTU/LM-OGlcNAc-Site.git

- Or -

Download the Repository

If you do not have Git, or prefer not to use it, download the repository from GitHub as a zip file.

Install Libraries

Python version: 3.10.0

To install the required libraries, run the following command:

pip install -r requirements.txt

Required libraries and versions:

ankh==1.10.0
Bio==1.7.0
biopython==1.83
datasets==2.19.0
fair_esm==2.0.0
keras==2.8.0
numpy==1.26.4
pandas==2.2.2
protobuf==3.20.*
scikit_learn==1.4.2
scipy==1.13.0
tensorflow==2.8.0
torch==2.3.0
tqdm==4.66.2
transformers==4.40.1
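
To confirm that an environment matches the pins above, a small check like the following may help. It is a minimal sketch that uses only the standard library; the `PINNED` subset and the `check_pins` helper are illustrative names and are not part of this repository.

```python
from importlib import metadata

# A subset of the pins listed above (illustrative; extend as needed).
PINNED = {
    "numpy": "1.26.4",
    "pandas": "2.2.2",
    "tensorflow": "2.8.0",
    "torch": "2.3.0",
    "tqdm": "4.66.2",
    "transformers": "4.40.1",
}


def check_pins(pins, get_version=metadata.version):
    """Return {package: (expected, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, expected in pins.items():
        try:
            installed = get_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != expected:
            mismatches[name] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for name, (expected, installed) in check_pins(PINNED).items():
        print(f"{name}: expected {expected}, found {installed or 'not installed'}")
```

An empty result means every listed package is installed at the pinned version.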

To run LM-OGlcNAc-Site model on your own sequences

To predict O-GlcNAcylation sites in your own sequences, follow these steps:

  1. Copy sequences you want to predict to input/sequence.fasta
  2. Run python predict.py
  3. Find results inside the output folder in a csv file named results.csv
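
If the sequences come from another program rather than a hand-edited file, step 1 can be scripted. The sketch below writes a dictionary of sequences to `input/sequence.fasta` using only the standard library. The `write_fasta` helper is a hypothetical name, and the check against the 20 standard amino-acid letters is an assumption made here for safety, not a documented requirement of `predict.py`.

```python
from pathlib import Path

# Assumption: only the 20 standard amino-acid letters are accepted.
VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")


def write_fasta(records, path):
    """Write {header: sequence} pairs to a FASTA file, 60 residues per line."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    with path.open("w") as handle:
        for header, seq in records.items():
            seq = seq.strip().upper()
            unknown = set(seq) - VALID_RESIDUES
            if unknown:
                raise ValueError(f"{header}: unexpected residues {sorted(unknown)}")
            handle.write(f">{header}\n")
            # Wrap the sequence at 60 characters, a common FASTA convention.
            for start in range(0, len(seq), 60):
                handle.write(seq[start:start + 60] + "\n")
    return path


if __name__ == "__main__":
    # Placeholder header and sequence, for illustration only.
    write_fasta({"example_protein": "MSTAGKVIKCKAAVLWE"}, "input/sequence.fasta")
```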

Commands

Use the following command to specify the input and output files:

python predict.py --input [input_path] --output [output_path]

or in short form notation,

python predict.py -i [input_path] -o [output_path]

Replace:

  1. [input_path] with the path of the input file to run the model on. It must be a .fasta file.
  2. [output_path] with the path of the file the results are written to. It must be a .csv file.

Example:

python predict.py -i input.fasta -o output.csv
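
To run the same command over several FASTA files, a thin wrapper around it can help. The following is a sketch only: it relies on the `-i` and `-o` flags shown above, assumes it is run from the repository root where `predict.py` lives, and the `build_command` and `run_batch` helpers are hypothetical names that do not exist in the repository.

```python
import subprocess
import sys
from pathlib import Path


def build_command(input_path, output_path, script="predict.py"):
    """Build the predict.py command line, enforcing the file extensions."""
    input_path, output_path = Path(input_path), Path(output_path)
    if input_path.suffix != ".fasta":
        raise ValueError(f"input must be a .fasta file: {input_path}")
    if output_path.suffix != ".csv":
        raise ValueError(f"output must be a .csv file: {output_path}")
    return [sys.executable, script, "-i", str(input_path), "-o", str(output_path)]


def run_batch(fasta_dir, out_dir):
    """Run predict.py once per .fasta file, writing one .csv per input."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    for fasta in sorted(Path(fasta_dir).glob("*.fasta")):
        cmd = build_command(fasta, out_dir / f"{fasta.stem}.csv")
        subprocess.run(cmd, check=True)  # stop at the first failing run
```

For example, `run_batch("my_fastas", "my_results")` would process every `.fasta` file in `my_fastas` in sorted order; both directory names are placeholders.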

Note:

  1. You can always use the '-h' or '--help' flag to get detailed information about the available command-line arguments.
  2. You may also use the web server at http://kcdukkalab.org/LMOGlcNAcSite/

Citation

Pokharel, S.; Pratyush, P.; Ismail, H.D.; Ma, J.; KC, D.B. Integrating Embeddings from Multiple Protein Language Models to Improve Protein O-GlcNAc Site Prediction. Int. J. Mol. Sci. 2023, 24, 16000. https://doi.org/10.3390/ijms242116000

Paper Link: https://www.mdpi.com/1422-0067/24/21/16000

Contact

For any queries or discussions, please send an email to sureshp@mtu.edu (CC: dbkc@mtu.edu, ppratyus@mtu.edu).