ChEB-AI / python-chebai

GNU Affero General Public License v3.0

Protein function prediction with GO #36

Open sfluegel05 opened 2 months ago

sfluegel05 commented 2 months ago

Until now, we have only used our framework for ChEBI, but in principle, it should also be applicable to other datasets and prediction tasks. One such task is the prediction of protein functions as specified by the Gene Ontology (GO), in combination with protein data from UniProtKB. For orientation, we can use the DeepGO paper, which proposes a solution for this exact task. The goal is to apply our model to the GO / UniProtKB datasets and compare the results to those of DeepGO.

Tasks

  1. Minimal dataset implementation: Build a dataset class that extracts proteins and labels from UniProtKB / GO and processes them into a dataset that can be used to train Electra (one-hot encoding of trigrams of amino acids)
  2. Model training and evaluation: Evaluate using the same metrics as DeepGO for comparing the models
  3. Additional features and finetuning: Pretraining with additional unlabeled protein data, trained input embeddings, hyperparameters
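
To make the encoding in task 1 concrete, here is a minimal sketch of turning a protein sequence into trigram indices and one-hot vectors. It is not the chebai implementation: the names `AMINO_ACIDS`, `TRIGRAM_TO_INDEX`, `sequence_to_trigram_ids` and `one_hot` are hypothetical, and the use of overlapping trigrams over the 20 standard amino acids, with index 0 reserved for padding, is an assumption.

```python
from itertools import product

# Assumption: only the 20 standard amino acids (one-letter codes) are encoded.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Hypothetical vocabulary: every possible trigram gets an integer index.
# Index 0 is reserved for padding, so real trigrams start at 1.
TRIGRAM_TO_INDEX = {
    "".join(trigram): index + 1
    for index, trigram in enumerate(product(AMINO_ACIDS, repeat=3))
}


def sequence_to_trigram_ids(sequence):
    """Map a protein sequence to the indices of its overlapping trigrams."""
    return [
        TRIGRAM_TO_INDEX[sequence[i:i + 3]]
        for i in range(len(sequence) - 2)
    ]


def one_hot(index, size=len(TRIGRAM_TO_INDEX) + 1):
    """Return a one-hot list of length `size` with a 1 at position `index`."""
    vector = [0] * size
    vector[index] = 1
    return vector


if __name__ == "__main__":
    ids = sequence_to_trigram_ids("ACDE")  # trigrams: ACD, CDE
    print(ids)
    print(sum(one_hot(ids[0])))
```

In practice the integer indices would probably be passed to the model directly, since an embedding layer is equivalent to multiplying a one-hot vector by a weight matrix; the explicit `one_hot` helper is only there to show the encoding.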

aditya0by0 commented 4 days ago

Protein Preprocessing Statistics

These are the statistics for the proteins that were ignored during preprocessing, either because they contain invalid amino acids or because their sequence length exceeds 1002, as per the guidelines outlined in the paper:

The number of ignored proteins is negligible compared to the size of the whole dataset.
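
For readers who want to reproduce the filtering, the two criteria above can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the function names, the reason labels and the example protein IDs are hypothetical, and treating only the 20 standard one-letter codes as valid is an assumption.

```python
# Assumption: "valid" means one of the 20 standard amino acid codes.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
# Length limit taken from the comment above.
MAX_SEQUENCE_LENGTH = 1002


def rejection_reason(sequence):
    """Return why a sequence is ignored, or None if it is kept."""
    if len(sequence) > MAX_SEQUENCE_LENGTH:
        return "too_long"
    if not set(sequence) <= VALID_AMINO_ACIDS:
        return "invalid_amino_acid"
    return None


def split_proteins(proteins):
    """Split {protein_id: sequence} into kept sequences and ignored reasons."""
    kept, ignored = {}, {}
    for protein_id, sequence in proteins.items():
        reason = rejection_reason(sequence)
        if reason is None:
            kept[protein_id] = sequence
        else:
            ignored[protein_id] = reason
    return kept, ignored


if __name__ == "__main__":
    # Hypothetical IDs and sequences, for illustration only.
    example = {
        "P_OK": "MKTAYIAKQR",
        "P_AMBIGUOUS": "MKXAYIAKQR",
        "P_LONG": "A" * 1003,
    }
    kept, ignored = split_proteins(example)
    print(sorted(kept), ignored)
```

Note that a sequence which is both too long and contains an invalid character is reported only as `too_long` here, so per-reason counts depend on the order of the checks.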

For reference, I have attached a CSV file listing the IDs of the ignored proteins: proteins_with_issues.csv