Protein function prediction with GO

ChEB-AI / python-chebai

GNU Affero General Public License v3.0

Tasks

Minimal dataset implementation: Build a dataset class that extracts proteins and their labels from UniProtKB / GO and processes them into a dataset suitable for training Electra (one-hot encoding of amino acid trigrams)

Model training and evaluation: Evaluate with the same metrics as DeepGO so that the models can be compared

Additional features and fine-tuning: Pretraining with additional unlabeled protein data, trained input embeddings, hyperparameters
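
The trigram encoding mentioned in the first task could be sketched as follows. This is only an illustration, not the repository's implementation: the alphabet of 20 standard amino acids, the names `TRIGRAM_TO_INDEX` and `sequence_to_trigram_ids`, and the choice to reserve index 0 for padding are all assumptions. Each overlapping trigram is mapped to an integer index, which is the position of the 1 in the corresponding one-hot vector.

```python
from itertools import product

# Assumption: only the 20 standard amino acids count as valid residues.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map every possible trigram (20^3 = 8000) to an integer index.
# Index 0 is left unused so it can serve as a padding value (assumption).
TRIGRAM_TO_INDEX = {
    "".join(trigram): index + 1
    for index, trigram in enumerate(product(AMINO_ACIDS, repeat=3))
}


def sequence_to_trigram_ids(sequence):
    """Return the indices of all overlapping trigrams in a protein sequence.

    Raises KeyError if the sequence contains a residue outside AMINO_ACIDS.
    """
    return [
        TRIGRAM_TO_INDEX[sequence[i:i + 3]]
        for i in range(len(sequence) - 2)
    ]
```

With overlapping trigrams, a sequence of n residues yields n - 2 indices, so a sequence of length 1002 produces 1000 trigrams.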

Protein Preprocessing Statistics

These statistics cover the proteins that were ignored during preprocessing because they contain invalid amino acids or have a sequence length greater than 1002, following the guidelines outlined in the paper:

Number of proteins with invalid amino acids: 2,672 (0.47% of the dataset)
Number of proteins with sequence length greater than 1002: 19,004 (3.32% of the dataset)
Number of proteins with both invalid amino acids and length greater than 1002: 154
Total number of ignored proteins (either condition): 21,522 (3.76% of the dataset)
Original dataset size: 571,864 proteins

The ignored proteins make up only 3.76% of the dataset, so excluding them has little effect on its size.

The attached CSV file, proteins_with_issues.csv, lists the IDs of the ignored proteins for reference.
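
The reported counts are consistent with each other: 2,672 + 19,004 - 154 = 21,522. A minimal sketch of how such statistics could be computed is shown below. It is not the code used to produce the numbers above; the function name `preprocessing_stats`, the input format (a dict from protein ID to sequence), and the assumption that only the 20 standard amino acids are valid are all hypothetical.

```python
# Assumption: only the 20 standard amino acids count as valid residues.
VALID_AMINO_ACIDS = set("ACDEFGHIKLMNPQRSTVWY")
MAX_LENGTH = 1002  # length cutoff stated in the statistics above


def preprocessing_stats(proteins):
    """Count ignored proteins.

    proteins: dict mapping protein ID -> amino acid sequence (hypothetical format).
    Returns a dict of counts and the sorted list of ignored protein IDs.
    """
    # Proteins containing at least one residue outside the valid alphabet.
    invalid = {
        pid for pid, seq in proteins.items()
        if not set(seq) <= VALID_AMINO_ACIDS
    }
    # Proteins longer than the cutoff.
    too_long = {
        pid for pid, seq in proteins.items()
        if len(seq) > MAX_LENGTH
    }
    # A protein is ignored if either condition holds.
    ignored = invalid | too_long
    stats = {
        "invalid": len(invalid),
        "too_long": len(too_long),
        "both": len(invalid & too_long),
        "ignored": len(ignored),
        "total": len(proteins),
    }
    return stats, sorted(ignored)
```

The returned ID list could then be written to a CSV file such as the one attached.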

Protein function prediction with GO #36