Open sfluegel05 opened 2 months ago
These are the statistics for the proteins that were ignored during preprocessing because they contain invalid amino acids or have a sequence length greater than 1002, following the guidelines outlined in the paper:
The number of ignored proteins is negligible compared to the size of the whole dataset.
I have attached a CSV file listing the IDs of the ignored proteins for reference: proteins_with_issues.csv
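For anyone who wants to reproduce the filter, here is a minimal sketch of the two checks described in this comment. It is not the code from our preprocessing pipeline: the function names are hypothetical, and it assumes that "valid" means the 20 standard one-letter amino acid codes, so the alphabet should be adjusted if the paper's guidelines differ.

```python
from collections import Counter

# Assumption: "valid" means the 20 standard amino acid one-letter codes.
VALID_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")
# Length threshold quoted in this thread.
MAX_SEQUENCE_LENGTH = 1002


def rejection_reason(sequence, max_len=MAX_SEQUENCE_LENGTH):
    """Return why a sequence would be ignored, or None if it is kept.

    Hypothetical helper. A sequence that fails both checks is
    reported as "too_long" because the length check runs first.
    """
    if len(sequence) > max_len:
        return "too_long"
    if not set(sequence) <= VALID_AMINO_ACIDS:
        return "invalid_amino_acid"
    return None


def split_proteins(proteins):
    """Split a {protein_id: sequence} dict into kept and ignored entries."""
    kept, ignored = {}, {}
    for protein_id, sequence in proteins.items():
        reason = rejection_reason(sequence)
        if reason is None:
            kept[protein_id] = sequence
        else:
            ignored[protein_id] = reason
    return kept, ignored


# Toy data for illustration only, not real UniProtKB entries.
example = {
    "P1": "MKTAYIAKQR",   # kept
    "P2": "MKXAYIAKQR",   # contains the ambiguous code X
    "P3": "A" * 1003,     # one residue over the length limit
}
kept, ignored = split_proteins(example)
counts = Counter(ignored.values())  # number of ignored proteins per reason
```

Counting the reasons this way gives one number per category, which is the kind of breakdown the statistics above report.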
So far, we have only used our framework for ChEBI, but in principle, it should also be applicable to other datasets and prediction tasks. One such task is the prediction of protein functions as specified by the Gene Ontology, in combination with protein data from UniProtKB. For orientation, we can use the DeepGO paper, which proposes a solution for this exact task. The goal is to apply our model to the GO / UniProtKB datasets and compare the results to those of DeepGO.
Tasks