Changes to implement from: https://github.com/ChEB-AI/python-chebai/pull/39#issuecomment-2370456531
Thanks for implementing this.
- For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding.
- Separate `tokens.txt` files for each n-gram: Definitely, since they have different sets of tokens (tokens always have length n for each n-gram). This should happen automatically if you change the `name` property of the reader (see the sketch after this list).
- Vocabulary size: That is easy to fix: simply don't use a pretrained model. Since the pretraining has been done on SMILES, it makes no sense to use that model for protein sequences. (Maybe we will do pretraining for protein sequences in the future; then we will have to pretrain a model with `vocab_size=8000`.)

I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.
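A minimal sketch of the `name`-based separation mentioned above; the class and method names (`ProteinNGramReader`, `_read_data`) are illustrative assumptions, not the actual python-chebai reader API:

```python
# Sketch only, assuming a reader interface similar to python-chebai's data
# readers; class and attribute names here are illustrative, not the real API.
class ProteinNGramReader:
    def __init__(self, n_gram: int = 3):
        self.n_gram = n_gram
        self.cache: list[str] = []  # token vocabulary, later written to tokens.txt

    @property
    def name(self) -> str:
        # Encoding the n-gram size in the reader name gives each n-gram its
        # own directory, and therefore its own tokens.txt vocabulary file.
        return f"protein_token_{self.n_gram}_gram"

    def _read_data(self, sequence: str) -> list[int]:
        # Split the amino-acid sequence into overlapping n-grams and map each
        # n-gram token to its index in the per-n-gram vocabulary.
        tokens = [sequence[i:i + self.n_gram]
                  for i in range(len(sequence) - self.n_gram + 1)]
        indices = []
        for tok in tokens:
            if tok not in self.cache:
                self.cache.append(tok)
            indices.append(self.cache.index(tok))
        return indices


reader = ProteinNGramReader(n_gram=3)
print(reader.name)                      # protein_token_3_gram
print(reader._read_data("MKTAYIAKQR"))  # indices of the 8 overlapping 3-grams
```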
Config:

```yaml
class_path: chebai.preprocessing.datasets.go_uniprot.GOUniProtOver250
init_args:
  go_branch: "BP"
  reader_kwargs: {n_gram: 3}
```
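For reference, the same configuration expressed programmatically (a sketch; it assumes `GOUniProtOver250` accepts these keyword arguments directly, mirroring the YAML above):

```python
# Programmatic equivalent of the YAML config above (sketch; assumes the
# class accepts these keyword arguments directly).
from chebai.preprocessing.datasets.go_uniprot import GOUniProtOver250

data_module = GOUniProtOver250(
    go_branch="BP",               # restrict to the Biological Process branch
    reader_kwargs={"n_gram": 3},  # forwarded to the protein reader: 3-gram tokens
)
```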
I have completed the changes suggested in our last meeting. Please review.
The next steps here are:
- Pretraining: add a filter for sequence length as a hyperparameter (a sketch follows this list)
- Merge the feature branch into the dev branch
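A sketch of how such a sequence-length filter could look; the function and parameter names are hypothetical. Following the discussion above, the cut-off counts amino acids, so the same proteins are kept regardless of the n-gram encoding:

```python
# Hypothetical sketch of the proposed hyperparameter; names are assumptions.
def filter_by_sequence_length(proteins: dict[str, str],
                              max_sequence_length: int) -> dict[str, str]:
    """Keep proteins whose raw amino-acid sequence is at most the cut-off.

    The filter is applied to the plain sequence, before n-gram encoding,
    so the selected proteins are identical for every n-gram setting.
    """
    return {acc: seq for acc, seq in proteins.items()
            if len(seq) <= max_sequence_length}


# Example: the long sequence is dropped no matter which n-gram is used later.
kept = filter_by_sequence_length(
    {"P12345": "MKTAYIAKQR", "Q99999": "M" * 5000},
    max_sequence_length=1000,
)
print(list(kept))  # ['P12345']
```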
Merging this branch, as suggested in the comment https://github.com/ChEB-AI/python-chebai/issues/36#issuecomment-2447698109.
A new PR with the same branch will be created for the rest of the changes.
PR for Issue #36
Note: The above issue will be implemented in 2 PRs:
- #39 (merged)
- #57