ChEB-AI / python-chebai

GNU Affero General Public License v3.0

Protein function prediction with GO - Part 2 #57

Closed aditya0by0 closed 3 weeks ago

aditya0by0 commented 1 month ago

Note: This issue will be implemented in 2 PRs:

Tasks

aditya0by0 commented 1 month ago

Changes to implement from: https://github.com/ChEB-AI/python-chebai/pull/39#issuecomment-2370456531

Thanks for implementing this.

  • For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding.
  • Separate tokens.txt files for each n-gram: Definitely, since they have different sets of tokens (tokens always have length n for each n-gram). This should happen automatically if you change the name property of the reader.
  • Vocabulary size: That is easy to fix: Simply don't use a pretrained model. Since the pretraining has been done on SMILES, it makes no sense to use that model for protein sequences. (Maybe we will do pretraining for protein sequences in the future, then we will have to pretrain a model with vocab_size=8000)
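The first two points could be sketched as follows. This is a hypothetical illustration, not the actual chebai API: the helper and class names are assumptions. It filters by raw amino-acid count before encoding (so the cutoff is encoding-independent) and derives a distinct token-file name per n-gram size from the reader's `name` property.

```python
# Hypothetical sketch (not the actual chebai classes): filter proteins by
# amino-acid count *before* n-gram encoding, and give each n-gram reader
# its own name so token files for different n never mix.

def filter_by_length(sequences, max_len):
    """Keep sequences with at most max_len amino acids.

    Filtering on raw residues (not on encoded tokens) means the same
    proteins survive the cutoff regardless of the n-gram size used later.
    """
    return [seq for seq in sequences if len(seq) <= max_len]


class ProteinNGramReader:
    """Minimal reader sketch: one tokens.txt per n-gram size."""

    def __init__(self, n_gram=3):
        self.n_gram = n_gram

    @property
    def name(self):
        # A distinct name per n yields a distinct tokens.txt directory,
        # e.g. protein_3gram/tokens.txt vs protein_2gram/tokens.txt.
        return f"protein_{self.n_gram}gram"

    def tokenize(self, seq):
        # Sliding window: every token has exactly n residues.
        n = self.n_gram
        return [seq[i:i + n] for i in range(len(seq) - n + 1)]
```

For example, `filter_by_length(["MKT", "MKTAYIAK"], 5)` keeps only `"MKT"`, and `ProteinNGramReader(n_gram=3).tokenize("MKTAY")` yields `["MKT", "KTA", "TAY"]`.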

I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.

aditya0by0 commented 1 month ago

Config:

class_path: chebai.preprocessing.datasets.go_uniprot.GOUniProtOver250
init_args:
  go_branch: "BP"
  reader_kwargs: {n_gram: 3}
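For readers unfamiliar with this config style: assuming a jsonargparse/Lightning-style `class_path`/`init_args` scheme, the YAML above resolves to a direct constructor call. A minimal sketch of that resolution (the resolver function here is illustrative, not chebai code):

```python
# Sketch of how a class_path/init_args config resolves to an instance
# (assumption: a jsonargparse/LightningCLI-style scheme, as chebai uses).
import importlib


def instantiate(class_path, init_args):
    """Resolve 'pkg.module.Class' and call it with the given kwargs."""
    module_name, cls_name = class_path.rsplit(".", 1)
    cls = getattr(importlib.import_module(module_name), cls_name)
    return cls(**init_args)


# The config above is then equivalent to the direct call:
# GOUniProtOver250(go_branch="BP", reader_kwargs={"n_gram": 3})
```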
aditya0by0 commented 1 month ago

I have completed the changes suggested in our last meeting. Please review.

aditya0by0 commented 3 weeks ago

The next steps here are:

  • Pretraining: add a filter for sequence length as a hyperparameter
  • Merge the feature branch into the dev branch
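The pretraining step above could look roughly like this. A minimal sketch under stated assumptions: the class and parameter names are hypothetical, not the actual chebai dataset classes.

```python
# Hypothetical sketch: expose the amino-acid length cutoff as a dataset
# hyperparameter so pretraining runs can vary it. Names are assumptions,
# not the actual chebai API.

class ProteinPretrainDataset:
    def __init__(self, sequences, max_sequence_length=1000):
        # Filter on raw amino-acid count, so the cutoff is independent
        # of whatever n-gram encoding is applied afterwards.
        self.max_sequence_length = max_sequence_length
        self.sequences = [s for s in sequences
                          if len(s) <= max_sequence_length]

    def __len__(self):
        return len(self.sequences)
```

A sweep over the hyperparameter is then just `ProteinPretrainDataset(seqs, max_sequence_length=n)` for each candidate cutoff `n`.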

Merging this branch, as suggested in comment https://github.com/ChEB-AI/python-chebai/issues/36#issuecomment-2447698109

A new PR with the same branch will be created for the rest of the changes.