Changes to implement from: https://github.com/ChEB-AI/python-chebai/pull/39#issuecomment-2370456531
Thanks for implementing this.
- For the sequence length: I would expect the maximum sequence length to refer to the number of amino acids. That way, the same proteins are included in the dataset for a given sequence length, no matter the encoding.
- Separate `tokens.txt` files for each n-gram: Definitely, since they have different sets of tokens (tokens always have length n for each n-gram). This should happen automatically if you change the `name` property of the reader (see the sketch after this list).
- Vocabulary size: That is easy to fix: simply don't use a pretrained model. Since the pretraining has been done on SMILES, it makes no sense to use that model for protein sequences. (Maybe we will do pretraining for protein sequences in the future; then we will have to pretrain a model with `vocab_size=8000`.)

I will merge this so we can use the classes for other PRs. Please open a new PR for this branch if you have new changes.
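A minimal sketch of the `name`-based separation mentioned above; the class and method names (`ProteinNGramReader`, `_read_data`) are illustrative assumptions, not the actual python-chebai reader API:

```python
# Sketch only, assuming a reader interface similar to python-chebai's data
# readers; class and attribute names here are illustrative, not the real API.
class ProteinNGramReader:
    def __init__(self, n_gram: int = 3):
        self.n_gram = n_gram
        self.cache: list[str] = []  # token vocabulary, later written to tokens.txt

    @property
    def name(self) -> str:
        # Encoding the n-gram size in the reader name gives each n-gram its
        # own directory, and therefore its own tokens.txt vocabulary file.
        return f"protein_token_{self.n_gram}_gram"

    def _read_data(self, sequence: str) -> list[int]:
        # Split the amino-acid sequence into overlapping n-grams and map each
        # n-gram token to its index in the per-n-gram vocabulary.
        tokens = [sequence[i:i + self.n_gram]
                  for i in range(len(sequence) - self.n_gram + 1)]
        indices = []
        for tok in tokens:
            if tok not in self.cache:
                self.cache.append(tok)
            indices.append(self.cache.index(tok))
        return indices


reader = ProteinNGramReader(n_gram=3)
print(reader.name)                      # protein_token_3_gram
print(reader._read_data("MKTAYIAKQR"))  # indices of the 8 overlapping 3-grams
```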
Config:

```yaml
class_path: chebai.preprocessing.datasets.go_uniprot.GOUniProtOver250
init_args:
  go_branch: "BP"
  reader_kwargs: {n_gram: 3}
```
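For reference, the same configuration expressed programmatically (a sketch; it assumes `GOUniProtOver250` accepts these keyword arguments directly, mirroring the YAML above):

```python
# Programmatic equivalent of the YAML config above (sketch; assumes the
# class accepts these keyword arguments directly).
from chebai.preprocessing.datasets.go_uniprot import GOUniProtOver250

data_module = GOUniProtOver250(
    go_branch="BP",               # restrict to the Biological Process branch
    reader_kwargs={"n_gram": 3},  # forwarded to the protein reader: 3-gram tokens
)
```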
I have completed the changes suggested in our last meeting. Please review.
The next steps here are:
- Pretraining: add a filter for sequence length as a hyperparameter (a sketch follows this list)
- Merge the feature branch into the dev branch
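A sketch of how such a sequence-length filter could look; the function and parameter names are hypothetical. Following the discussion above, the cut-off counts amino acids, so the same proteins are kept regardless of the n-gram encoding:

```python
# Hypothetical sketch of the proposed hyperparameter; names are assumptions.
def filter_by_sequence_length(proteins: dict[str, str],
                              max_sequence_length: int) -> dict[str, str]:
    """Keep proteins whose raw amino-acid sequence is at most the cut-off.

    The filter is applied to the plain sequence, before n-gram encoding,
    so the selected proteins are identical for every n-gram setting.
    """
    return {acc: seq for acc, seq in proteins.items()
            if len(seq) <= max_sequence_length}


# Example: the long sequence is dropped no matter which n-gram is used later.
kept = filter_by_sequence_length(
    {"P12345": "MKTAYIAKQR", "Q99999": "M" * 5000},
    max_sequence_length=1000,
)
print(list(kept))  # ['P12345']
```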
Merging this branch, as suggested in the comment https://github.com/ChEB-AI/python-chebai/issues/36#issuecomment-2447698109.
A new PR with the same branch will be created for the rest of the changes.
PR for Issue #36
Note: The above issue will be implemented in 2 PRs:
- #39 (merged)
- #57