How to optimize the hyperparameters in the ProteinNPT training process

Meron-TANG commented 4 months ago

Hi there,

Could you provide more details about the training setup? In the ProteinNPT paper, you mention optimizing the model's hyperparameters using a validation set. However, the ProteinGym benchmark dataset only provides a 5-fold split for each assay. Could you clarify whether you optimized the hyperparameters using the validation set from just one assay and then applied them to the others? If so, which assay's validation set did you use, and how did you split that dataset? If not, what validation approach did you use?

Thank you

pascalnotin commented 3 months ago

Hi @Meron-TANG,

We used a representative subset of 8 assays (referenced in Appendix B2 of the paper) to select hyperparameters during model development. We looked at the 3 different 5-fold CV schemes (random, modulo, contiguous) on this subset of assays. We then consistently used these final hyperparameters (compiled in our model config file) to train on and score all 217 assays from the ProteinGym substitution benchmark.

Best,
Pascal
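
For readers who want to picture the procedure described above, here is a minimal Python sketch. It is not the ProteinNPT code. It assumes that "random" assigns each mutated position to a fold at random, "modulo" assigns position `p` to fold `p % 5`, and "contiguous" cuts the sequence into 5 consecutive segments. The function names, the `evaluate` callback, and the choice to average scores over assays, schemes, and folds are all hypothetical; the thread does not say how the results were combined.

```python
import random
from statistics import mean

N_FOLDS = 5

def random_folds(positions, seed=0):
    """Assumed 'random' scheme: each position gets a uniformly random fold."""
    rng = random.Random(seed)
    return [rng.randrange(N_FOLDS) for _ in positions]

def modulo_folds(positions):
    """Assumed 'modulo' scheme: fold index is the position modulo 5."""
    return [p % N_FOLDS for p in positions]

def contiguous_folds(positions, seq_len):
    """Assumed 'contiguous' scheme: 5 consecutive segments of the sequence.

    Positions are taken to be 1-indexed.
    """
    return [min((p - 1) * N_FOLDS // seq_len, N_FOLDS - 1) for p in positions]

def select_hypers(candidates, assays, schemes, evaluate):
    """Return the candidate config with the best mean validation score.

    `evaluate(cfg, assay, scheme, fold)` is a hypothetical callback that
    trains on the other four folds and returns a score (higher is better)
    on the held-out fold.
    """
    def avg_score(cfg):
        return mean(
            evaluate(cfg, assay, scheme, fold)
            for assay in assays
            for scheme in schemes
            for fold in range(N_FOLDS)
        )
    return max(candidates, key=avg_score)
```

In this sketch, `assays` would be the small development subset, and the single config returned by `select_hypers` would then be reused unchanged for every assay in the full benchmark.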

Meron-TANG commented 3 months ago

thank you @pascalnotin !

OATML-Markslab / ProteinNPT, issue #16