Fix data leakage in CV in variant prediction example

facebookresearch / esm

Evolutionary Scale Modeling (esm): Pretrained language models for proteins

MIT License


Fix data leakage in CV in variant prediction example #147

kiramt closed this 2 years ago

kiramt commented 2 years ago

In the variant prediction example notebook `esm/examples/sup_variant_prediction.ipynb`, the top 60 principal components are selected and used to reduce the dimensionality of the training set. Because this happens before CV runs, the PCA is fit on samples that later serve as validation folds, so data leaks into the cross-validation sets. See https://github.com/facebookresearch/esm/discussions/140.

This PR moves the selection of the principal components inside the CV step, so the components are computed only from each fold's training portion.
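
For readers unfamiliar with the pattern, here is a minimal sketch of the idea, not the notebook's actual code. It assumes scikit-learn, uses random data as a stand-in for the embeddings, and picks `SVR` and the `C` grid purely for illustration. Placing `PCA` in a `Pipeline` means `GridSearchCV` refits it on the training portion of every fold.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVR

# Synthetic stand-in for per-sequence embeddings and a regression target
# (hypothetical data, not the notebook's dataset).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 128))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=200)

# PCA is a pipeline step, so it is fit only on each fold's training split.
pipe = Pipeline([
    ("pca", PCA(n_components=60)),
    ("model", SVR()),  # illustrative choice of regressor
])

# Hypothetical hyperparameter grid; step parameters use the "step__param" form.
param_grid = {"model__C": [0.1, 1.0, 10.0]}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="r2")
search.fit(X, y)

# After the search, the best pipeline is refit on all of X, including its PCA.
fitted_pca = search.best_estimator_.named_steps["pca"]
```

The leaky variant would call `PCA(n_components=60).fit_transform(X)` once up front and pass the reduced matrix to the search, letting every validation fold influence the components.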

tomsercu commented 2 years ago

Looks good to me. Thanks for improving our example notebook!