blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Optimize parameters with simulations and cross-validation #28

Closed huddlej closed 1 year ago

huddlej commented 1 year ago

We previously optimized method parameters by maximizing the Pearson correlation between genetic and Euclidean distances for a complete embedding. That approach is invalid, though, because methods like MDS will always prefer additional dimensions/components, which account for more variance in the input data, so the optimization overfits. This PR replaces the previous approach with optimization by cross-validation:

1. Fit full embeddings to get the truth set (the observed results).
2. Create train/test splits with time series cross-validation, splitting on "generation" from simulations as the representation of time.
3. Fit embeddings with the training data.
4. Fit a linear model between Euclidean and genetic distances from the training embedding.
5. Estimate the Euclidean distances for the held-out test data.
6. Calculate the mean squared error (MSE) between observed and estimated Euclidean distances for the test data.

We then select as optimal the method parameters that minimize the MSE.
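For reviewers who want to see the shape of the procedure, here is a minimal standard-library sketch of the cross-validation loop described above. It is not the code in this PR. The names `hamming`, `fit_line`, `time_series_splits`, `cross_validation_mse`, and `toy_embed` are hypothetical. The sketch assumes Hamming distance as the genetic distance, an expanding-window split over sorted generations, and test pairs drawn only from within each test block. `toy_embed` is a stand-in for a real embedding method such as MDS.

```python
from itertools import combinations
from math import dist
from statistics import mean


def hamming(a, b):
    """Genetic distance as the number of mismatched sites (an assumption)."""
    return sum(x != y for x, y in zip(a, b))


def pairwise(items, distance):
    """Distances for all unordered pairs, in a fixed order."""
    return [distance(a, b) for a, b in combinations(items, 2)]


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept (xs must vary)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def time_series_splits(generations, n_splits):
    """Yield (train, test) index lists with an expanding training window.

    Trailing generations that do not fill a whole block are left unused.
    """
    gens = sorted(set(generations))
    size = len(gens) // (n_splits + 1)
    for k in range(1, n_splits + 1):
        train_gens = set(gens[: k * size])
        test_gens = set(gens[k * size : (k + 1) * size])
        train = [i for i, g in enumerate(generations) if g in train_gens]
        test = [i for i, g in enumerate(generations) if g in test_gens]
        yield train, test


def cross_validation_mse(sequences, generations, embed, n_splits=3):
    """MSE between observed and estimated Euclidean distances on test data."""
    full = embed(sequences)  # full embedding gives the observed distances
    errors = []
    for train, test in time_series_splits(generations, n_splits):
        train_seqs = [sequences[i] for i in train]
        test_seqs = [sequences[i] for i in test]
        # Fit the training embedding, then Euclidean ~ genetic distance.
        slope, intercept = fit_line(
            pairwise(train_seqs, hamming),
            pairwise(embed(train_seqs), dist),
        )
        # Estimate held-out Euclidean distances from genetic distances.
        observed = pairwise([full[i] for i in test], dist)
        for g, o in zip(pairwise(test_seqs, hamming), observed):
            errors.append((slope * g + intercept - o) ** 2)
    return mean(errors)


def toy_embed(seqs):
    """Hypothetical 1-D stand-in for a real embedding method."""
    return [(2.0 * s.count("C"),) for s in seqs]


# Toy data: eight sequences, two per simulated generation.
sequences = ["C" * i + "A" * (8 - i) for i in range(8)]
generations = [i // 2 for i in range(8)]
mse = cross_validation_mse(sequences, generations, toy_embed, n_splits=1)
```

Selecting parameters then amounts to computing this MSE for each candidate parameter set and keeping the one with the smallest value, for example with `min(candidates, key=...)` over a hypothetical dictionary of candidate embedding functions.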