We previously optimized method parameters by maximizing Pearson's correlation between genetic and Euclidean distances for a complete embedding. That approach is invalid because methods like MDS will always prefer additional dimensions/components, which account for more variance in the input data, so the optimization overfits. This PR replaces that approach with optimization by cross-validation. We fit full embeddings to get the truth set (the observed results), create train/test splits with time series cross-validation (splitting on "generation" from the simulations as the representation of time), fit embeddings to the training data, fit a linear model between Euclidean and genetic distances from the training embedding, use that model to estimate the Euclidean distances for the held-out test data, and calculate the mean squared error (MSE) between observed and estimated Euclidean distances for the test data. We then select as optimal the method parameters that minimize the MSE.
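The cross-validation steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: it uses a NumPy-only classical MDS as a stand-in for the real embedding methods, an expanding-window split on generation, pairs of test samples as the held-out distances, and random toy genotypes with Hamming distances. All function names (`classical_mds`, `time_series_splits`, `cv_mse`) are hypothetical.

```python
import numpy as np


def classical_mds(distances, n_components):
    """Embed a square distance matrix with classical (Torgerson) MDS."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (distances ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    top = np.argsort(eigenvalues)[::-1][:n_components]
    # Negative eigenvalues (non-Euclidean input) are clipped to zero.
    return eigenvectors[:, top] * np.sqrt(np.clip(eigenvalues[top], 0, None))


def pairwise_euclidean(points):
    """Return the square matrix of Euclidean distances between rows."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def time_series_splits(generations, n_splits):
    """Yield (train, test) indices; training generations precede test ones."""
    folds = np.array_split(np.unique(generations), n_splits + 1)
    for i in range(1, n_splits + 1):
        train_generations = np.concatenate(folds[:i])
        train = np.flatnonzero(np.isin(generations, train_generations))
        test = np.flatnonzero(np.isin(generations, folds[i]))
        yield train, test


def cv_mse(genetic, generations, n_components, n_splits=3):
    """Mean cross-validated MSE for one choice of method parameters."""
    # Truth set: Euclidean distances observed in the full embedding.
    observed = pairwise_euclidean(classical_mds(genetic, n_components))
    errors = []
    for train, test in time_series_splits(generations, n_splits):
        train_genetic = genetic[np.ix_(train, train)]
        train_euclidean = pairwise_euclidean(
            classical_mds(train_genetic, n_components)
        )
        # Linear model: Euclidean distance as a function of genetic distance.
        upper = np.triu_indices(len(train), k=1)
        slope, intercept = np.polyfit(
            train_genetic[upper], train_euclidean[upper], 1
        )
        # Assumption: held-out distances are those between pairs of test samples.
        test_upper = np.triu_indices(len(test), k=1)
        estimated = slope * genetic[np.ix_(test, test)][test_upper] + intercept
        residuals = observed[np.ix_(test, test)][test_upper] - estimated
        errors.append(np.mean(residuals ** 2))
    return float(np.mean(errors))


# Toy data (assumption): 8 generations of 5 samples, 50 random binary sites.
rng = np.random.default_rng(0)
generations = np.repeat(np.arange(8), 5)
genotypes = rng.integers(0, 2, size=(40, 50))
genetic = (genotypes[:, None, :] != genotypes[None, :, :]).sum(axis=-1).astype(float)

# Select the parameter value (here, the number of components) with minimal MSE.
scores = {k: cv_mse(genetic, generations, k) for k in (1, 2, 3)}
best = min(scores, key=scores.get)
print(scores, best)
```

The number of components is the only parameter searched here; for other methods the same loop would run over whichever parameters are being tuned.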