MDS with 2 components produces noticeably different embeddings and HDBSCAN clusters than MDS with 4 components.
For example, this is the seasonal flu MDS embedding with 2 components:
And this is the MDS embedding for the same data with 4 components (only the first two are shown, but note the differences in clusters):
The correlation between genetic distance and embedding distance is much higher for the 4-component embedding (as we would expect), but we don't know whether this embedding produces more accurate clusters.
We should update the grid search parameters file for seasonal flu's training data to include a column for MDS's `n_components` and re-run the grid search across those values. We should also update the script that summarizes grid search results so that it identifies the optimal number of MDS components from validation MCC, as we already do for the t-SNE and UMAP parameters. Then we should re-run the full MDS embedding with the optimal value and update the manuscript accordingly.
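As a rough illustration of the summary step, here is a minimal sketch of picking the number of components with the highest mean validation MCC across folds. The column names (`method`, `n_components`, `fold`, `validation_mcc`), the helper `best_n_components`, and the MCC values are all hypothetical placeholders, not the actual grid search output or the real summary script.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical grid search output: one row per (method, n_components, fold).
# Column names and MCC values are illustrative placeholders, not real results.
EXAMPLE_RESULTS = """method,n_components,fold,validation_mcc
mds,2,0,0.50
mds,2,1,0.60
mds,4,0,0.70
mds,4,1,0.80
mds,6,0,0.65
mds,6,1,0.75
"""


def best_n_components(rows, method="mds"):
    """Return (n_components, mean validation MCC) for the best-scoring value."""
    mcc_by_components = defaultdict(list)
    for row in rows:
        if row["method"] == method:
            n = int(row["n_components"])
            mcc_by_components[n].append(float(row["validation_mcc"]))
    if not mcc_by_components:
        raise ValueError(f"no rows found for method {method!r}")
    mean_mcc = {n: mean(values) for n, values in mcc_by_components.items()}
    # Highest mean MCC wins; ties go to the smaller number of components.
    best = min(mean_mcc, key=lambda n: (-mean_mcc[n], n))
    return best, mean_mcc[best]


rows = list(csv.DictReader(io.StringIO(EXAMPLE_RESULTS)))
best, score = best_n_components(rows)
print(f"best n_components={best}, mean validation MCC={score:.2f}")
```

Averaging over folds before comparing keeps a single lucky fold from deciding the result, and preferring fewer components on ties is just one possible convention.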
I just pushed a fix for this to the four main workflows, as well as summarize-grid-search. Whatever the best value turns out to be can then be entered into the embed script manually.