blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Compare MDS in flu with 2 vs 4 components in grid search #11

Closed huddlej closed 3 years ago

huddlej commented 3 years ago

MDS with 2 components produces noticeably different embeddings and HDBSCAN clusters than MDS with 4 components.

For example, this is the seasonal flu MDS embedding with 2 components:

[Image: seasonal flu MDS embedding with 2 components]

And this is the MDS embedding for the same data with 4 components (only first two are shown, but note differences in clusters):

[Image: seasonal flu MDS embedding with 4 components, first two components shown]

The correlation between genetic distance and embedding distance is much higher for the 4-component embedding (as we would expect), but we don't know whether this embedding produces more accurate clusters.
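For anyone who wants to reproduce this kind of comparison on a small scale, here is a minimal sketch of how the correlation between genetic distance and embedding distance could be computed for 2 and 4 components. It assumes Hamming distance as the genetic distance and uses classical (Torgerson) MDS written in plain NumPy as a stand-in, which is not necessarily the MDS implementation the workflow uses. The random toy sequences and all function names are hypothetical.

```python
import numpy as np


def hamming_distance_matrix(sequences):
    """Pairwise count of mismatched positions between equal-length sequences."""
    arr = np.array([list(seq) for seq in sequences])
    return (arr[:, None, :] != arr[None, :, :]).sum(axis=2).astype(float)


def classical_mds(distances, n_components):
    """Classical (Torgerson) MDS; a stand-in for the workflow's MDS step."""
    n = distances.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    # Double-center the squared distances to get a Gram matrix.
    gram = -0.5 * centering @ (distances ** 2) @ centering
    eigenvalues, eigenvectors = np.linalg.eigh(gram)
    # Keep the largest eigenvalues; clip negatives from non-Euclidean input.
    order = np.argsort(eigenvalues)[::-1][:n_components]
    scale = np.sqrt(np.clip(eigenvalues[order], 0.0, None))
    return eigenvectors[:, order] * scale


def correlate_distances(distances, embedding):
    """Pearson correlation between input and Euclidean embedding distances."""
    upper = np.triu_indices(distances.shape[0], k=1)
    diffs = embedding[:, None, :] - embedding[None, :, :]
    embedding_distances = np.sqrt((diffs ** 2).sum(axis=2))
    return float(np.corrcoef(distances[upper], embedding_distances[upper])[0, 1])


# Toy data (hypothetical): 30 random nucleotide sequences of length 50.
rng = np.random.default_rng(0)
sequences = ["".join(rng.choice(list("ACGT"), size=50)) for _ in range(30)]
distances = hamming_distance_matrix(sequences)

correlations = {}
for n_components in (2, 4):
    embedding = classical_mds(distances, n_components)
    correlations[n_components] = correlate_distances(distances, embedding)
    print(f"n_components={n_components}: Pearson r = {correlations[n_components]:.3f}")
```

A higher correlation here only says that the embedding preserves pairwise distances better; it says nothing about cluster accuracy, which is why the grid search is still needed.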

We should update the grid search parameters file for seasonal flu's training data to include a column for MDS's n_components and re-run the grid search with these different values. We should also update the script that summarizes grid search results so that it identifies the optimal number of MDS components from validation MCC, as we already do for t-SNE and UMAP parameters. Then we should re-run the full MDS embedding with the optimal value and update the manuscript accordingly.
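The selection step could look roughly like the sketch below: group grid search rows by n_components, average validation MCC across folds, and keep the value with the highest mean. The column names (`method`, `n_components`, `fold`, `validation_mcc`) and the MCC numbers are made-up placeholders for illustration, not the actual output format or results of the grid search.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Hypothetical grid search output; the values are placeholders, not real results.
GRID_SEARCH_CSV = """method,n_components,fold,validation_mcc
mds,2,0,0.61
mds,2,1,0.65
mds,4,0,0.70
mds,4,1,0.68
mds,6,0,0.66
mds,6,1,0.69
"""


def best_n_components(rows, method="mds"):
    """Return the n_components with the highest mean validation MCC."""
    mcc_by_components = defaultdict(list)
    for row in rows:
        if row["method"] == method:
            n_components = int(row["n_components"])
            mcc_by_components[n_components].append(float(row["validation_mcc"]))
    # Average across folds so that one lucky fold does not decide the winner.
    mean_mcc = {k: mean(v) for k, v in mcc_by_components.items()}
    best = max(mean_mcc, key=mean_mcc.get)
    return best, mean_mcc


rows = list(csv.DictReader(io.StringIO(GRID_SEARCH_CSV)))
best, mean_mcc = best_n_components(rows)
print(f"best n_components for MDS: {best}")
```

Averaging over folds mirrors how a single best value would be picked per method; if the real summary script uses the median or another statistic for t-SNE and UMAP, the same statistic should be used for MDS.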

nandsra21 commented 3 years ago

Just pushed a fix for this to the four main workflows, as well as to summarize-grid-search. Whatever the best value turns out to be can be passed to the embed script manually.