blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Calculate variation of information (VI) between groups of tips with known clade labels and clusters #35

Closed: huddlej closed this issue 1 year ago

huddlej commented 1 year ago

Calculate variation of information (VI) between groups of tips with known clade labels and clusters assigned by HDBSCAN.
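For reference, a minimal sketch of the VI calculation itself, assuming each tip has exactly one clade label and one cluster label and using the standard definition VI = H(clade | cluster) + H(cluster | clade). The function name `variation_of_information` and the example labels are hypothetical, not code from this repository. HDBSCAN marks unclustered points with the label -1, and this sketch simply treats -1 as one more cluster.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI (in nats) between two labelings of the same tips.

    labels_a[i] and labels_b[i] must describe the same tip, e.g. its
    known clade and its HDBSCAN cluster.
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same tips.")

    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Each joint cell contributes to both conditional entropies.
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))

    return vi


# Hypothetical example: four tips, two clades, two clusters.
clades = ["A", "A", "B", "B"]
same_clusters = [0, 0, 1, 1]    # clusters match clades exactly
mixed_clusters = [0, 1, 0, 1]   # clusters are independent of clades

vi_same = variation_of_information(clades, same_clusters)
vi_mixed = variation_of_information(clades, mixed_clusters)
```

VI is 0 when the two partitions agree up to relabeling and grows as they diverge. This plain version treats all clades as equally distant from each other, so it does not yet use the hierarchy in the clade annotations.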

The approach should account for the hierarchical information in clade annotations, so that two tips assigned to different but closely related clades count as closer to each other than two tips assigned to different divergent clades.

huddlej commented 1 year ago

I tried calculating bootstrapped VI values to get confidence intervals, but you can't calculate VI when sampling with replacement: the same id ends up in the analysis multiple times, which breaks the set-based logic for VI. You can calculate CIs without replacement by sampling a fraction of the full data, but then the VI estimates are biased by the number of samples you include in the analysis.
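A rough sketch of the without-replacement approach, to make the setup concrete. It assumes VI is computed from per-tip label pairs as H(clade | cluster) + H(cluster | clade). The names `variation_of_information` and `subsampled_vi_interval`, the `fraction` and `n_replicates` parameters, and the percentile-style interval are all illustrative choices, not what the repository implements.

```python
import random
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b):
    """Return VI (in nats) between two labelings of the same tips."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))
    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        vi -= p_ab * (log(p_ab / (count_a[a] / n)) + log(p_ab / (count_b[b] / n)))
    return vi


def subsampled_vi_interval(labels_a, labels_b, fraction=0.5, n_replicates=100, seed=0):
    """Return (lower, upper, values) from VI on subsamples drawn without replacement."""
    rng = random.Random(seed)
    n = len(labels_a)
    k = max(1, int(fraction * n))

    values = []
    for _ in range(n_replicates):
        # random.sample draws distinct indices, so no tip id repeats in a replicate.
        idx = rng.sample(range(n), k)
        values.append(
            variation_of_information(
                [labels_a[i] for i in idx],
                [labels_b[i] for i in idx],
            )
        )

    values.sort()
    lower = values[int(0.025 * (n_replicates - 1))]
    upper = values[int(0.975 * (n_replicates - 1))]
    return lower, upper, values


# Hypothetical labels: 20 tips, two clades, clusters that partly disagree.
clades = ["A"] * 10 + ["B"] * 10
clusters = [0] * 8 + [1] * 12

lower, upper, values = subsampled_vi_interval(clades, clusters, fraction=0.5)
```

Running this over several `fraction` values and comparing the intervals against the VI of the full data is one way to see how strongly the estimate depends on the number of samples.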