blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Determine the effect of sampling bias (e.g., regional or clade composition bias) on embeddings and cluster accuracy #82

Closed by huddlej 6 months ago

huddlej commented 6 months ago

In addition to our analysis of how the number of sequences in each subsample affects the variation of information (VI) distances for the late flu and SARS-CoV-2 (SC2) data, we should also quantify the effect of sampling bias.
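For reference, here is a minimal sketch of how a VI distance between two labelings of the same sequences (for example, cluster labels and clade labels) could be computed. The function name `variation_of_information` and the choice to normalize by log(n) are assumptions made for illustration and may not match what the workflow actually does.

```python
from collections import Counter
from math import log


def variation_of_information(labels_a, labels_b, normalized=True):
    """Return the VI distance between two labelings of the same items.

    VI = H(A|B) + H(B|A). When `normalized` is True the value is divided
    by log(n), its upper bound for n items (an assumed normalization).
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("Both labelings must cover the same items.")
    n = len(labels_a)
    if n == 0:
        raise ValueError("At least one item is required.")

    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in joint.items():
        p_ab = n_ab / n
        p_a = count_a[a] / n
        p_b = count_b[b] / n
        # Each term contributes to H(A|B) + H(B|A).
        vi -= p_ab * (log(p_ab / p_a) + log(p_ab / p_b))

    if normalized and n > 1:
        vi /= log(n)
    return vi


# Identical partitions (up to label names) have zero distance.
same = variation_of_information(["x", "x", "y", "y"], [1, 1, 2, 2])
# Partitions that share no information are further apart.
different = variation_of_information([0, 0, 1, 1], [0, 1, 0, 1])
```

A lower value means the two labelings agree more closely, so comparing this distance across subsampling schemes is one way to express "cluster accuracy" as a single number.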

https://github.com/blab/cartography/pull/69 starts this analysis by adding a separate flu workflow that doesn't do any subsampling prior to running the same embeddings and clusterings that the main flu workflows do. This approach results in data that are strongly biased regionally toward the USA and phylogenetically toward 3c3.A.

A better approach might be to update the replication analysis mentioned above to use two different subsampling schemes: even sampling and no subsampling. Then we could use the same workflow machinery to quantify the effects of both the number of samples and the bias of the samples.
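As a rough illustration of the two schemes, the sketch below contrasts "no subsampling" with "even sampling" on toy metadata. The record layout, the `region` field, the helper names `even_sample` and `composition`, and the toy counts are all hypothetical and are not taken from the cartography workflow.

```python
import random
from collections import Counter, defaultdict


def even_sample(records, key, per_group, seed=0):
    """Draw up to `per_group` records at random from each group of `key`."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sample = []
    for name in sorted(groups):  # Sorted for reproducible output order.
        members = groups[name]
        sample.extend(rng.sample(members, min(per_group, len(members))))
    return sample


def composition(records, key):
    """Return the fraction of records that fall in each group of `key`."""
    counts = Counter(record[key] for record in records)
    total = sum(counts.values())
    return {group: count / total for group, count in counts.items()}


# Toy metadata that is heavily biased toward one region (made-up counts).
records = (
    [{"strain": f"usa_{i}", "region": "USA"} for i in range(80)]
    + [{"strain": f"jpn_{i}", "region": "Japan"} for i in range(15)]
    + [{"strain": f"ken_{i}", "region": "Kenya"} for i in range(5)]
)

unsampled = records  # "No subsampling" scheme: use everything as-is.
even = even_sample(records, key="region", per_group=5, seed=1)

biased_fractions = composition(unsampled, "region")
even_fractions = composition(even, "region")
```

Running the same embedding and clustering steps on both `unsampled` and `even`, and repeating with different seeds and sample sizes, would let one replication loop report the effect of sample count and sample bias side by side.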