Determine the effect of sampling bias (e.g., regional or clade composition bias) on embeddings and cluster accuracy

In addition to analyzing how the number of sequences in each subsample affects the VI distances for the late flu and SC2 data, we should also quantify the effect of sampling bias.

https://github.com/blab/cartography/pull/69 starts this analysis by adding a separate flu workflow that doesn't do any subsampling prior to running the same embeddings and clusterings that the main flu workflows do. This approach results in data that are strongly biased regionally toward the USA and phylogenetically toward 3c3.A.

A better approach might be to update the replication analysis mentioned above to use two different subsampling schemes: even sampling and no subsampling. Then we could use the same workflow machinery to quantify the effects of both the number of samples and the bias of the samples.
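
As a rough sketch of what comparing the two schemes could look like (not the workflow's actual code), the snippet below draws an evenly sampled subset per group and computes a distance between two labelings. It assumes "VI" here means variation of information, and that each record carries a grouping field such as region or clade. The names `even_sample` and `variation_of_information` and the record layout are hypothetical and are not taken from the cartography repository.

```python
import math
import random
from collections import Counter, defaultdict


def even_sample(records, key, per_group, rng):
    """Draw up to `per_group` records from each group defined by `key`.

    Hypothetical helper: `records` is a list of dicts, and `key` names the
    field to balance on (for example "region" or "clade").
    """
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record)

    sampled = []
    for group in sorted(groups):
        members = groups[group]
        # Small groups contribute everything they have.
        sampled.extend(rng.sample(members, min(per_group, len(members))))
    return sampled


def variation_of_information(labels_a, labels_b):
    """Return VI(A, B) = H(A|B) + H(B|A) in nats for two labelings."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("labelings must be non-empty and of equal length")

    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    count_ab = Counter(zip(labels_a, labels_b))

    vi = 0.0
    for (a, b), n_ab in count_ab.items():
        p_ab = n_ab / n
        # log(p_ab / p_a) + log(p_ab / p_b), written with raw counts.
        vi -= p_ab * (math.log(n_ab / count_a[a]) + math.log(n_ab / count_b[b]))
    return vi


# Toy data (made up for illustration): a USA-heavy set of records.
records = [{"strain": f"s{i}", "region": "USA"} for i in range(10)]
records += [{"strain": f"e{i}", "region": "Europe"} for i in range(3)]

unsampled = records  # the "no subsampling" scheme
balanced = even_sample(records, key="region", per_group=3, rng=random.Random(0))

print(Counter(r["region"] for r in unsampled))
print(Counter(r["region"] for r in balanced))
```

With both subsets in hand, the same embedding and clustering steps could be run on each, and `variation_of_information(cluster_labels, clade_labels)` compared across the two schemes and across sample sizes.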

blab/cartography#82