Edits from Alli: e69c2c81660c5293043dd9ff369c7ac2f1e90be6
Comments/edits from Trevor
71
72
73
74
75
Specific tasks based on Trevor's feedback:
[x] #77
[x] #78
[x] Clarify that Table 1 represents the optimal cluster thresholds and their VI values based on training data and not the VI values for all datasets (resolved by 134f9ca)
[x] Clarify the purpose of the early/late split for flu and SC2, or reduce its prominence (resolved by 4a3d3077)
[x] State clearly in the abstract and conclusion which single method we recommend for most uses
[x] Explicitly state in the discussion that automated labeling of reassorted or recombinant lineages is future work and outside the scope of this project (resolved by 86ac1a5f)
[x] #79
[x] Clarify reliability of genetic clusters from embeddings without relying entirely on expert clade definitions (resolved by 9bfbb951)
The relationship between Euclidean and genetic distances shows how reliable embeddings are up to a certain genetic distance, which is a key component of how clusters are defined
Within/between group distances for most clusters match those of expert clades, suggesting that the diversity captured by clusters is comparable to what an expert would assign manually
Cluster accuracy for each method depends in part on sampling density
Cluster accuracy also depends on minimizing sampling bias
Specific request from Nidia: