Select a Nextstrain clade from the early SC2 dataset, such as Delta (21J), that has poor resolution of Pango lineages in all embeddings. Run all four methods on just the Delta/21J samples and calculate the distances between the clusters found within that clade and the Pango lineages for the same samples. This answers the question of whether focusing on specific clades or time periods improves cluster resolution. Add this result to the supplement, along with an embedding/tree figure for late SC2 colored by Pango lineage.
I would add this analysis as a sub-analysis of the early SC2 workflow with new rules for the following steps:
Filter metadata to only samples from Nextclade clade 21J
Filter the aligned FASTA file to only the samples from 21J
Create a distance matrix from the 21J alignment
Create a "clean" alignment for PCA from the 21J alignment
Run all four embedding methods on the corresponding alignment and/or distance matrix
Find clusters in each embedding with the optimal cluster parameters
Calculate cluster accuracy per method compared to collapsed Pango lineages from the 21J metadata file
Report this cluster accuracy in the results for early SC2
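
As a rough illustration of the first three steps above (filter metadata, filter the alignment, build a distance matrix), here is a minimal standard-library sketch. It is not the workflow's actual implementation. The column names `strain` and `Nextstrain_clade`, the toy records, and the helper functions are all hypothetical, and the simple mismatch count that skips non-ACGT characters is only a stand-in for whatever distance the real workflow uses.

```python
import csv
import io
from itertools import combinations

# Toy inputs standing in for the real metadata TSV and aligned FASTA.
# Column names and values are assumptions for illustration only.
metadata_tsv = (
    "strain\tNextstrain_clade\tpango_lineage\n"
    "A\t21J (Delta)\tAY.4\n"
    "B\t21J (Delta)\tAY.25\n"
    "C\t20I (Alpha, V1)\tB.1.1.7\n"
)
aligned_fasta = ">A\nACGTAC\n>B\nACTTNC\n>C\nTCGTAC\n"


def filter_metadata(rows, clade, clade_column="Nextstrain_clade"):
    """Keep rows whose clade label (the text before the first space) matches."""
    return [row for row in rows if row[clade_column].split(" ")[0] == clade]


def read_fasta(handle):
    """Parse a FASTA handle into a dict of sequence name -> sequence."""
    sequences = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            sequences[name] = ""
        elif name is not None:
            sequences[name] += line
    return sequences


def pairwise_distances(sequences):
    """Count mismatches per pair, skipping sites where either base is not A/C/G/T."""
    valid = set("ACGT")
    distances = {}
    for a, b in combinations(sorted(sequences), 2):
        distances[(a, b)] = sum(
            x != y and x in valid and y in valid
            for x, y in zip(sequences[a], sequences[b])
        )
    return distances


# Step 1: filter metadata to clade 21J.
rows = list(csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t"))
delta_rows = filter_metadata(rows, "21J")
delta_names = {row["strain"] for row in delta_rows}

# Step 2: filter the alignment to the same samples.
alignment = read_fasta(io.StringIO(aligned_fasta))
delta_alignment = {n: s for n, s in alignment.items() if n in delta_names}

# Step 3: distance matrix (stored as a dict keyed by sample pair).
delta_distances = pairwise_distances(delta_alignment)
```

In the workflow itself, each of these steps would presumably be its own rule with file inputs and outputs; the sketch only shows how the sample list from the filtered metadata drives the later steps.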