The current approach to connecting embeddings with the phylogeny requires readers to compare overly similar colors between tree and embedding panels. Instead, we can plot the embeddings with the branches from the phylogeny, directly showing how groups of samples are related. To do this, we need to:
- [x] Replace time trees with divergence trees in all workflows such that the branching pattern plotted on embeddings reflects only genetic data and not the additional metadata that informs the time tree topologies
- [x] Update the workflows for early/late flu and early SC2 to produce a data frame with embedding coordinates for all internal nodes and tips
  - [x] Early flu
  - [x] Late flu
  - [x] Early SC2
- [x] Update plotting notebooks to generate parent/child links between tips and their internal nodes for each embedding panel (analogous to what we already do to plot the tree by linking parental divergence and y-axis positions to tips)
- [x] Plot early/late flu and early SC2 embeddings with branches
- [ ] Update main and supplemental text to reflect updated figures
  - [x] Methods
  - [ ] Early flu
  - [ ] Late flu
  - [ ] Early SC2
One approach to including branches involves inferring the ancestral sequence for each internal node of the tree and creating separate embeddings from the tip and internal node sequences. This approach places each internal node sequence next to tips using the logic of each embedding method. However, it also biases the embeddings in three ways: it includes redundant information from internal nodes that are closely related to tips (increasing the density of samples in some parts of the embedding); it includes details from the phylogeny, so the embeddings no longer reflect an independent method but a reinterpretation of the existing phylogenetic method; and it includes details about the sample collection date and viral clock rate that the augur refine and ancestral commands use to infer the most likely internal node sequences.
Another approach to including branches would be to use only the branching pattern from the phylogeny to connect related tips with branches in the embeddings, placing each internal node at the midpoint in Euclidean embedding space between the pair of nodes it connects in the phylogeny. This approach retains the nested relationships between tips and draws them as branches, but the embeddings themselves are not influenced by any information from the phylogeny or the collection date metadata. An advantage of this approach is that it does not require separate sets of embeddings with and without ancestral sequences; it only requires overlaying the tree on the existing embeddings (an update to the plotting logic instead of the workflow logic). The main disadvantage is that internal node placements would represent a cladogram, without meaningful branch lengths in Euclidean space, instead of a dendrogram. That said, we know that Euclidean distances above a certain value in all embeddings except MDS do not correspond meaningfully to genetic distance, so branch lengths in the first method will be distorted, too.
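The midpoint placement in this second approach could be sketched roughly as follows. This is a minimal illustration and not the notebook code: the `children` mapping, the `place_internal_nodes` and `branch_segments` names, and the toy coordinates are all hypothetical, and for nodes with more than two children it assumes the mean of the child coordinates stands in for the midpoint.

```python
def place_internal_nodes(children, tip_coords, root):
    """Return coordinates for every node in the tree.

    Tips keep their embedding coordinates. Each internal node is placed at the
    mean of its children's coordinates, which is the midpoint for a bifurcation
    (using the mean for multifurcations is an assumption of this sketch).
    """
    coords = dict(tip_coords)

    # List nodes so that every parent appears before its descendants.
    order = []
    stack = [root]
    while stack:
        node = stack.pop()
        order.append(node)
        stack.extend(children.get(node, []))

    # Walk the list backwards so children are placed before their parents.
    for node in reversed(order):
        kids = children.get(node, [])
        if kids:
            points = [coords[kid] for kid in kids]
            coords[node] = tuple(sum(axis) / len(points) for axis in zip(*points))

    return coords


def branch_segments(children, coords):
    """Return one (parent, child, x0, y0, x1, y1) record per branch to draw."""
    segments = []
    for parent, kids in children.items():
        for kid in kids:
            (x0, y0), (x1, y1) = coords[parent], coords[kid]
            segments.append((parent, kid, x0, y0, x1, y1))
    return segments


# Toy tree (hypothetical data): root -> (n1 -> (A, B), C)
children = {"root": ["n1", "C"], "n1": ["A", "B"]}
tip_coords = {"A": (0.0, 0.0), "B": (2.0, 0.0), "C": (1.0, 4.0)}

coords = place_internal_nodes(children, tip_coords, "root")
segments = branch_segments(children, coords)
```

Each record in `segments` could then be drawn as a line segment underneath the existing scatterplot of tips, so the embedding coordinates of the tips themselves never change.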
Another benefit of the second approach is that we could choose which depth of the tree to plot branches for. For example, we could choose to only plot branches between the major clades. We could use the same algorithm I described above to find the internal node positions, but we could filter which internal nodes to plot to just those at the clade level.
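One way the clade-level filtering might look is sketched below, assuming we already have coordinates for all internal nodes and a set of internal nodes chosen to represent the major clades. The `parent` mapping, the `keep` set, the `clade_level_segments` name, and the toy coordinates are hypothetical. Each kept node is linked to its nearest kept ancestor, skipping any internal nodes that were filtered out.

```python
def clade_level_segments(parent, coords, keep):
    """Return (ancestor, node, x0, y0, x1, y1) records linking each kept node
    to its nearest kept ancestor, skipping internal nodes outside `keep`."""
    segments = []
    for node in sorted(keep):
        # Climb toward the root until we reach another kept node.
        ancestor = parent.get(node)
        while ancestor is not None and ancestor not in keep:
            ancestor = parent.get(ancestor)

        # The root (or a kept node with no kept ancestor) gets no branch.
        if ancestor is not None:
            segments.append((ancestor, node, *coords[ancestor], *coords[node]))

    return segments


# Toy tree (hypothetical data): root -> n1 -> (n2, n3), where n1 is not a
# clade-level node and is therefore skipped when drawing branches.
parent = {"n1": "root", "n2": "n1", "n3": "n1"}
coords = {
    "root": (0.0, 0.0),
    "n1": (1.0, 1.0),
    "n2": (2.0, 3.0),
    "n3": (3.0, 0.0),
}
keep = {"root", "n2", "n3"}

clade_segments = clade_level_segments(parent, coords, keep)
```

In this toy case both kept clade nodes connect directly to the root because the intermediate node `n1` is filtered out.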