blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Plot embeddings by clade with branches from the phylogeny for early and late H3N2 HA and early SC2 to show how embeddings correspond to phylogeny #78

Closed huddlej closed 6 months ago

huddlej commented 6 months ago

The current approach to connecting embeddings with the phylogeny requires readers to compare overly similar colors between tree and embedding panels. Instead, we can plot the embeddings with the branches from the phylogeny, directly showing how groups of samples are related. To do this, we need to:

One approach to including branches is to infer the ancestral sequence for each internal node of the tree and create a separate set of embeddings from both the tip and internal node sequences. This approach places each internal node sequence next to tips using the logic of each embedding method. However, it also biases the embeddings in three ways. First, it includes redundant information from internal nodes that are closely related to tips, which increases the density of samples in some parts of the embedding. Second, it includes details from the phylogeny, so the embeddings no longer reflect an independent method but a reinterpretation of the existing phylogenetic method. Third, it includes details about the sample collection dates and viral clock rate that the augur refine and augur ancestral commands use to infer the most likely internal node sequences.

Another approach to including branches would be to use only the branching pattern from the phylogeny to connect related tips in the embeddings, placing each internal node at the midpoint in Euclidean embedding space between the pair of nodes it joins in the phylogeny. This approach retains the nested relationships between tips and draws them as branches, but the embeddings themselves are not influenced by any information from the phylogeny or the collection date metadata. An advantage of this approach is that it does not require separate sets of embeddings with and without ancestral sequences. It only requires an overlay of the tree on top of the existing embeddings (an update to the plotting logic instead of the workflow logic). The main disadvantage is that internal node placements in Euclidean space would represent a cladogram, with no meaningful branch lengths in Euclidean space, instead of a dendrogram. That said, we know that Euclidean distances above a certain value in all embeddings except MDS do not correspond meaningfully to genetic distance, so branch lengths in the first method will be distorted, too.
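The midpoint placement described above could be sketched roughly as follows. This is a minimal illustration, not the repository's plotting code: the `children` mapping, the node names, and the tip coordinates are all hypothetical, and averaging over all children (so that polytomies get a centroid rather than a two-node midpoint) is an assumption.

```python
# Hypothetical toy tree: each internal node maps to its list of child nodes.
children = {
    "root": ["node_a", "tip_3"],
    "node_a": ["tip_1", "tip_2"],
}

# Hypothetical 2D embedding coordinates, available for tips only.
tip_positions = {
    "tip_1": (0.0, 0.0),
    "tip_2": (2.0, 0.0),
    "tip_3": (1.0, 4.0),
}


def place_internal_nodes(children, tip_positions, root="root"):
    """Place each internal node at the mean position of its children.

    For a bifurcating node this is the midpoint between its two children.
    Children are resolved first (post-order), so internal nodes deeper in
    the tree are placed before their parents.
    """
    positions = dict(tip_positions)

    def visit(node):
        if node in positions:
            return positions[node]
        child_positions = [visit(child) for child in children[node]]
        count = len(child_positions)
        # Average each embedding axis across the children.
        positions[node] = tuple(sum(axis) / count for axis in zip(*child_positions))
        return positions[node]

    visit(root)
    return positions


def branch_segments(children, positions):
    """Return one (parent_xy, child_xy) line segment per branch in the tree."""
    return [
        (positions[parent], positions[child])
        for parent, kids in children.items()
        for child in kids
    ]


positions = place_internal_nodes(children, tip_positions)
segments = branch_segments(children, positions)
```

The resulting segments could then be drawn over the existing embedding scatterplot. The recursion is fine for a toy tree, but a large phylogeny would likely need an iterative post-order traversal instead.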

Another benefit of the second approach is that we could choose which depth of the tree to plot branches for. For example, we could choose to only plot branches between the major clades. We could use the same algorithm I described above to find the internal node positions, but we could filter which internal nodes to plot to just those at the clade level.
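One way to read the clade-level filtering idea is to connect each clade-defining internal node directly to its nearest ancestor that also defines a clade, skipping the nodes in between. The sketch below assumes that interpretation. The `parent` mapping, the `clade_nodes` set, and the coordinates are hypothetical, and the positions are assumed to have been computed already (for example, by midpoint placement).

```python
# Hypothetical parent map for internal nodes (child -> parent).
parent = {
    "node_a": "root",
    "node_b": "node_a",
    "node_c": "node_b",
    "node_d": "root",
}

# Hypothetical embedding positions already assigned to internal nodes.
positions = {
    "root": (0.0, 0.0),
    "node_a": (1.0, 1.0),
    "node_b": (2.0, 1.5),
    "node_c": (3.0, 3.0),
    "node_d": (-1.0, 2.0),
}

# Hypothetical set of internal nodes that define major clades.
clade_nodes = {"root", "node_c", "node_d"}


def nearest_clade_ancestor(node, parent, clade_nodes):
    """Walk up the tree until reaching an ancestor that defines a clade."""
    current = parent.get(node)
    while current is not None and current not in clade_nodes:
        current = parent.get(current)
    return current


def clade_level_branches(parent, clade_nodes):
    """Return (ancestor, descendant) pairs linking clade-defining nodes only."""
    branches = []
    for node in sorted(clade_nodes):
        ancestor = nearest_clade_ancestor(node, parent, clade_nodes)
        if ancestor is not None:
            branches.append((ancestor, node))
    return branches


branches = clade_level_branches(parent, clade_nodes)
# Line segments in embedding space for the clade-level branches only.
segments = [(positions[a], positions[d]) for a, d in branches]
```

Here `node_a` and `node_b` are skipped, so `node_c` connects straight to `root`. Whether skipped nodes should instead bend the branch through their positions is a separate plotting choice.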