blab / pathogen-embed

Create reduced dimension embeddings for pathogen sequences
https://pypi.org/project/pathogen-embed/
MIT License

Add optional output from pathogen-embed that produces the boxplot figure of Euclidean by genetic distance #8

Closed huddlej closed 5 months ago

huddlej commented 7 months ago

In our analysis of embeddings for flu and SC2, we find that plotting pairwise Euclidean embedding distances against genetic distances is a helpful diagnostic visualization (e.g., distances for early flu HA embeddings). The boxplot implementation from the cartography project lets us see how well each embedding maintains local, intermediate, and global structure. This view also lets us find the Euclidean distance in each embedding that corresponds to a given genetic distance, which we can then use for cluster definitions. Once we know the minimum genetic distance we'd like between clusters, these plots show the Euclidean distance to use for the --distance-threshold argument to the pathogen-cluster command.
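As a rough illustration of reading a threshold off such a plot, the sketch below takes the median Euclidean distance among sample pairs at a chosen genetic distance. The function name `euclidean_threshold_for`, the use of the median as the summary statistic, and the toy values are all assumptions for illustration, not part of pathogen-embed.

```python
from statistics import median


def euclidean_threshold_for(pairs, genetic_distance):
    """Return the median Euclidean distance among pairs at one genetic distance.

    `pairs` is an iterable of (genetic_distance, euclidean_distance) tuples,
    one tuple per pair of samples. Hypothetical helper for illustration only.
    """
    matching = [euclid for genetic, euclid in pairs if genetic == genetic_distance]
    if not matching:
        raise ValueError(f"no sample pairs at genetic distance {genetic_distance}")
    return median(matching)


# Toy values, not real data.
pairs = [(1, 0.4), (1, 0.6), (2, 1.0), (2, 1.2), (2, 1.4), (5, 3.0)]
threshold = euclidean_threshold_for(pairs, 2)
print(threshold)
```

The resulting value is the kind of number one might pass to --distance-threshold; in practice the boxplot itself shows the full spread, which a single median hides.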

So, to support this kind of diagnostic for all users, we should add an optional output argument to the pathogen-embed command (shared across all method subcommands) that allows the user to specify the name of the figure that will contain the Euclidean distance by genetic distance boxplot. When this argument is provided, the command should do the following after the embedding is produced:

  1. calculate pairwise Euclidean distances from the embedding
  2. use the provided distance matrix, or calculate one from the given alignment, to plot Euclidean distance by genetic distance for each pair of samples in a boxplot
  3. save the figure to the requested file
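
The three steps above could be sketched roughly as follows. This is a minimal standard-library sketch, not the cartography implementation: the function names, the dict-based inputs, and the use of Hamming distance as the genetic distance are assumptions, and matplotlib is only imported inside the plotting helper.

```python
from itertools import combinations
from math import dist


def hamming(a, b):
    """Count mismatched positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def euclidean_by_genetic_distance(embedding, alignment=None, distance_matrix=None):
    """Group pairwise Euclidean embedding distances by genetic distance.

    embedding: dict of sample name -> coordinates in the embedding
    alignment: dict of sample name -> aligned sequence (used if no matrix given)
    distance_matrix: dict of frozenset({a, b}) -> genetic distance
    """
    if alignment is None and distance_matrix is None:
        raise ValueError("provide an alignment or a distance matrix")
    groups = {}
    for a, b in combinations(sorted(embedding), 2):
        # Step 1: pairwise Euclidean distance from the embedding.
        euclid = dist(embedding[a], embedding[b])
        # Step 2: genetic distance from the matrix, else from the alignment.
        if distance_matrix is not None:
            genetic = distance_matrix[frozenset((a, b))]
        else:
            genetic = hamming(alignment[a], alignment[b])
        groups.setdefault(genetic, []).append(euclid)
    return dict(sorted(groups.items()))


def save_boxplot(groups, path):
    """Step 3: draw one box per genetic distance and save the figure."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    ax.boxplot(list(groups.values()))
    ax.set_xticks(list(range(1, len(groups) + 1)))
    ax.set_xticklabels([str(key) for key in groups])
    ax.set_xlabel("Genetic distance")
    ax.set_ylabel("Euclidean distance")
    fig.savefig(path)
    plt.close(fig)


# Toy inputs, not real data.
embedding = {"A": (0.0, 0.0), "B": (3.0, 4.0), "C": (0.0, 1.0)}
alignment = {"A": "ACGT", "B": "TGCA", "C": "ACGA"}
groups = euclidean_by_genetic_distance(embedding, alignment=alignment)
# save_boxplot(groups, "pairwise_distances.png")  # requires matplotlib
```

Passing `distance_matrix` instead of `alignment` skips the Hamming calculation, which mirrors the "provided distance matrix or given alignment" choice in step 2.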

The argument could be named --output-pairwise-distance-figure or something like that...
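
One way to share such an option across subcommands with argparse is a parent parser, sketched below. The subcommand names `method-a` and `method-b` are placeholders, and nothing here is taken from the actual pathogen-embed command-line code; only the suggested flag name comes from this issue.

```python
import argparse


def build_parser():
    # Options shared by every method subcommand live on a parent parser.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument(
        "--output-pairwise-distance-figure",
        metavar="FIGURE",
        help="optional path for the Euclidean by genetic distance boxplot",
    )

    parser = argparse.ArgumentParser(prog="pathogen-embed")
    subparsers = parser.add_subparsers(dest="method", required=True)
    # Placeholder subcommand names, for illustration only.
    for method in ("method-a", "method-b"):
        subparsers.add_parser(method, parents=[shared])
    return parser


args = build_parser().parse_args(
    ["method-a", "--output-pairwise-distance-figure", "distances.png"]
)
print(args.output_pairwise_distance_figure)
```

When the flag is omitted the attribute is `None`, so the command can skip the figure entirely unless the user asks for it.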