blab / pathogen-embed

Create reduced dimension embeddings for pathogen sequences
https://pypi.org/project/pathogen-embed/
MIT License

Feature boxplot figure #14

Status: Closed (closed by nandsra21 5 months ago)

nandsra21 commented 5 months ago

Adds the enhancement described in https://github.com/blab/pathogen-embed/issues/8

huddlej commented 5 months ago

@nandsra21 Thank you for getting this started! I rebased this branch onto main to keep its changes separate from those in other branches, and I pushed a couple of commits that convert the plotting logic to base matplotlib so we no longer need the seaborn dependency. In some ways, commit 116372c simplifies the logic by removing the "scatterplot" function (which we needed in the context of the paper's analyses) in favor of working with scipy's "condensed distance matrix" format, where we know each entry in the genetic distance matrix maps one-to-one to the corresponding position in the Euclidean distance matrix. In other ways, the commit makes the logic more complex because we need to group Euclidean distances by genetic distance for matplotlib's boxplot function. Also, matplotlib's boxplot function takes much more verbose configuration than the seaborn implementation!
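
For readers following along, here is a minimal sketch of the grouping step described above. It is not the code from commit 116372c. It assumes both distance sets are one-dimensional arrays in the format that scipy's `pdist` returns (scipy's documentation calls this a "condensed distance matrix"), so index `i` refers to the same pair of sequences in both arrays. The function name `group_by_genetic_distance` and the toy data are hypothetical.

```python
import numpy as np
from scipy.spatial.distance import pdist


def group_by_genetic_distance(genetic, euclidean):
    """Group Euclidean distances by their paired integer genetic distance.

    Both inputs are condensed (1-D) distance arrays of the same length,
    where entry i in each array describes the same pair of samples.
    """
    genetic = np.asarray(genetic)
    euclidean = np.asarray(euclidean)
    if genetic.shape != euclidean.shape:
        raise ValueError("distance arrays must have the same shape")

    # One array of Euclidean distances per observed genetic distance.
    return {int(g): euclidean[genetic == g] for g in np.unique(genetic)}


# Toy data (hypothetical): four sequences encoded as integers and
# their positions in a two-dimensional embedding.
sequences = np.array([[0, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 1], [1, 1, 1, 1]])
embedding = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [5.0, 0.0]])

# pdist's "hamming" metric returns the proportion of mismatched sites,
# so multiply by sequence length to get a mismatch count.
genetic = np.rint(pdist(sequences, metric="hamming") * sequences.shape[1]).astype(int)
euclidean = pdist(embedding, metric="euclidean")

groups = group_by_genetic_distance(genetic, euclidean)
```

Each value in `groups` can then be passed as one box to matplotlib's `Axes.boxplot`, which accepts a sequence of arrays.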

For your reference, here is what the boxplots looked like with the seaborn implementation you originally pushed to the branch:

image

And here is what the plots look like after my changes:

image

One important but subtle difference is that the new implementation creates a boxplot entry on the x-axis for every integer value between the min and max genetic distance, even for genetic distances we didn't observe. As a result, the observed genetic distance of 76 appears on the x-axis at position x=76 (see the small gray median line tick), while our original seaborn implementation placed that same distance value at position 71 (see the small blue/gray tick in the original plot above). Our original plots for the paper therefore need to be updated, too, so that each box sits at the x-axis position that matches its genetic distance.
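
The placement difference can be illustrated with a small sketch. This is an assumption about how the two behaviors arise, not the project's actual code: a categorical axis gives each observed distance the next free slot, while an integer axis reserves a slot for every value between the min and max. The helper names (`categorical_positions`, `boxplot_inputs`, `plot_boxes`) and the toy distances are hypothetical.

```python
def categorical_positions(observed):
    """Rank-based placement: each observed distance takes the next slot."""
    return {d: i for i, d in enumerate(sorted(set(observed)))}


def boxplot_inputs(groups):
    """Build one box per integer from the min to the max genetic distance.

    `groups` maps an integer genetic distance to a list of Euclidean
    distances. Unobserved distances get an empty list, so they keep
    their slot on the x-axis without drawing a box.
    """
    positions = list(range(min(groups), max(groups) + 1))
    data = [list(groups.get(p, [])) for p in positions]
    return positions, data


def plot_boxes(ax, groups):
    """Draw the boxes on a matplotlib Axes at their true x positions."""
    positions, data = boxplot_inputs(groups)
    return ax.boxplot(data, positions=positions)


# Toy data (hypothetical): distances 3 and 4 were never observed.
groups = {0: [0.1, 0.2], 1: [0.4, 0.6], 2: [0.9], 5: [2.0, 2.5]}

slots = categorical_positions(groups)
positions, data = boxplot_inputs(groups)
```

In this toy case the rank-based placement puts genetic distance 5 in slot 3, while the integer-based placement keeps it at x=5, which is the kind of offset described above.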