blab / pathogen-embed

Create reduced dimension embeddings for pathogen sequences
https://pypi.org/project/pathogen-embed/
MIT License

Export table of mutations per cluster in a new command #20

Closed huddlej closed 2 weeks ago

huddlej commented 5 months ago

Description

For the cartography paper, we generated a table of mutations per cluster to get a sense of what distinguishes clusters from each other and whether those cluster differences could be biologically meaningful. We implemented this logic in a standalone script, but it seems natural to want this information as an optional output from the pathogen-embed tools. For example, these mutation tables could allow users to interrogate their clusters in detail and refine the cluster thresholds and other parameters for their pathogens.

The required arguments to produce this mutations table would be similar to the standalone script linked above:

Initially, I thought we might add these arguments to the pathogen-cluster command, so users could get a mutation table along with their clusters. However, I'm worried about complicating the default interface for pathogen-cluster with all of these additional arguments. Instead, the mutation table functionality could live in a standalone command with the same interface as the standalone script above. Decoupling the mutation table logic from cluster identification would also let us and other users apply that logic to other previously assigned genetic groups like clades or MCCs.

For this reason, I think we could copy most of the original script's code into a new pathogen-cluster-mutations command, including the parts that load cluster labels from different metadata input files.
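To make the idea concrete, here is a rough sketch of what the core of such a command might compute. This is not the original script's code. The function names (`find_mutations`, `mutations_per_group`), the `min_fraction` threshold, the row layout, and the toy sequences are all assumptions made for illustration. The sketch assumes aligned sequences of equal length and a plain mapping from sequence name to group label, so the label could be a cluster, a clade, or an MCC.

```python
from collections import Counter, defaultdict


def find_mutations(reference, sequence, valid="ACGT-"):
    """Return mutations like 'G3A' (1-based position) where an aligned
    sequence differs from the reference. Positions where either character
    is outside `valid` (for example, N) are skipped."""
    mutations = []
    for position, (ref, alt) in enumerate(zip(reference, sequence), start=1):
        if ref != alt and ref in valid and alt in valid:
            mutations.append(f"{ref}{position}{alt}")
    return mutations


def mutations_per_group(reference, sequences, labels, min_fraction=0.5):
    """Build rows of (group, mutation, count, fraction) for mutations carried
    by at least `min_fraction` of the sequences in each group.

    `sequences` maps sequence name to aligned sequence and `labels` maps
    sequence name to a group label. Both structures are hypothetical.
    """
    # Number of labeled sequences per group, used as the denominator.
    group_sizes = Counter(labels[name] for name in sequences if name in labels)

    # Count how many sequences in each group carry each mutation.
    counts = defaultdict(Counter)
    for name, sequence in sequences.items():
        if name not in labels:
            continue  # Skip sequences without a group assignment.
        counts[labels[name]].update(find_mutations(reference, sequence))

    rows = []
    for group in sorted(counts):
        for mutation, count in sorted(counts[group].items()):
            fraction = count / group_sizes[group]
            if fraction >= min_fraction:
                rows.append((group, mutation, count, fraction))
    return rows


# Toy data, invented for this example.
reference = "ACGTAC"
sequences = {
    "s1": "ACGTAC",
    "s2": "ACATAC",
    "s3": "ACATAC",
    "s4": "TCGTAC",
    "s5": "TCGTAN",
}
labels = {"s1": "0", "s2": "0", "s3": "0", "s4": "1", "s5": "1"}

rows = mutations_per_group(reference, sequences, labels)
for group, mutation, count, fraction in rows:
    print(group, mutation, count, round(fraction, 2))
```

Because the function only needs a name-to-label mapping, the same logic works regardless of where the labels came from, which is the main argument for keeping it separate from pathogen-cluster.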