Create separate commands for embedding, clustering, and distance matrix creation

huddlej commented 1 year ago

The current embed command does a lot of work in one interface: it creates distance matrices, embeds, and clusters with HDBSCAN. In practice, we often want to build a distance matrix first and then reuse that matrix as input for each embedding type. We also don't always want to apply clustering to the embeddings. When we do, we want to be able to run clustering many times on the same input embedding to try different cluster parameters.

I propose that we split the existing embed command into three separate top-level commands:

pathogen-distance to calculate a distance matrix from a given alignment input
pathogen-embed to embed a given alignment, optionally using a precomputed distance matrix (the matrix is still created on the fly if none is provided)
pathogen-cluster to apply HDBSCAN clustering to a given input embedding

Command interfaces might look like the following with optional inputs in square brackets:

pathogen-distance \
    --alignment alignment.fasta \
    [--indel-distance \]
    --output distance_matrix.csv

pathogen-embed \
    [pca, mds, t-sne, umap] \
    --alignment alignment.fasta \
    [--distance-matrix distance_matrix.csv \]
    [--indel-distance \]
    [--random-seed 1234 \]
    --output-data-frame embedding.csv \
    --output-figure embedding.pdf \
    [--components 2 --explained-variance \]
    [--perplexity 100 --learning-rate 100 \]
    [--nearest-neighbors 100 --min-dist 0.5]

# Note that the original `embed` arguments prefixed with "cluster"
# drop that prefix here. For example, "--cluster-min-size" becomes "--min-size".
pathogen-cluster \
    --embedding embedding.csv \
    [--random-seed 1234 \]
    [--min-size 5 \]
    [--min-samples 5 \]
    [--distance-threshold 1 \]
    --output-data-frame embedding_with_cluster_labels.csv \
    --output-figure embedding_with_cluster_labels.pdf
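
To make the motivation concrete, here is a rough sketch of the workflow this split would enable: build the distance matrix once, reuse it for several embedding methods, then sweep cluster parameters over one fixed embedding. It assumes the interfaces proposed above, which may change. The `run` helper, the output file names, and the `--min-size` values are placeholders for illustration, and PCA is left out on the assumption that it works from the alignment itself. By default the script only prints the commands, so it can be read and checked without the tools installed.

```shell
#!/bin/sh
# Sketch only: command names and flags follow the proposal in this issue
# and are assumptions, not a finished interface.
set -eu

# Hypothetical helper: print each command unless DRY_RUN=0 is set,
# in which case the command is actually executed.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Reuse one precomputed distance matrix for every distance-based method.
embed_all() {
    for method in mds t-sne umap; do
        run pathogen-embed "$method" \
            --alignment alignment.fasta \
            --distance-matrix distance_matrix.csv \
            --output-data-frame "embedding_${method}.csv" \
            --output-figure "embedding_${method}.pdf"
    done
}

# Try several cluster sizes against the same embedding.
cluster_sweep() {
    for min_size in 5 10 20; do
        run pathogen-cluster \
            --embedding embedding_t-sne.csv \
            --min-size "$min_size" \
            --output-data-frame "clusters_min_size_${min_size}.csv" \
            --output-figure "clusters_min_size_${min_size}.pdf"
    done
}

# Build the distance matrix once, then fan out.
run pathogen-distance \
    --alignment alignment.fasta \
    --output distance_matrix.csv
embed_all
cluster_sweep
```

Running the script with `DRY_RUN=0` would execute the commands instead of printing them, provided the three tools exist with these options.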

nandsra21 commented 1 year ago

[x] check pathogen_distance works
[x] check pathogen_cluster works
[x] check pathogen_embed works
[x] submit PR so John can check before I push
[x] Merge PR

nandsra21 commented 1 year ago

The PR: https://github.com/blab/pathogen-embed/pull/2

blab / cartography

Create separate commands for embedding, clustering, and distance matrix creation #33