The current embed command does a lot of work in one interface: it creates distance matrices, builds embeddings, and clusters with HDBSCAN. In practice, we often want to build a distance matrix once and then reuse it as input for each embedding type. We also don't always want to cluster the embeddings. When we do, we want to be able to run clustering many times on the same input embedding to try different cluster parameters.
I propose that we split the existing embed command into three separate top-level commands:
pathogen-distance to calculate a distance matrix from a given alignment input
pathogen-embed to embed a given alignment and optional distance matrix (still creating the matrix on the fly if it has not been provided)
pathogen-cluster to apply HDBSCAN clustering to a given input embedding
Command interfaces might look like the following with optional inputs in square brackets:
pathogen-distance \
--alignment alignment.fasta \
[--indel-distance \]
--output distance_matrix.csv
pathogen-embed \
[pca, mds, t-sne, umap] \
--alignment alignment.fasta \
[--distance-matrix distance_matrix.csv \]
[--indel-distance \]
[--random-seed 1234 \]
--output-data-frame embedding.csv \
--output-figure embedding.pdf \
[--components 2 --explained-variance \]
[--perplexity 100 --learning-rate 100 \]
[--nearest-neighbors 100 --min-dist 0.5]
# Note that original `embed` arguments referring to "cluster"
# drop that prefix here. For example, "--cluster-min-size" is just "--min-size".
pathogen-cluster \
--embedding embedding.csv \
[--random-seed 1234 \]
[--min-size 5 \]
[--min-samples 5 \]
[--distance-threshold 1 \]
--output-data-frame embedding_with_cluster_labels.csv \
--output-figure embedding_with_cluster_labels.pdf
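To make the proposed split concrete, here is a minimal sketch of how the three interfaces could be declared with Python's standard `argparse` module. It only parses arguments and does not compute anything. The function names (`build_distance_parser`, `build_embed_parser`, `build_cluster_parser`) are hypothetical, and so are the argument types, the choice of which options belong to which embedding method, and the treatment of `--explained-variance` as a flag. No default values are set, because the numbers in the interfaces above are only examples.

```python
import argparse


def build_distance_parser():
    """Hypothetical parser for the proposed pathogen-distance command."""
    parser = argparse.ArgumentParser(prog="pathogen-distance")
    parser.add_argument("--alignment", required=True)
    parser.add_argument("--indel-distance", action="store_true")
    parser.add_argument("--output", required=True)
    return parser


def build_embed_parser():
    """Hypothetical parser for the proposed pathogen-embed command."""
    # Options shared by every embedding method live in a parent parser so
    # they can follow the method name on the command line.
    shared = argparse.ArgumentParser(add_help=False)
    shared.add_argument("--alignment", required=True)
    # Optional: when omitted, the matrix would be built on the fly.
    shared.add_argument("--distance-matrix")
    shared.add_argument("--indel-distance", action="store_true")
    shared.add_argument("--random-seed", type=int)
    shared.add_argument("--output-data-frame", required=True)
    shared.add_argument("--output-figure", required=True)

    parser = argparse.ArgumentParser(prog="pathogen-embed")
    methods = parser.add_subparsers(dest="method", required=True)

    # Assumption: the grouping of method-specific options below is a guess
    # based on how the options are grouped per line in the proposal.
    pca = methods.add_parser("pca", parents=[shared])
    pca.add_argument("--components", type=int)
    # Assumption: modeled as a flag because the proposal shows no value.
    pca.add_argument("--explained-variance", action="store_true")

    methods.add_parser("mds", parents=[shared])

    tsne = methods.add_parser("t-sne", parents=[shared])
    tsne.add_argument("--perplexity", type=float)
    tsne.add_argument("--learning-rate", type=float)

    umap = methods.add_parser("umap", parents=[shared])
    umap.add_argument("--nearest-neighbors", type=int)
    umap.add_argument("--min-dist", type=float)
    return parser


def build_cluster_parser():
    """Hypothetical parser for the proposed pathogen-cluster command."""
    parser = argparse.ArgumentParser(prog="pathogen-cluster")
    parser.add_argument("--embedding", required=True)
    parser.add_argument("--random-seed", type=int)
    # The "cluster" prefix from the original embed arguments is dropped.
    parser.add_argument("--min-size", type=int)
    parser.add_argument("--min-samples", type=int)
    parser.add_argument("--distance-threshold", type=float)
    parser.add_argument("--output-data-frame", required=True)
    parser.add_argument("--output-figure", required=True)
    return parser


if __name__ == "__main__":
    # Example: reuse a precomputed distance matrix for a t-SNE embedding.
    args = build_embed_parser().parse_args([
        "t-sne",
        "--alignment", "alignment.fasta",
        "--distance-matrix", "distance_matrix.csv",
        "--output-data-frame", "embedding.csv",
        "--output-figure", "embedding.pdf",
        "--perplexity", "100",
    ])
    print(args.method, args.distance_matrix, args.perplexity)
```

Keeping the shared options in one parent parser means each embedding method accepts the same alignment, distance matrix, and output arguments, while method-specific options are rejected for methods they do not apply to.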