blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Create separate commands for embedding, clustering, and distance matrix creation #33

Closed huddlej closed 1 year ago

huddlej commented 1 year ago

The current embed command does a lot of work behind one interface: it creates distance matrices, builds embeddings, and clusters them with HDBSCAN. In practice, we often want to build a distance matrix once and then reuse that matrix as input for each embedding type. We also don't always want to cluster the embeddings. When we do cluster, we want to be able to run clustering many times on the same input embedding to try different cluster parameters.

I propose that we split the existing embed command into three separate top-level commands: pathogen-distance, pathogen-embed, and pathogen-cluster.

Command interfaces might look like the following, with optional inputs in square brackets:

pathogen-distance \
    --alignment alignment.fasta \
    [--indel-distance \]
    --output distance_matrix.csv

pathogen-embed \
    [pca, mds, t-sne, umap] \
    --alignment alignment.fasta \
    [--distance-matrix distance_matrix.csv \]
    [--indel-distance \]
    [--random-seed 1234 \]
    --output-data-frame embedding.csv \
    --output-figure embedding.pdf \
    [--components 2 --explained-variance \]
    [--perplexity 100 --learning-rate 100 \]
    [--nearest-neighbors 100 --min-dist 0.5]

# Note that original `embed` arguments referring to "cluster"
# drop that prefix here. For example, "--cluster-min-size" is just "--min-size".
pathogen-cluster \
    --embedding embedding.csv \
    [--random-seed 1234 \]
    [--min-size 5 \]
    [--min-samples 5 \]
    [--distance-threshold 1 \]
    --output-data-frame embedding_with_cluster_labels.csv \
    --output-figure embedding_with_cluster_labels.pdf
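
To illustrate the workflow this split is meant to support, here is a rough Python sketch that only builds the argument lists for the three proposed commands and prints them; it does not run anything. The flags are the ones proposed above, so they may not match what eventually gets implemented. The function name `build_commands` and the output file naming scheme are hypothetical and exist only for this example.

```python
import shlex

METHODS = ("pca", "mds", "t-sne", "umap")


def build_commands(alignment, min_sizes, methods=METHODS):
    """Return argv lists for the proposed three-stage workflow (nothing is executed)."""
    distances = "distance_matrix.csv"

    # Stage 1: build the distance matrix once.
    commands = [
        ["pathogen-distance", "--alignment", alignment, "--output", distances]
    ]

    for method in methods:
        # Hypothetical naming scheme: one embedding file per method.
        embedding = f"embedding_{method}.csv"

        # Stage 2: every embedding method reuses the same distance matrix.
        commands.append([
            "pathogen-embed", method,
            "--alignment", alignment,
            "--distance-matrix", distances,
            "--output-data-frame", embedding,
            "--output-figure", f"embedding_{method}.pdf",
        ])

        # Stage 3: cluster the same embedding again for each parameter value.
        for min_size in min_sizes:
            prefix = f"clusters_{method}_min_size_{min_size}"
            commands.append([
                "pathogen-cluster",
                "--embedding", embedding,
                "--min-size", str(min_size),
                "--output-data-frame", f"{prefix}.csv",
                "--output-figure", f"{prefix}.pdf",
            ])

    return commands


if __name__ == "__main__":
    for argv in build_commands("alignment.fasta", min_sizes=[5, 10, 20]):
        print(shlex.join(argv))
```

With three values of `--min-size`, this prints one `pathogen-distance` call, four `pathogen-embed` calls that share `distance_matrix.csv`, and three `pathogen-cluster` calls per embedding, which is the reuse pattern that is awkward with the single embed command.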
nandsra21 commented 1 year ago

The PR: https://github.com/blab/pathogen-embed/pull/2