blab / pathogen-embed

Create reduced dimension embeddings for pathogen sequences
https://pypi.org/project/pathogen-embed/
MIT License

embed: Support multiple input files for alignments and distance matrices #10

Closed huddlej closed 3 months ago

huddlej commented 7 months ago

Context

To produce embeddings for multiple gene segments like HA and NA for influenza H3N2, we currently concatenate the alignments for each gene to create a single alignment file and then calculate the distance matrix from that concatenated alignment. This concatenation step requires additional work from the user that the pathogen-embed command could easily perform itself.

Description

Ideally, users could provide multiple input files for both alignments and distance matrices to the pathogen-embed command. In this way, users could precalculate a distance matrix per gene segment and let the embed command add the distance matrices internally. The interface might look like this:

# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
  --alignment h3n2_ha_alignment.fasta \
  --output h3n2_ha_distances.csv

# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
  --alignment h3n2_na_alignment.fasta \
  --output h3n2_na_distances.csv

# Run MDS on the HA and NA distances.
pathogen-embed \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_mds.csv \
  mds

# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

This approach allows each distance matrix to be produced in parallel, for example in a Snakemake workflow, which will speed up a computationally expensive part of the analysis.
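
As a rough illustration of that parallelism outside of Snakemake, the per-segment pathogen-distance calls could be launched concurrently from Python. This is only a sketch that swaps in a thread pool for a workflow manager: the helper names (distance_command, run_all) are hypothetical, and only the --alignment and --output flags and the file naming come from the examples above.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def distance_command(segment):
    """Build the pathogen-distance call for one segment (file names follow the examples above)."""
    return [
        "pathogen-distance",
        "--alignment", f"h3n2_{segment}_alignment.fasta",
        "--output", f"h3n2_{segment}_distances.csv",
    ]


def run_all(segments, runner=subprocess.run):
    """Run one pathogen-distance process per segment at the same time.

    `runner` is a hypothetical hook so the function can be exercised
    without the CLI installed; by default it is subprocess.run.
    """
    commands = [distance_command(segment) for segment in segments]
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        # Each thread only waits on its child process, so threads are enough here.
        return list(pool.map(lambda command: runner(command, check=True), commands))
```

Calling run_all(["ha", "na"]) would start both distance calculations at once and return when both have finished. In a Snakemake workflow, the same effect would come from a single rule with a segment wildcard.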

Possible solution

To support this new functionality, the pathogen-embed command needs to:

  1. accept one or more arguments to --alignment and --distance-matrix
  2. load all given alignment files and, if more than one file is given, concatenate the alignments before running embeddings
  3. load all given distance matrix files and, if more than one file is given, sum the distances from all matrices into a single distance matrix before running embeddings
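
Step 1 could be handled by the argument parser itself. The following is a minimal sketch using Python's argparse with nargs="+". The parser layout is an assumption for illustration and is not taken from the pathogen-embed source; only the flag and subcommand names come from the examples above.

```python
import argparse


def build_parser():
    # Hypothetical parser: only the flag and subcommand names come from this issue.
    parser = argparse.ArgumentParser(prog="pathogen-embed")
    parser.add_argument("--alignment", nargs="+", help="one or more alignment FASTA files")
    parser.add_argument("--distance-matrix", nargs="+", help="one or more distance matrix CSV files")
    parser.add_argument("--output-dataframe", help="path for the embedding CSV")
    subparsers = parser.add_subparsers(dest="command", required=True)
    subparsers.add_parser("mds")
    subparsers.add_parser("t-sne")
    return parser


args = build_parser().parse_args([
    "--distance-matrix", "h3n2_ha_distances.csv", "h3n2_na_distances.csv",
    "--output-dataframe", "h3n2_ha_na_mds.csv",
    "mds",
])
print(args.distance_matrix)
```

With nargs="+", argparse reads every value up to the next option flag as part of the list, so it helps to keep a single-valued option such as --output-dataframe between the file lists and the subcommand, as the examples above already do.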

In the case where the user only provides alignments and the embedding requires a distance matrix, the command's current logic remains unchanged and operates on the concatenated alignment it produces from step 2 above.
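
The concatenation from step 2 could look roughly like the sketch below. It uses a small hand-written FASTA reader to stay self-contained, and it assumes that sequences are matched by name across files and that only strains present in every alignment are kept. Both choices are assumptions for illustration; the real command may handle missing strains differently.

```python
import io


def read_fasta(handle):
    """Parse FASTA records from a text handle into a {name: sequence} dict."""
    records = {}
    name = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Use the first word of the header line as the strain name.
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}


def concatenate_alignments(alignments):
    """Join each strain's sequences across alignments, in the order the files were given."""
    first, rest = alignments[0], alignments[1:]
    # Assumption: keep only strains that appear in every alignment.
    shared = [name for name in first if all(name in other for other in rest)]
    return {name: "".join(alignment[name] for alignment in alignments) for name in shared}


# Toy data standing in for the HA and NA alignment files.
ha = read_fasta(io.StringIO(">A\nACGT\n>B\nACGA\n"))
na = read_fasta(io.StringIO(">B\nTT\n>A\nGG\n"))
concatenated = concatenate_alignments([ha, na])
print(concatenated)  # {'A': 'ACGTGG', 'B': 'ACGATT'}
```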

It should be possible for the user to provide a single alignment file to use for PCA initialization of t-SNE, for example, and also provide multiple distance matrices to use for the embedding.
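
Summing the distance matrices from step 3 could be sketched as follows. The CSV layout assumed here (a header row of strain names plus a first column of strain names) is a guess about the pathogen-distance output, not something confirmed in this issue, and the function names are hypothetical. Values are looked up by strain name, so the matrices do not need to list strains in the same order.

```python
import csv
import io


def read_distance_matrix(handle):
    """Read a square CSV matrix with a header row and a first column of strain names."""
    reader = csv.reader(handle)
    columns = next(reader)[1:]
    return {row[0]: dict(zip(columns, map(float, row[1:]))) for row in reader}


def sum_distance_matrices(matrices):
    """Add the pairwise distances of all matrices, matching entries by strain name."""
    names = list(matrices[0])
    for matrix in matrices[1:]:
        if set(matrix) != set(names):
            raise ValueError("distance matrices must describe the same strains")
    return {a: {b: sum(m[a][b] for m in matrices) for b in names} for a in names}


# Toy data standing in for the HA and NA distance matrix files.
ha = read_distance_matrix(io.StringIO(",A,B\nA,0,2\nB,2,0\n"))
na = read_distance_matrix(io.StringIO(",B,A\nB,0,1\nA,1,0\n"))
total = sum_distance_matrices([ha, na])
print(total["A"]["B"])  # 3.0
```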

nandsra21 commented 5 months ago

Working Implementation

# Create distance matrix for H3N2 HA alignment.
pathogen-distance \
  --alignment h3n2_ha_alignment.fasta \
  --output h3n2_ha_distances.csv

# Create distance matrix for H3N2 NA alignment.
pathogen-distance \
  --alignment h3n2_na_alignment.fasta \
  --output h3n2_na_distances.csv

# Run MDS on the HA and NA distances.
# Change: alignments must also be provided.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_mds.csv \
  mds

# Run t-SNE on HA and NA alignments (for PCA initialization) and distances.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE on HA and NA alignments for PCA initialization and to calculate the distance matrix on the fly.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne

# Run t-SNE with an HA alignment for PCA initialization and HA/NA distance matrices for the embedding.
# Change: the number of alignments must match the number of distance matrices, so the NA alignment is given as well.
pathogen-embed \
  --alignment h3n2_ha_alignment.fasta h3n2_na_alignment.fasta \
  --distance-matrix h3n2_ha_distances.csv h3n2_na_distances.csv \
  --output-dataframe h3n2_ha_na_t-sne.csv \
  t-sne
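
The two changes noted in the comments above amount to a pair of input checks. A minimal sketch of such checks is shown below; the function name and messages are hypothetical, and the actual validation in pathogen-embed may differ.

```python
def check_inputs(alignments, distance_matrices):
    """Return a list of problems with the given file lists (empty when the inputs are valid)."""
    alignments = alignments or []
    distance_matrices = distance_matrices or []
    problems = []
    if not alignments:
        # "Change: must add an alignment"
        problems.append("at least one alignment is required")
    elif distance_matrices and len(alignments) != len(distance_matrices):
        # "Change: same number of alignments as distance matrices"
        problems.append(
            f"got {len(alignments)} alignment(s) but {len(distance_matrices)} distance matrices"
        )
    return problems


# The single-alignment t-SNE example from the original proposal would be rejected.
print(check_inputs(
    ["h3n2_ha_alignment.fasta"],
    ["h3n2_ha_distances.csv", "h3n2_na_distances.csv"],
))
```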