blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Process HA and NA alignments separately #122

Closed huddlej closed 1 month ago

huddlej commented 1 month ago

Description

Replaces the logic that ran distance calculations and embeddings on either an HA alignment or a concatenated HA/NA alignment with logic that gets distances and embeddings from separate HA and NA alignments. This works because pathogen-embed supports multiple input values for its alignment and distance matrix arguments. The benefit is that the pathogen-distance command can calculate distances that ignore the leading and trailing gaps in each gene's alignment, which would otherwise be counted in the concatenated alignment. Since we did not calculate indel distances for the HA/NA analysis, this workflow change should only affect the PCA embeddings. And because the new simplex encoding of PCA inputs effectively ignores gaps, even the PCA embeddings should be minimally affected. The most important aspect of this change, however, is that it demonstrates how we recommend using these tools for this kind of reassortment analysis.
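To illustrate the gap-handling point, here is a minimal Python sketch. It is not the pathogen-distance implementation; the function names (`interior`, `distance`), the toy sequences, and the rule that a gap-versus-base column counts as one mismatch are all assumptions made for illustration. It compares a distance computed per gene and then summed against a distance computed on the concatenated alignment, where each gene's terminal gaps end up in the middle of the sequence.

```python
# Illustrative sketch only: names, toy data, and the mismatch rule are
# assumptions, not the behavior of pathogen-distance.

def interior(seq, gap="-"):
    """Return (start, end) of seq with leading and trailing gaps excluded."""
    start = len(seq) - len(seq.lstrip(gap))
    end = len(seq.rstrip(gap))
    return start, end

def distance(a, b, gap="-"):
    """Count mismatched columns, skipping leading/trailing gaps of either sequence.

    Assumption: inside the shared interior region, a gap paired with a base
    counts as one mismatch (a naive rule chosen to make the effect visible).
    """
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    start = max(interior(a, gap)[0], interior(b, gap)[0])
    end = min(interior(a, gap)[1], interior(b, gap)[1])
    return sum(1 for x, y in zip(a[start:end], b[start:end]) if x != y)

# Hypothetical toy alignments for two strains, A and B.
ha = {"A": "ATGCAT--", "B": "ATGTATGG"}  # strain A has trailing gaps in HA
na = {"A": "--CCGGTA", "B": "AACCGATA"}  # strain A has leading gaps in NA

# Per-gene distances, summed across genes.
separate = distance(ha["A"], ha["B"]) + distance(na["A"], na["B"])

# Distance on the concatenated alignment: the terminal gaps of each gene
# are now internal, so they are no longer skipped.
concatenated = distance(ha["A"] + na["A"], ha["B"] + na["B"])

print(separate, concatenated)
```

In this toy case the per-gene total counts only the two true substitutions, while the concatenated comparison also counts the four gap columns at the HA/NA junction.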

Related issues

Closes #121

huddlej commented 1 month ago

Actually, this change is not something we can support yet, now that we include genetic distance clusters in the analysis. The pathogen-cluster command does not yet accept the multiple distance matrix inputs that HA/NA genetic distance clusters would require. I'm closing this to focus on more critical revisions, but we could revisit this concept later. I created an issue in pathogen-embed about this.