blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License
4 stars 1 forks source link

Revisit how to create mutation tables for HA/NA analysis #64

Closed huddlej closed 10 months ago

huddlej commented 10 months ago

We currently build our HA/NA mutations table from a concatenated reference of HA and NA sequences and concatenated alignments of HA and NA, but the resulting coordinates for mutations in NA can't correspond to meaningful positions without translating coordinates from the concatenated alignments to the NA alignments. We should revisit how we create mutation tables for multiple segments like this.

Another approach that might be better for embedding segmented genomes would be to accept multiple alignment and distance matrix inputs to pathogen-embed (one per segment). For distance-based embeddings, we can independently calculate pairwise distances for each segment and then sum the matrices by cell to get the distances across segments prior to embedding. For alignment-based embeddings like PCA, we could transform the alignments to character states independently (and, optionally, in parallel) and concatenate those matrices internally prior to PCA.

# User has to concat HA and NA alignments themselves.
pathogen-embed \
  --alignment ha_na.fasta \
  --output-dataframe pca.csv \
  pca

# Or pathogen-embed concats alignments for the user behind the scenes.
pathogen-embed \
  --alignment ha.fasta na.fasta \
  --output-dataframe pca.csv \
  pca

pathogen-distance \
  --alignment ha.fasta \
  --output ha_distances.csv

pathogen-distance \
  --alignment na.fasta \
  --output na_distances.csv

pathogen-embed \
  --alignment ha.fasta na.fasta \
  --distance-matrix ha_distances.csv na_distances.csv \
  --output-dataframe tsne.csv \
  t-sne
  --perplexity 100
  --learning-rate 100

With the above approach, we'd still produce the same embeddings but without explicitly concatenating alignments. Then, we could build mutation tables for clusters per gene segment with a column in the output indicating the segment and concatenate the resulting mutation tables. When we group the concatenated mutation tables by a given embedding cluster, we'd get mutations from each segment in that segment's original coordinate space. Actually, there isn't anything stopping us from using this approach to building the mutation tables even if we continue to use concatenated alignments as input to the embeddings. We should still get the same mutations per segment that we get from the concatenated alignments, but the mutations will have the correct coordinates for their segments.

rule seasonal_flu_reassortment_create_mutation_table:
    input:
        reference = "ha-na-nextstrain/config/reference_h3n2_{segment}.fasta",
        alignment = "ha-na-nextstrain/results/aligned_{segment}.fasta",
        embedding = "ha-na-nextstrain/results/cluster_embed_{method}_concatenated.csv",
    output:
        table = "ha-na-nextstrain/results/mutation_table_{method}_{segment}.csv",
    params:
        min_allele_count=10,
        min_allele_frequency=0.5,
    conda: "../cartography.yml"
    benchmark:
        "benchmarks/seasonal_flu_reassortment_create_mutation_table_{method}_{segment}.txt"
    log:
        "logs/seasonal_flu_reassortment_create_mutation_table_{method}_{segment}.txt"
    shell:
        """
        python3 notebooks/scripts/cluster_mutation.py \
            --reference-sequence {input.reference} \
            --alignment {input.alignment} \
            --metadata {input.embedding} \
            --metadata-column {wildcards.method}_concatenated_label \
            --min-allele-count {params.min_allele_count} \
            --min-allele-frequency {params.min_allele_frequency} \
            --annotations segment={wildcards.segment} \
            --output {output.table} 2>&1 | tee {log}
        """

rule seasonal_flu_reassortment_concat_mutation_tables:
    input:
        mutation_tables=expand("ha-na-nextstrain/results/mutation_table_{method}_{segment}.csv", method=EMBEDDING_METHODS, segment=SEGMENTS),
    output:
        metadata = "ha-na-nextstrain/results/mutation_table.csv",
    params:
        column = "metadata_column",
    conda: "../cartography.yml"
    shell:
        """
        python3 notebooks/scripts/concatenate_tables.py \
            --tables {input.mutation_tables} \
            --separator ',' \
            --sort-by {params.column} \
            --output {output.metadata}
        """