blab / cartography

Dimensionality reduction distills complex evolutionary relationships in seasonal influenza and SARS-CoV-2
https://doi.org/10.1101/2024.02.07.579374
MIT License

Cluster specific mutations #25

Closed: huddlej closed this issue 1 year ago

huddlej commented 1 year ago

One approach to finishing this functionality is to call BioPython's "dumb" consensus function once per cluster id inside a single for loop. If we keep the sequences as SeqRecord instances when we read them in from the alignment, the algorithm looks like:

  1. Read sequences as SeqRecords into a mapping of strain name to SeqRecord
  2. Read a mapping of strain name to cluster id
  3. For each cluster id in the cluster ids:
    1. Create a list of SeqRecords from the strains in the cluster
    2. Create a MultipleSeqAlignment from the list of records
    3. Create a dumb consensus from the MultipleSeqAlignment
    4. Write the dumb consensus (named by cluster id) to an open consensus FASTA file handle
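
The loop above can be sketched without any dependencies. This is a rough illustration, not the script from the PR: `majority_consensus` is a simplified stand-in for BioPython's dumb consensus (it does not treat gaps specially), plain strings stand in for SeqRecords, and the function names, the 0.7 threshold, and the "N" ambiguity character are all assumptions made for the example.

```python
from collections import Counter, defaultdict


def majority_consensus(sequences, threshold=0.7, ambiguous="N"):
    """Per-column consensus of aligned, equal-length sequences.

    Simplified stand-in for a "dumb" consensus: keep the most common
    character in a column when its frequency reaches `threshold`,
    otherwise emit `ambiguous`.
    """
    if not sequences:
        raise ValueError("cannot build a consensus from zero sequences")
    length = len(sequences[0])
    if any(len(sequence) != length for sequence in sequences):
        raise ValueError("sequences must be aligned to the same length")

    consensus = []
    for column in zip(*sequences):
        character, count = Counter(column).most_common(1)[0]
        consensus.append(character if count / len(column) >= threshold else ambiguous)
    return "".join(consensus)


def cluster_consensus(sequence_by_strain, cluster_by_strain):
    """Return a mapping of cluster id to consensus sequence."""
    # Steps 1-2: both inputs are mappings keyed by strain name.
    strains_by_cluster = defaultdict(list)
    for strain, cluster in cluster_by_strain.items():
        if strain in sequence_by_strain:  # skip strains missing from the alignment
            strains_by_cluster[cluster].append(strain)

    # Step 3: one consensus per cluster id.
    return {
        cluster: majority_consensus([sequence_by_strain[strain] for strain in strains])
        for cluster, strains in sorted(strains_by_cluster.items())
    }


def write_fasta(consensus_by_cluster, handle):
    """Write each consensus to an open handle, named by its cluster id."""
    for cluster, sequence in consensus_by_cluster.items():
        handle.write(f">{cluster}\n{sequence}\n")
```

With BioPython available, the body of `majority_consensus` would be replaced by building a MultipleSeqAlignment from the cluster's SeqRecords and asking it for a consensus; the grouping and writing steps would stay the same.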

We also want to choose the cluster id column from the metadata through a --group-by argument to the consensus script, so we can pass in "MCC", "clade_membership", "mds_label", etc. The update proposed to the MERS Snakefile in this PR shows an example of how we want to parameterize the Snakemake rule for consensus sequences by embedding method, so that we get cluster-specific mutations for each method. The final consensus table will need a column for the embedding method in addition to the pathogen, position, and mutation information it already includes.
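
A minimal sketch of how the --group-by argument might be wired up is shown below. Only --group-by comes from the discussion above; the --metadata and --method options, the "strain" column name, the tab-delimited metadata format, and the helper names are assumptions for illustration and may not match the actual script.

```python
import argparse
import csv


def build_parser():
    """Command-line interface for a hypothetical consensus script."""
    parser = argparse.ArgumentParser(description="Build per-cluster consensus sequences")
    # Hypothetical option: path to a tab-delimited metadata table.
    parser.add_argument("--metadata", required=True, help="metadata TSV with one row per strain")
    parser.add_argument(
        "--group-by",
        required=True,
        help="metadata column holding the cluster id, e.g. MCC, clade_membership, mds_label",
    )
    # Hypothetical option: label recorded in the final table's method column.
    parser.add_argument("--method", help="embedding method that produced the clusters")
    return parser


def read_clusters(handle, group_by, strain_column="strain"):
    """Map strain name to cluster id using the requested metadata column."""
    reader = csv.DictReader(handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    for column in (strain_column, group_by):
        if column not in fieldnames:
            raise KeyError(f"metadata is missing the '{column}' column")
    # Strains without a cluster id (empty cell) are left out of the mapping.
    return {row[strain_column]: row[group_by] for row in reader if row[group_by]}


def add_method_column(rows, method):
    """Return copies of the mutation rows with the embedding method attached."""
    return [{**row, "method": method} for row in rows]
```

A Snakemake rule could then run the script once per embedding method, passing the method's label column to --group-by and the method name to --method, so that the per-method tables can be concatenated into one final table.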