One approach to finishing this functionality would be to use Biopython's "dumb" consensus function inside a single for loop over the cluster ids. When we read sequences in from the alignment, we can keep them as `SeqRecord` instances, so the algorithm looks like:
1. Read sequences as `SeqRecord`s into a mapping of strain name to `SeqRecord`
2. Read a mapping of strain name to cluster id
3. For each cluster id in the cluster ids:
    - Create a list of `SeqRecord`s from the strains in the cluster
    - Create a `MultipleSeqAlignment` from the records list
    - Create a dumb consensus from the `MultipleSeqAlignment`
    - Write the dumb consensus (named by cluster id) to the open consensus FASTA file handle
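The steps above can be sketched roughly as follows. This is a dependency-free illustration, not the script itself: sequences are plain strings instead of `SeqRecord`s, and `simple_consensus` is a hypothetical stand-in for Biopython's dumb consensus (a simplified majority rule with an assumed threshold of 0.7 and "N" as the ambiguous character). The function names are made up for this example.

```python
from collections import Counter, defaultdict
import io


def simple_consensus(sequences, threshold=0.7, ambiguous="N"):
    """Majority-rule consensus over equal-length aligned sequences.

    Gap characters are ignored when counting. A column gets `ambiguous`
    when no residue reaches `threshold` of the non-gap characters.
    This is a simplified stand-in, not Biopython's implementation.
    """
    consensus = []
    for column in zip(*sequences):
        counts = Counter(base for base in column if base not in "-.")
        if not counts:
            # Column contains only gaps.
            consensus.append(ambiguous)
            continue
        base, count = counts.most_common(1)[0]
        total = sum(counts.values())
        consensus.append(base if count / total >= threshold else ambiguous)
    return "".join(consensus)


def consensus_by_cluster(sequence_by_strain, cluster_by_strain):
    """Return a mapping of cluster id to consensus sequence."""
    strains_by_cluster = defaultdict(list)
    for strain, cluster in cluster_by_strain.items():
        # Skip strains that have a cluster id but no aligned sequence.
        if strain in sequence_by_strain:
            strains_by_cluster[cluster].append(strain)

    return {
        cluster: simple_consensus([sequence_by_strain[s] for s in strains])
        for cluster, strains in strains_by_cluster.items()
    }


def write_fasta(consensus_by_id, handle):
    """Write one FASTA record per cluster, named by cluster id."""
    for cluster, sequence in consensus_by_id.items():
        handle.write(f">{cluster}\n{sequence}\n")


# Toy data: three strains in cluster "1", two in cluster "2".
sequences = {"a": "ACGT", "b": "ACGA", "c": "ACGT", "d": "TTTT", "e": "TTTA"}
clusters = {"a": "1", "b": "1", "c": "1", "d": "2", "e": "2"}

result = consensus_by_cluster(sequences, clusters)
output = io.StringIO()  # stands in for the open consensus FASTA file handle
write_fasta(result, output)
```

Grouping the strains by cluster once, before the loop, keeps this to a single pass over the cluster ids. In the real script the inner call would build a `MultipleSeqAlignment` from the cluster's records and take its consensus instead of calling `simple_consensus`.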
We also want to parameterize the cluster id from the metadata using a `--group-by` argument to the consensus script, so we can pass in "MCC", "clade_membership", "mds_label", etc. The update proposed to the MERS Snakefile in this PR shows an example of how we want to parameterize the Snakemake rule for consensus sequences by embedding method, so we can get cluster-specific mutations per method. The final consensus table will need to include a column for the embedding method along with the pathogen, position, and mutation information that it already includes.
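As a rough sketch of the `--group-by` parameterization, the script could read the cluster column named on the command line from the metadata. Everything other than `--group-by` is an assumption made for illustration: the `--metadata` and `--method` arguments, the `strain` column name, the `method` output column, and the function names are all hypothetical.

```python
import argparse
import csv
import io


def build_parser():
    parser = argparse.ArgumentParser(
        description="Build per-cluster consensus sequences."
    )
    # Hypothetical argument: TSV with one row per strain.
    parser.add_argument("--metadata", required=True)
    parser.add_argument(
        "--group-by",
        required=True,
        help="metadata column with cluster ids, e.g. MCC, clade_membership, mds_label",
    )
    # Hypothetical argument: embedding method to record in the output table.
    parser.add_argument("--method")
    return parser


def read_cluster_by_strain(handle, group_by, strain_column="strain"):
    """Map strain name to cluster id using the `group_by` column."""
    reader = csv.DictReader(handle, delimiter="\t")
    fieldnames = reader.fieldnames or []
    if group_by not in fieldnames:
        raise ValueError(f"Column '{group_by}' not found in metadata")
    # Strains without a cluster id in this column are left out.
    return {row[strain_column]: row[group_by] for row in reader if row.get(group_by)}


def tag_rows_with_method(rows, method):
    """Add the embedding method to each row of the consensus table."""
    return [{**row, "method": method} for row in rows]


# Usage with in-memory data in place of real files.
args = build_parser().parse_args(
    ["--metadata", "metadata.tsv", "--group-by", "mds_label", "--method", "mds"]
)
metadata = io.StringIO("strain\tmds_label\nA\t0\nB\t1\nC\t\n")
cluster_by_strain = read_cluster_by_strain(metadata, args.group_by)
```

With this shape, the Snakemake rule only has to vary the values passed for the cluster column and the method per embedding, and the script itself stays unaware of which embedding it is summarizing.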