ammaraziz / flukit

GNU Lesser General Public License v3.0

subcommand for phylo construction and plotting #5

Open ammaraziz opened 1 year ago

ammaraziz commented 1 year ago

For each batch processed, an annotated tree must be constructed for each gene. Currently, an R script handles the alignment, tree construction, and plotting.

Example usage:

flukit treeplot --sequences {Path/input.fasta} --lineage {lineage} --output-dir {Path}
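
A rough sketch of how the three options above could be parsed. It uses the standard-library argparse purely for illustration; the CLI framework flukit actually uses is not stated here, and the lineage value `h3n2` is a placeholder, not a confirmed lineage name.

```python
import argparse
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with a hypothetical `treeplot` subcommand."""
    parser = argparse.ArgumentParser(prog="flukit")
    subparsers = parser.add_subparsers(dest="command", required=True)

    treeplot = subparsers.add_parser(
        "treeplot", help="build and plot an annotated tree per gene"
    )
    treeplot.add_argument("--sequences", type=Path, required=True,
                          help="input multifasta")
    treeplot.add_argument("--lineage", required=True,
                          help="lineage used to pick the reference fasta")
    treeplot.add_argument("--output-dir", type=Path, required=True,
                          help="directory for the pdf and tsv outputs")
    return parser


# Example invocation with placeholder values (assumptions, not real paths).
args = build_parser().parse_args(
    ["treeplot", "--sequences", "input.fasta",
     "--lineage", "h3n2", "--output-dir", "out"]
)
print(args.command, args.sequences, args.lineage, args.output_dir)
```

argparse stores `--output-dir` as `args.output_dir`.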

Input:

multifasta with headers such as XXXX.4, XXXX.6

Output:

pdf of annotated tree(s), with references colored red and samples colored blue. Extra info, such as how the trees were generated (methods) and the date of generation, would be useful.
tsv of Closest Prototypic Virus (CPV) with the headers seqno, result, where result is the CPV

The subcommand handles the reference fasta, which is detected from the lineage. The package must include fasta datasets for each subtype and each gene.

Notes:

Reuse the alignment functions from align_frames.py
Use Biopython for tree generation to avoid extra/external deps https://biopython.org/wiki/Phylo
Use toyplot for tree plotting

Questions:

How to find the CPV?
- Given a list of known CPVs per lineage, first calculate the distance from every sample to each known CPV to generate a matrix. The closest ancestor per sample is its CPV. For tie breakers a priority list is needed; probably the oldest vaccine strain or CPV is used.
- CPV.tsv with the headers cpv, gene, priority, where priority is per gene. Use previous plots to create the priority list.
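
A minimal sketch of the selection logic described above, under stated assumptions. It uses a simple pairwise p-distance on already-aligned sequences as a stand-in for whatever distance is eventually chosen (for example a tree distance). The names `p_distance` and `closest_cpv`, the sample data, and the rule that a lower priority number wins a tie are all hypothetical.

```python
import csv
import io


def p_distance(a: str, b: str) -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap or N in either sequence are skipped.
    """
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    diffs = 0
    for x, y in zip(a.upper(), b.upper()):
        if x in "-N" or y in "-N":
            continue
        compared += 1
        if x != y:
            diffs += 1
    return diffs / compared if compared else float("inf")


def closest_cpv(samples, cpvs, priority):
    """Map each sample to its nearest CPV; ties go to the lowest priority number."""
    result = {}
    for seqno, seq in samples.items():
        result[seqno] = min(
            cpvs,
            key=lambda name: (p_distance(seq, cpvs[name]), priority[name]),
        )
    return result


# Toy aligned data (made up for illustration only).
cpvs = {"CPV_A": "ACGTACGT", "CPV_B": "ACGTTCGA"}
priority = {"CPV_A": 2, "CPV_B": 1}  # assumed: 1 = highest priority
samples = {"XXXX.4": "ACGTACGA", "XXXX.6": "ACGTACGT"}

closest = closest_cpv(samples, cpvs, priority)

# Write the result in the tsv layout from the issue: seqno, result.
buffer = io.StringIO()
writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
writer.writerow(["seqno", "result"])
for seqno, cpv in closest.items():
    writer.writerow([seqno, cpv])
tsv_text = buffer.getvalue()
print(tsv_text)
```

In the toy data, XXXX.4 differs from both CPVs at exactly one site, so the priority list decides the result. Swapping in a tree-based distance would only require replacing `p_distance`.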

Tasks:

[ ] Parse fasta file and combine with reference set
[ ] Parse or get from fuzee meta files
[ ] Get low reactors from fuzee
[ ] Rename sequences
[ ] Plot trees using toyplot
[ ] Calculate CPV using the .get_distance method from the ete3 package
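
For the first task, a possible starting point is sketched below. It is a plain-Python stand-in so the example has no dependencies; in practice Biopython's `Bio.SeqIO.parse` would likely do the parsing. The function names `parse_fasta` and `combine`, the role labels, and the example records are hypothetical. Each record is tagged with its role so that a later plotting step can color references and samples differently.

```python
def parse_fasta(text: str) -> dict[str, str]:
    """Parse multifasta text into {header_id: sequence}.

    The id is the first whitespace-delimited token after '>'.
    """
    records: dict[str, list[str]] = {}
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">"):
            tokens = line[1:].split()
            if not tokens:
                raise ValueError("empty fasta header")
            name = tokens[0]
            if name in records:
                raise ValueError(f"duplicate header: {name}")
            records[name] = []
        elif name is None:
            raise ValueError("sequence data before first header")
        else:
            records[name].append(line)
    return {key: "".join(parts) for key, parts in records.items()}


def combine(samples: dict[str, str], references: dict[str, str]):
    """Return (name, sequence, role) tuples, references first."""
    clash = samples.keys() & references.keys()
    if clash:
        raise ValueError(f"names present in both sets: {sorted(clash)}")
    combined = [(n, s, "reference") for n, s in references.items()]
    combined += [(n, s, "sample") for n, s in samples.items()]
    return combined


# Made-up records for illustration only.
samples = parse_fasta(">XXXX.4 sample\nACGT\nAC\n>XXXX.6\nGGTT\n")
references = parse_fasta(">REF_1\nACGTAA\n")
combined = combine(samples, references)
for name, seq, role in combined:
    print(name, role, len(seq))
```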