PoonLab / covizu

Rapid analysis and visualization of coronavirus genome variation
https://filogeneti.ca/CoVizu/
MIT License
45 stars 20 forks source link

XBB tree not rooting on XBB #515

Closed ArtPoon closed 5 months ago

ArtPoon commented 6 months ago
ArtPoon commented 6 months ago
ArtPoon commented 6 months ago

I should clarfiy - we are not actually using outgroup rooting - I am proposing to re-root the tree on the terminal branch leading to XBB, since that is by definition the root of the XBB subtree

ArtPoon commented 5 months ago

Here's the part of the batch.py script where the trees are being built: https://github.com/PoonLab/covizu/blob/10a40174019acf76ce417631a495dcba83da4908/batch.py#L191-L204

ArtPoon commented 5 months ago

It is somewhere in here that we need to manually root the XBB tree (the non-recombinant tree can be clock-rooted): https://github.com/PoonLab/covizu/blob/10a40174019acf76ce417631a495dcba83da4908/covizu/utils/batch_utils.py#L77-L84

ArtPoon commented 5 months ago

Doing some experiments locally using the JSON generated for #512

ArtPoon commented 5 months ago

I should clarfiy - we are not actually using outgroup rooting - I am proposing to re-root the tree on the terminal branch leading to XBB, since that is by definition the root of the XBB subtree

Sigh. I was wrong. We are outgroup rooting with the reference genome in the fasttree function: https://github.com/PoonLab/covizu/blob/10a40174019acf76ce417631a495dcba83da4908/covizu/treetime.py#L46-L54

Note reference is added to the FASTA by retrieve_genomes: https://github.com/PoonLab/covizu/blob/10a40174019acf76ce417631a495dcba83da4908/covizu/treetime.py#L190-L197

ArtPoon commented 5 months ago

So I think the easiest fix is to add an argument to retrieve_genomes to load an outgroup sequence from file. This needs to have been aligned against the reference in the same way that all other genomes have been aligned. We can then pass this as the reference entry in the lineages and seqs lists, while reserving the actual reference for reconstituting sequences from the feature sets of lineage representative genomes.

ArtPoon commented 5 months ago

Based on the original pango-designation milestone for XBB, I propose to use sample NY-PRL-220907_01F03 as the outgroup sequence for XBB. Unfortunately this genome sequence is not available on Genbank (the closest match by blastn is OP875194.1, with at least 7 nt differences), so we cannot deposit this outgroup sequence in the repo.

ArtPoon commented 5 months ago

Aligned to reference with command:

python covizu/minimap2.py gisaid_hcov-19_2024_03_12_18.fasta -o xbb_outgrp.fasta -a
ArtPoon commented 5 months ago

XBB timetree when rooting with reference sequence (WH1):

XBB timetree rooting with XBB outgroup:

ArtPoon commented 5 months ago

I also released the constraint of clock rate to 8e-5. This is the resulting RTT plot for the XBB tree: none

ArtPoon commented 5 months ago

Pushed changes in new branch iss515

ArtPoon commented 5 months ago

This is the script that I used for testing:

import covizu
from covizu.utils import gisaid_utils
from covizu.utils.progress_utils import Callback
from covizu.utils.batch_utils import *
import os
from argparse import Namespace
from datetime import datetime

cb = Callback()

args = Namespace()
args.infile = ""
args.ref = os.path.join(covizu.__path__[0], "data/NC_045512.fa")
args.vcf = os.path.join(
    covizu.__path__[0],
    "data/ProblematicSites_SARS-CoV2/problematic_sites_sarsCov2.vcf")
args.alias = os.path.join(
        covizu.__path__[0], 
        "data/pango-designation/pango_designation/alias_key.json")
args.lineages = os.path.join(
        covizu.__path__[0], "data/pango-designation/lineages.csv")
args.ttbin = "treetime"
args.clock = None  # 8e-4
args.datetol = 0.1
args.ft2bin = 'fasttree'
args.outdir = "iss515/"

with open("iss512.json") as handle:
    by_lineage = json.load(handle)

aliases = parse_alias(args.alias) 
designation = {}
for prefix, truename in aliases.items():
    if type(truename) is list:
        designation.update({prefix: {
            'type': 'XBB' if prefix == 'XBB' else 'recombinant',
            'fullname': '/'.join(truename)
        }})
    else:
        designation.update({prefix: {
            'type': 'XBB' if truename.startswith("XBB") else 'non-recombinant',
            'fullname': truename
        }})

# use results to partition by_lineage database
non_recomb = {}
xbb = {}
other_recomb = {}
for lineage, ldata in by_lineage.items():
    # Put unassigned lineages in non-recombinant category
    if lineage.lower() == "unassigned":
        non_recomb.update({lineage: ldata})
        continue

    prefix = lineage.split('.')[0]
    category = designation[prefix]['type']
    if category == 'non-recombinant':
        non_recomb.update({lineage: ldata})
    elif category == 'XBB':
        xbb.update({lineage: ldata})
    else:
        other_recomb.update({lineage: ldata})

timetree, residuals = build_timetree(non_recomb, args, callback=cb.callback, debug=True)
timestamp = datetime.now().isoformat().split('.')[0]
nwk_file = os.path.join(args.outdir, 'timetree.{}.nwk'.format(timestamp))
with open(nwk_file, 'w') as handle:
    Phylo.write(timetree, file=handle, format='newick')

xbb_file = os.path.join(args.outdir, 'xbbtree.{}.nwk'.format(timestamp))
with open(xbb_file, 'w') as handle:
    if xbb is not None:
        timetree_xbb, residuals_xbb = build_timetree(
            xbb, args, outgroup="xbb_outgrp.fasta", debug=True, callback=cb.callback)
        residuals.update(residuals_xbb)
        Phylo.write(timetree_xbb, file=handle, format='newick')
ArtPoon commented 5 months ago

@GopiGugan I have copied xbb_outgrp.fasta to the /home/covizu/covizu/covizu/data folders on both Paphlagon and BEVi

ArtPoon commented 5 months ago

The downside of requiring this external FASTA file is that someone will not be able to run the backend without it. We could either add instructions for generating this file to the INSTALL.md or CONTRIBUTING.md documentation, or add a failsafe when the FASTA file is not present (i.e., use XBB representative genome as outgroup, but do not remove it!).