ChrisHIV / hiv_vb_variant

4 stars 0 forks source link

where to find the data and code for Fig.2? #4

Closed liamxg closed 3 months ago

liamxg commented 3 months ago

Dear @ChrisHIV,

Could you please tell me where to find the data and code for Fig.2? Thanks.

ChrisHIV commented 3 months ago

Only some of that data was made public: the 17 VB-variant whole genomes are at GenBank with accession numbers MT458931 to MT458935 and MW689459 to MW689470 (the two putative recombinants have accession numbers MW689465 and MW689466). The partial pol sequences are not publicly available; if you'd like to use them for a research project you can submit a proposal to the cohort that owns them, ATHENA, here. (HIV is a special case among infectious diseases in terms of the sensitivity of data and how this weighs against usual considerations to make all data public.)

liamxg commented 3 months ago

Dear @ChrisHIV,

Thanks for your kindly help. Could you please send me the code, very interested about this, thanks again.

ChrisHIV commented 3 months ago

The scientific process underlying that figure was done through (a) the BEAST software, which the user does not control by code but through a GUI; (b) a single command-line call to the IQtree software, which I'm afraid I didn't save exactly. The paper describes the options used for both processes.

My visualisation code used here is scattered over various files, badly documented, using data you don't have access to, so I can't usefully share. Manual pdf editing then generated the final results. I can describe some of the steps included in the code if you're looking for hints to create similar-looking trees with other data.

First these functions implemented the ancestral state reconstruction of region that was used for the tree that's internally coloured.

library(ape)
library(phangorn)
library(tidyverse)
library(phyloscannerR)
library(ggtree)
library(RColorBrewer)

sankoff.up <- function(tree, node, tip.states, existing = list()){

  unique.characters <- sort(unique(tip.states))

  if(is.tip(tree, node)){

    scores <- rep(Inf, length(unique.characters))
    scores[unique.characters == tip.states[node]] <- 0
  } else {
    kids <- Children(tree, node)
    for(kid in kids){
      existing <- sankoff.up(tree, kid, tip.states, existing)
    }

    scores <- map_dbl(1:length(unique.characters), function(x){
      sum(map_dbl(kids, function(kid){
        kiddy.scores <- existing[[kid]]
        min(map_dbl(1:length(unique.characters), function(y){
          if(x==y){
            0 + kiddy.scores[y]
          } else {
            1 + kiddy.scores[y]
          }
        }))
      }))
    })

  }
  existing[[node]] <- scores
  existing
}

sankoff.down <- function(tree, node, state.vec, scores, out.vec = rep(NA, length(a.tree$tip.label) + a.tree$Nnode)){
  state.vec <- sort(state.vec)

  current.scores <- scores[[node]]
  if(is.root(tree, node)){
    out.vec[node] <- state.vec[which(current.scores==min(current.scores))[1]]
  } else {
    parent = Ancestors(tree, node, type="parent")
    parent.state = out.vec[parent]
    current.scores[which(state.vec != parent.state)] <- current.scores[which(state.vec != parent.state)] + 1
    out.vec[node] <- state.vec[which(current.scores==min(current.scores))[1]]
  }
  if(!is.tip(tree, node)){
    for(kid in Children(tree, node)){
      out.vec <- sankoff.down(tree, kid, state.vec, scores, out.vec)
    }
  }

  out.vec
}

Then for the tree with reconstructed region I have

# Read in the tree file into 'tree' variable

# Merge the regions data into the tree
tree <- as_tibble(tree)
tree <- left_join(tree, df.regions, by = "ID_new")

# This appears to be necessary to get node date uncertainties to display
# correctly: offset all their uncertainty boundaries by the midpoint.
# Also offsetting a vector of lists can't be vectorised >:(
tree$height_0.95_HPD_offset <- vector(mode="list", length =  213)
for (i in 1:nrow(tree)) {
  tree$height_0.95_HPD_offset[[i]] <- tree$height_0.95_HPD[[i]] - tree$height[[i]]
}

tree <- as.treedata(tree)

# Now delete all the node date uncertainties except the one at the root, ready for displaying the bar only for the node:
tree@data[tree@data$node != getMRCA(tree@phylo, tree@phylo$tip.label), ]$height_0.95_HPD_offset <- NA

# Make a vector of regions in the same order as the tip labels
df.regions.vec <- df.regions$HospitalRegion
names(df.regions.vec) <- df.regions$ID_new
df.regions.vec <- df.regions.vec[na.omit(tree@data$ID_new)]

# Ancestral reconstruction of region for internal nodes of the tree
region.reconstruction.scores <- sankoff.up(tree@phylo, getRoot(tree@phylo), df.regions.vec, list())
region.reconstruction <- sankoff.down(tree@phylo, getRoot(tree@phylo), unique(df.regions.vec),
                                      region.reconstruction.scores,
                                      rep(NA, length(tree@phylo$tip.label) + tree@phylo$Nnode))
region.reconstruction <- factor(region.reconstruction, ordered = T,
                                levels = c("Amsterdam", "Netherlands, N",
                                           "Netherlands, E", "Netherlands, S",
                                           "Netherlands, W",
                                           "Netherlands, unknown", "Belgium",
                                           "Switzerland"))

# Plot the tree
tree.display <- ggtree(tree, size=0.5, right=F, mrsd='2014-08-04', 
                       aes(col = region.reconstruction)) +
  theme_tree2() + theme_minimal() +
  theme(legend.position = "right",
        legend.title=element_blank(),
        panel.grid.major.x = element_line(color = "grey85", size=0.5),
        panel.grid.minor.x = element_blank(), #element_line(color = "grey85", size=0.25),
        axis.title=element_blank(),
        axis.text=element_blank(),
        axis.ticks=element_blank(),
        panel.grid.major.y = element_blank(),
        panel.grid.minor.y = element_blank(),
        legend.text=element_text(size=14)) +
  geom_rootedge(rootedge = 3, color="#E41A1C") + 
  #scale_x_continuous(minor_breaks = c(1997.5, 2002.5, 2007.5, 2012.5)) +
  coord_cartesian(xlim = c(earliest.year, 2015)) + 
  #geom_range(range='height_0.95_HPD_offset', color='gray', alpha=.7, size=1.5) +
  #geom_tippoint(aes(fill=HospitalRegion), shape = 21, size = 1) +
  #scale_fill_brewer(palette= "Set1") +
  #geom_tippoint(aes(colour=HospitalRegion), size = 1) +
  scale_color_brewer(palette= "Set1") +
  #geom_tiplab(aes(label=ID_new, fill=factor(HospitalRegion)),
  #            size = 2, color = "black", geom = "label",
  #            label.padding = unit(0.02, "lines"), label.size = 0) +
  theme(aspect.ratio=1) 

For the tree with viral loads I have

# Read in the tree file into 'tree' variable
# Read the viral loads, used for colouring the tips, into 'df.vl' variable
  tree <- as_tibble(tree)
  tree <- left_join(tree, df.vl)

  p <- ggtree(as.treedata(tree), layout="circular", right=T) + theme_tree(legend.position = "right") + #aes(color=group)
    #scale_color_manual(values=c(color.not.lin, color.lin)) +
    xlim(-0.02, NA) +
    geom_treescale(x=0.05, y=101.5, offset=3, width=0.05, fontsize = 6) +
    geom_tippoint(aes(fill=`log10(viral load)`), shape = 21, size = 4) +
    scale_fill_viridis(name = expression('log'[10]*'(viral load)')) +
    theme(legend.text=element_text(size=15),
          legend.title=element_text(size=17))
liamxg commented 3 months ago

Dear @ChrisHIV,

Really cool! That's help me a lot. Thanks.

BTW, could you please tell me how to integrate violin plots into the tree? Thanks.

image
ChrisHIV commented 3 months ago

That was manual pdf editing: I created a separate file with the violin plot, including two grey vertical lines at the same x axis values (i.e. calendar times) that the root node of the tree sits in between, imported the violin into the tree file, and moved & stretched it until the pairs of grey lines overlapped i.e. manually aligning the two x axes

liamxg commented 3 months ago

Dear @ChrisHIV,

Cool! That's a great idea!

I plot my data like the below previously:

image