Closed liamxg closed 3 months ago
Only some of that data was made public: the 17 VB-variant whole genomes are at GenBank with accession numbers MT458931 to MT458935 and MW689459 to MW689470 (the two putative recombinants have accession numbers MW689465 and MW689466). The partial pol sequences are not publicly available; if you'd like to use them for a research project you can submit a proposal to the cohort that owns them, ATHENA, here. (HIV is a special case among infectious diseases in terms of the sensitivity of data and how this weighs against usual considerations to make all data public.)
Dear @ChrisHIV,
Thanks for your kindly help. Could you please send me the code, very interested about this, thanks again.
The scientific process underlying that figure was done through (a) the BEAST software, which the user does not control by code but through a GUI; (b) a single command-line call to the IQtree software, which I'm afraid I didn't save exactly. The paper describes the options used for both processes.
My visualisation code used here is scattered over various files, badly documented, using data you don't have access to, so I can't usefully share. Manual pdf editing then generated the final results. I can describe some of the steps included in the code if you're looking for hints to create similar-looking trees with other data.
First these functions implemented the ancestral state reconstruction of region that was used for the tree that's internally coloured.
library(ape)
library(phangorn)
library(tidyverse)
library(phyloscannerR)
library(ggtree)
library(RColorBrewer)
sankoff.up <- function(tree, node, tip.states, existing = list()){
unique.characters <- sort(unique(tip.states))
if(is.tip(tree, node)){
scores <- rep(Inf, length(unique.characters))
scores[unique.characters == tip.states[node]] <- 0
} else {
kids <- Children(tree, node)
for(kid in kids){
existing <- sankoff.up(tree, kid, tip.states, existing)
}
scores <- map_dbl(1:length(unique.characters), function(x){
sum(map_dbl(kids, function(kid){
kiddy.scores <- existing[[kid]]
min(map_dbl(1:length(unique.characters), function(y){
if(x==y){
0 + kiddy.scores[y]
} else {
1 + kiddy.scores[y]
}
}))
}))
})
}
existing[[node]] <- scores
existing
}
sankoff.down <- function(tree, node, state.vec, scores, out.vec = rep(NA, length(a.tree$tip.label) + a.tree$Nnode)){
state.vec <- sort(state.vec)
current.scores <- scores[[node]]
if(is.root(tree, node)){
out.vec[node] <- state.vec[which(current.scores==min(current.scores))[1]]
} else {
parent = Ancestors(tree, node, type="parent")
parent.state = out.vec[parent]
current.scores[which(state.vec != parent.state)] <- current.scores[which(state.vec != parent.state)] + 1
out.vec[node] <- state.vec[which(current.scores==min(current.scores))[1]]
}
if(!is.tip(tree, node)){
for(kid in Children(tree, node)){
out.vec <- sankoff.down(tree, kid, state.vec, scores, out.vec)
}
}
out.vec
}
Then for the tree with reconstructed region I have
# Read in the tree file into 'tree' variable
# Merge the regions data into the tree
tree <- as_tibble(tree)
tree <- left_join(tree, df.regions, by = "ID_new")
# This appears to be necessary to get node date uncertainties to display
# correctly: offset all their uncertainty boundaries by the midpoint.
# Also offsetting a vector of lists can't be vectorised >:(
tree$height_0.95_HPD_offset <- vector(mode="list", length = 213)
for (i in 1:nrow(tree)) {
tree$height_0.95_HPD_offset[[i]] <- tree$height_0.95_HPD[[i]] - tree$height[[i]]
}
tree <- as.treedata(tree)
# Now delete all the node date uncertainties except the one at the root, ready for displaying the bar only for the node:
tree@data[tree@data$node != getMRCA(tree@phylo, tree@phylo$tip.label), ]$height_0.95_HPD_offset <- NA
# Make a vector of regions in the same order as the tip labels
df.regions.vec <- df.regions$HospitalRegion
names(df.regions.vec) <- df.regions$ID_new
df.regions.vec <- df.regions.vec[na.omit(tree@data$ID_new)]
# Ancestral reconstruction of region for internal nodes of the tree
region.reconstruction.scores <- sankoff.up(tree@phylo, getRoot(tree@phylo), df.regions.vec, list())
region.reconstruction <- sankoff.down(tree@phylo, getRoot(tree@phylo), unique(df.regions.vec),
region.reconstruction.scores,
rep(NA, length(tree@phylo$tip.label) + tree@phylo$Nnode))
region.reconstruction <- factor(region.reconstruction, ordered = T,
levels = c("Amsterdam", "Netherlands, N",
"Netherlands, E", "Netherlands, S",
"Netherlands, W",
"Netherlands, unknown", "Belgium",
"Switzerland"))
# Plot the tree
tree.display <- ggtree(tree, size=0.5, right=F, mrsd='2014-08-04',
aes(col = region.reconstruction)) +
theme_tree2() + theme_minimal() +
theme(legend.position = "right",
legend.title=element_blank(),
panel.grid.major.x = element_line(color = "grey85", size=0.5),
panel.grid.minor.x = element_blank(), #element_line(color = "grey85", size=0.25),
axis.title=element_blank(),
axis.text=element_blank(),
axis.ticks=element_blank(),
panel.grid.major.y = element_blank(),
panel.grid.minor.y = element_blank(),
legend.text=element_text(size=14)) +
geom_rootedge(rootedge = 3, color="#E41A1C") +
#scale_x_continuous(minor_breaks = c(1997.5, 2002.5, 2007.5, 2012.5)) +
coord_cartesian(xlim = c(earliest.year, 2015)) +
#geom_range(range='height_0.95_HPD_offset', color='gray', alpha=.7, size=1.5) +
#geom_tippoint(aes(fill=HospitalRegion), shape = 21, size = 1) +
#scale_fill_brewer(palette= "Set1") +
#geom_tippoint(aes(colour=HospitalRegion), size = 1) +
scale_color_brewer(palette= "Set1") +
#geom_tiplab(aes(label=ID_new, fill=factor(HospitalRegion)),
# size = 2, color = "black", geom = "label",
# label.padding = unit(0.02, "lines"), label.size = 0) +
theme(aspect.ratio=1)
For the tree with viral loads I have
# Read in the tree file into 'tree' variable
# Read the viral loads, used for colouring the tips, into 'df.vl' variable
tree <- as_tibble(tree)
tree <- left_join(tree, df.vl)
p <- ggtree(as.treedata(tree), layout="circular", right=T) + theme_tree(legend.position = "right") + #aes(color=group)
#scale_color_manual(values=c(color.not.lin, color.lin)) +
xlim(-0.02, NA) +
geom_treescale(x=0.05, y=101.5, offset=3, width=0.05, fontsize = 6) +
geom_tippoint(aes(fill=`log10(viral load)`), shape = 21, size = 4) +
scale_fill_viridis(name = expression('log'[10]*'(viral load)')) +
theme(legend.text=element_text(size=15),
legend.title=element_text(size=17))
Dear @ChrisHIV,
Really cool! That's help me a lot. Thanks.
BTW, could you please tell me how to integrate violin plots into the tree? Thanks.
That was manual pdf editing: I created a separate file with the violin plot, including two grey vertical lines at the same x axis values (i.e. calendar times) that the root node of the tree sits in between, imported the violin into the tree file, and moved & stretched it until the pairs of grey lines overlapped i.e. manually aligning the two x axes
Dear @ChrisHIV,
Cool! That's a great idea!
I plot my data like the below previously:
Dear @ChrisHIV,
Could you please tell me where to find the data and code for Fig.2? Thanks.