Closed stevenweaver closed 9 months ago
Hi @stevenweaver sorry for the delayed response! We'll look into it.
Ok I've reproduced the problem on my machine
While I'm troubleshooting this @stevenweaver can you please try using https://github.com/PoonLab/tn ? As I recall, tn
the original code that we used for the studies, and clustuneR
was @ConnorChato refactoring of tn
to make it more user friendly. Alternatively, you might try installing an older release of clustuneR
- we may have broken something since then.
I just confirmed that the example code for https://github.com/PoonLab/tn runs as expected
The exception:
Error in eval(predvars, data, env) : object 'growth' not found
is thrown here: https://github.com/PoonLab/clustuneR/blob/353d3fae1357de6eff597c61713e13e9465c6738/R/analysis.R#L84-L89
It looks like we are asking for a growth
variable that is not in the data frame:
> predictive.models
$nullmodel
function(x){
glm(growth~size, data=x, family="poisson")
}
<bytecode: 0x55d9130da9a0>
$timemodel
function(x){
glm(growth~size+coldate, data=x, family="poisson")
}
It looks like this might be a case sensitivity issue:
> head(model.data[SetID==1,])
Header Size Growth SetID RangeID coldate
1: KU190002.1_2008-06-26_subA 1 0 1 0 14056
2: KU190014.1_2010-03-05_subA 1 0 1 0 14673
3: KU190024.1_2012-03-22_subA 1 0 1 0 15421
4: KU190032.1_2007-03-23_subB 1 0 1 0 13595
5: KU190038.1_2007-11-28_subB 1 0 1 0 13845
6: KU190064.1_2007-12-03_subB 1 0 1 0 13850
Problem was resolved as follows:
p.models = list(
"nullmodel" = function(x){
glm(Growth~Size, data=x, family="poisson")
},
"timemodel" = function(x){
glm(Growth~Size+coldate, data=x, family="poisson")
}
)
The next two lines also have case issues:
> aics <- get.aic(res)
Error in get.aic(res) : could not find function "get.aic"
> aics <- get.AIC(res)
> delta.aic <- aics$timemodelaic - aics$nullmodelaic
> delta.aic
integer(0)
> delta.aic <- aics$timemodelAIC - aics$nullmodelAIC
> head(delta.aic)
[1] 2.0000000 -0.9945144 0.3049146 -0.9269326 -0.4059748 1.3859614
I was going to go correct the README file but it none of these case differences appear there. Where did you get that example script @stevenweaver ?
@ConnorChato we have a different problem - the example script is not generating the same delta.AIC
profile as what is shown in the current README document:
This seems to be the situation:
tn
repository, documented example usage and linked the tn
repo in the paper. These examples still work as expected. The confusing thing is that the README for clustuneR
had ALSO been changed to focus on phylogenetic clustering, e.g., the example code that @stevenweaver was trying to reproduce.I think we need to take the following actions:
tn
with a release and tag associated with the phylogenetic clustering paper.tn
repo and make sure that we can reproduce results from the first paper.clustuneR
as a guide to refactoring and cleaning up code in the tn
repo, making sure to keep these changes in a separate branch. We might even absorb the code in clustuneR
into a separate branch within tn
.clustuneR
repo to tn
.Okay I was able to reproduce results from the 2020 paper with the clustuneR code at tag v1.0
, see issue #13
Hi Art - catching up on this. Thank you for quickly working on the fix. It is looking like our documentation was maybe out of date with the code itself. I can take some time to retrace my steps and write up a side-by-side of where the code and documentation differ to aid in a README.md update if that's needed.
Yeah we really need to update the README so that there are instructions for running either graph component clustering or phylogenetic clustering. As I recall a while back, tn
and clustuneR
were giving us discordant results on the same data - I'm not sure whether this is still the case.
Working from tn
, current master
branch, I am not reproducing results from the 2020 paper (note sttn93.txt
was retrieved from Erasmas backup at /home/cchato/Data/Seattle/
:
setwd("~/git/tn")
source("comp_Lib.R")
iG <- impTN93(iFile="data2020/Seattle/sttn93.txt", reVars='_', varInd=c(1,2),
dateFormat="%Y", partQ=1-0.06819591)
# this partQ yields 110 new cases sampled in 2012
maxDs <- seq(0, 0.04, 0.001)
res <- GAICRun(g, maxDs=maxDs, nCores=4, plotGAIC = TRUE)
par(mar=c(5,5,1,1))
plot(maxDs, res$GAIC, xlab="Thresholds", ylab="AIC Loss", type='b')
abline(h=0, lty=2)
@SandeepThokala can you please investigate what the differences are between comp_Lib.R
in tn
and MountainPlot.R
from the v1.0
release of clustuneR
?
Comparing impTN93
and gaicRun
functions from both repositories.
clustuneR
repository's comp_Lib.R
script has impTN93
function, which is reading in the input file as a csv
impTN93 <- function(iFile, minNS=63){
idf <- read.csv(iFile, stringsAsFactors = F)
tn
repository's impTN93
function is using fread
for reading input file and is only used in a couple of deprecated files.
impTN93 <- function(iFile, reVars="_", varInd=c(1, 2), addVarN=NA, dateFormat="%Y", partQ=0.95) {
idt <- fread(iFile)
clustuneR
repository's gaicRun
function returns
#@return: A data frame of each runs cluster information (clusterAnalyze output)
tn
repository's GAICRun
function returns
#@return: A data table of each runs cluster information.
# Both null and proposed model AIC values, as well as the AIC loss ($nullAIC, $modAIC and $GAIC)
# The max size, average size and number of singletons ($SizeMax, $MeanSize and $Singletons)
# The total growth, largest growth and number of growing clusters ($GrowthTot, $GrowthMax, and $nGrowing)
# The ID of the largest cluster and the cluster with the highest growth ($SizeMaxID and $GrowthMaxID)
# The effect ratio of mean recency in growing clusters over mean recency in non-growing clusters ($xMag)
To reiterate, we should be able to regenerate results from the 2020 paper using current code, e.g.,
https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R
using MountainPlot.R
from release v1.0
for reference
Of course we also need to update the README
to give examples for both phylogenetic and graph component clustering
We should update the README with instructions for doing a component clustering analysis:
impTN93
)gaicRun
)gaicRun
)gaicRun
and generate plotI am currently working on updating the README with the component based clustering. My most recent walkthrough of the code we have there for the Tree-Based clustering method did produce the expected results and matched the figure in the README.
Okay @ConnorChato, I'll wait until you've committed your changes before doing anything drastic!
Working on two things in this PR https://github.com/PoonLab/clustuneR/pull/14.
Ok I spent a couple of days reviewing and revising code in a new branch iss12
(commit d5fd56b552283c997bd86fd029c1b15228a82a24) that I derived from the component_example
branch, and managed to generate a delta-AIC plot that looks almost exactly like the MountainPlot example above (I think from the same data):
setwd('~/git/clustuneR')
source("R/sequence.setup.R")
source("R/graph.setup.R")
source("R/graph.clustering.R")
source("R/analysis.R")
edge.info <- read.csv("data/example_tn93.txt")
headers <- unique(c(edge.info$ID1, edge.info$ID2))
seq.info <- parse.headers(headers, var.names=c("ID", "colyear"))
which.new <- (seq.info$colyear == max(seq.info$colyear))
obj <- create.graph(seq.info, edge.info, which.new)
param.list <- list()
vals <- seq(0, 0.04, 0.001)
for (i in 1:length(vals)) {
param.list[[i]] <- list(dist.thresh=vals[i])
}
cluster.data <- multi.cluster(obj, param.list, component.cluster)
ptrans <- list(colyear=mean)
pmods <- list(
"NullModel" = function(x) {
glm(Growth~Size, data=x, family="poisson")
},
"AltModel" = function(x) {
glm(Growth ~ Size + colyear, data=x, family="poisson")
})
model.fits <- fit.analysis(cluster.data, ptrans, pmods)
aic <- get.AIC(model.fits, param.list)
par(mar=c(5,5,1,1))
plot(aic$dist.thresh[aic$model=='NullModel'],
aic$AIC[aic$model=='NullModel'], type='l',
xlab="Distance threshold", ylab="AIC")
lines(aic$dist.thresh[aic$model=='AltModel'],
aic$AIC[aic$model=='AltModel'], col='red')
plot(aic$dist.thresh[aic$model=='NullModel'],
aic$AIC[aic$model=='AltModel'] - aic$AIC[aic$model=='NullModel'],
type='l', xlab="Distance threshold", ylab="Delta AIC")
abline(h=0, lty=2)
I changed some of the internal data structures to my liking (especially cutting out all the tidyverse code).
Next steps are to revise the tree clustering code to use the same data structures, and write some dang unit tests!
Unit tests are a great idea for this. I think some can be harvested from the sanity checking I was doing in this round of fixes. @ArtPoon was the input data/example_tn93.txt
for the above plot built from the example Alberta Data?
@SandeepThokala can you please help me write unit tests for the following R scripts in the iss12
branch?
sequence-setup.R
graph-setup.R
graph-clustering.R
I retrieved original data files (TN93 CSV outputs) from Erasmas. The following script reproduces the delta-AIC profile for the Tennessee data set:
require(dplyr)
source("MountainPlot/comp_Lib.R")
# run through v1.0 (MountainPlot)
iFile <- "data/chato2020/tn93NAsubB.txt"
iG <- impTN93(iFile) # this takes a while
steps <- head(hist(subset(iG$e, Distance<0.05)$Distance, plot=FALSE)$breaks, -5)
cutoffs <- seq(0 , max(steps), max(steps)/50)
#A set of several graphs created at different cutoffs
gs <- lapply(cutoffs, function(d) {dFilt(iG, d)})
#Generate cluster data for each subGraph in gs
res <- lapply(gs, function(subG) {compAnalyze(subG)})
names(res) <- cutoffs
gaics <- sapply(res, function(x) {x$gaic})
modAIC <- sapply(res, function(x) {x$ageFit$aic})
nullAIC <- sapply(res, function(x) {x$nullFit$aic})
plot(cutoffs, gaics, type='o', pch=19, cex=0.8)
abline(a=0, b=1, lty=2)
Running the same data through the current clustuneR
code yields a different profile:
edge.info <- read.csv("data/chato2020/tn93NAsubB.txt")
source("R/sequence.setup.R")
headers <- unique(c(edge.info$ID1, edge.info$ID2))
length(headers)
seq.info <- parse.headers(headers, sep="_", var.names = c("id", "colyear"),
var.transformations = c(as.character, as.integer))
which.new <- seq.info$colyear == max(seq.info$colyear)
sum(which.new) # n=153 new cases in 2015
source("R/graph.setup.R")
obj <- read.edges(seq.info, edge.info, which.new)
source("R/graph.clustering.R")
source("R/analysis.R")
param.list <- list()
vals <- seq(0, 0.04, 8e-4)
for (i in 1:length(vals)) {
param.list[[i]] <- list(dist.thresh=vals[i])
}
cluster.data <- multi.cluster(obj, param.list, component.cluster)
# total growth (number of new cases in clusters)
plot(sapply(split(cluster.data$Growth, cluster.data$SetID), sum), type='s')
# number of growing clusters
plot(sapply(split(cluster.data$Growth>0, cluster.data$SetID), sum),
pch=19, cex=0.8)
ptrans <- list(colyear=mean)
pmods <- list(
"NullModel" = function(x) {
glm(Growth ~ Size, data=x, family="poisson")
},
"AltModel" = function(x) {
glm(Growth ~ Size + colyear, data=x, family="poisson")
})
model.fits <- fit.analysis(cluster.data, ptrans, pmods)
aic <- get.AIC(model.fits, param.list)
# add to existing plot
lines(aic$dist.thresh, aic$AltModel-aic$NullModel, type='o', pch=19, cex=0.8, col='red')
abline(h=0, lty=2)
I think a good place to start investigating the cause is to drill down on the distribution of new cases among clusters for a given distance threshold. subG
is an object created by code in v1.0 comp_Lib.R
, while cluster.data
is a data.table
object produced by the current main
branch on the same data.
subG
is a list containing the following values:
v
: data frame of vertices (ID, Time, Weight, Cluster)e
: data frame of edges given cutofff
: data frame, proportion of edges from time point t1 to t2 below distance thresholdg
: integer, number of new cases (growth) per clusterc
: integer, number of known cases per clustergaic
: numeric, stores the difference in AICageMod
: glm, fit of bipartite edge density as binomial outcomeageFit
: glm, alternate Poisson model fitnullFit
: glm, baseline Poisson model fitThis reveals that the distribution of new cases among clusters (cluster growth) is different:
> subG <- res[[10]]
> cd10 <- cluster.data[cluster.data$SetID==10,]
> dim(cd10)
[1] 524 10
> table(cd10$Size)
1 2 3 4 5 6 8 9 10 11
442 51 16 2 2 4 3 1 2 1
> table(cd10$Growth)
0 1 2 3 12
498 22 2 1 1
> table(subG$c)
1 2 3 4 5 6 8 9 10 11
442 51 16 2 2 4 3 1 2 1
> table(subG$g)
0 1 2 3 12
496 22 3 2 1
> table(subG$c, subG$g)
0 1 2 3 12
1 429 12 1 0 0
2 42 8 0 1 0
3 14 1 1 0 0
4 2 0 0 0 0
5 2 0 0 0 0
6 4 0 0 0 0
8 2 0 1 0 0
9 1 0 0 0 0
10 0 1 0 0 1
11 0 0 0 1 0
> table(cd10$Size, cd10$Growth)
0 1 2 3 12
1 430 12 0 0 0
2 43 8 0 0 0
3 14 1 1 0 0
4 2 0 0 0 0
5 2 0 0 0 0
6 4 0 0 0 0
8 2 0 1 0 0
9 1 0 0 0 0
10 0 1 0 0 1
11 0 0 0 1 0
This is pretty bad, the number of new cases joining clusters is discordant - @ConnorChato noted that this might be the problem in the PR:
> sum(cd10$Growth)
[1] 41
> sum(subG$g)
[1] 46
Something strange going on with TN93. I noticed that some of the accession numbers for the NA data were truncated in the files I had retrieved from Erasmas, so I went back to the naFull.fasta
alignment and regenerated a TN93 CSV file. This is what I get when I plot the sorted distances (y) against the sorted original (x):
Basically they are identical for values below 0.09 or so, and then veer off wildly. Funny thing is that both Erasmas and Orolo are running the same version of TN93 (v1.0.6)
Using new TN93 data gives similar but slightly altered results:
Repeating above with revised TN93 data:
> subG <- res[[10]]
> cd10 <- cluster.data[cluster.data$SetID==10,]
> dim(cd10)
[1] 521 10
> table(cd10$Size)
1 2 3 4 5 6 8 9 10 16
441 49 16 2 2 4 3 1 2 1
> table(cd10$Growth)
0 1 2 3 12
495 22 2 1 1
> table(subG$c)
1 2 3 4 5 6 8 9 10 16
441 49 16 2 2 4 3 1 2 1
> table(subG$g)
0 1 2 3 12
493 22 3 2 1
> table(cd10$Size, cd10$Growth)
0 1 2 3 12
1 429 12 0 0 0
2 41 8 0 0 0
3 14 1 1 0 0
4 2 0 0 0 0
5 2 0 0 0 0
6 4 0 0 0 0
8 2 0 1 0 0
9 1 0 0 0 0
10 0 1 0 0 1
16 0 0 0 1 0
> table(subG$c, subG$g)
0 1 2 3 12
1 428 12 1 0 0
2 40 8 0 1 0
3 14 1 1 0 0
4 2 0 0 0 0
5 2 0 0 0 0
6 4 0 0 0 0
8 2 0 1 0 0
9 1 0 0 0 0
10 0 1 0 0 1
16 0 0 0 1 0
Exporting edge list for rendering with Graphviz:
> foo <- edge.info[edge.info$Distance < cutoffs[10],]
> dim(foo)
[1] 502 3
> head(foo)
ID1 ID2 Distance
4 KU190152.1_2008-12-09_subB KU190153.1_2008-12-09_subB 0.001970860
269 KU190360.1_2009-12-22_subB KU190378.1_2010-02-09_subB 0.006938390
369 KU190256.1_2009-03-17_subB KU190259.1_2009-03-24_subB 0.002965970
838 KU190528.1_2011-08-11_subB KU190541.1_2011-09-08_subB 0.000000000
2055 KU190495.1_2011-04-07_subB KU190535.1_2011-08-22_subB 0.003944520
3350 KU190495.1_2011-04-07_subB KU190582.1_2011-05-11_subB 0.000985765
> temp <- sapply(foo$ID1, function(x) strsplit(x, '-')[[1]][1])
> head(temp)
KU190152.1_2008-12-09_subB KU190360.1_2009-12-22_subB KU190256.1_2009-03-17_subB
"KU190152.1_2008" "KU190360.1_2009" "KU190256.1_2009"
KU190528.1_2011-08-11_subB KU190495.1_2011-04-07_subB KU190495.1_2011-04-07_subB
"KU190528.1_2011" "KU190495.1_2011" "KU190495.1_2011"
> foo$ID1 <- sapply(foo$ID1, function(x) strsplit(x, '-')[[1]][1])
> foo$ID2 <- sapply(foo$ID2, function(x) strsplit(x, '-')[[1]][1])
> write.csv(foo, "iss12.edges.csv", row.names=F)
import csv
max_year = '2013'
reader = csv.DictReader(open("iss12.edges.csv"))
outfile = open("iss12.dot", 'w')
outfile.write("graph iss12 {\n\toutputorder=\"edgesfirst\";\n\tnode [shape=rect style=filled];")
all_ids = set()
for row in reader:
# write edges
id1, year1 = row['ID1'].split('_')
if id1 not in all_ids:
all_ids.add(id1)
outfile.write(f"\t\"{id1}\" [fillcolor=\"{'red' if year1==max_year else 'white'}\"];\n")
id2, year2 = row['ID2'].split('_')
if id2 not in all_ids:
all_ids.add(id2)
outfile.write(f"\t\"{id2}\" [fillcolor=\"{'red' if year2==max_year else 'white'}\"];\n")
dist = float(row['Distance'])
outfile.write(f"\t\"{id1}\"--\"{id2}\" [len={0.7+200*dist}];\n")
outfile.write("}\n")
outfile.close()
Investigating the cluster of 2 that should grow by 3 first:
> which(subG$c==2 & subG$g==3)
384
384
> subG$v[subG$v$Cluster==384, ]
ID Time Weight Cluster
42107 KU190552.1 2011 0.09721059 384
106959 KU190594.1 2011 0.09721059 384
207857 KU190733.1 2013 0.05985794 384
215766 KU190774.1 2013 0.05985794 384
224750 KU190811.1 2013 0.05985794 384
> which(grepl("KU190552", cd10$id))
[1] 96
> cd10[96,]
ClusterID Header
1: 96 KU190552.1_2011-12-01_subB,KU190594.1_2011-07-13_subB
id colyear Cluster Size Growth DistThresh SetID RangeID
1: KU190552.1,KU190594.1 2011,2011 96 2 0 0.0072 10 0
This cluster is definitely in the data - excerpt from DOT graph generated above (filtering edges on distance threshold of 0.0072):
Note I had to modify v1.0 comp_Lib.R
a bit to handle the new headers:
3c3
<
---
> require(lubridate)
13a14
>
20,21c21,22
< el <- data.frame(ID1=as.character(temp1), t1=as.numeric(temp2),
< ID2=as.character(temp3), t2=as.numeric(temp4),
---
> el <- data.frame(ID1=as.character(temp1), t1=year(as.Date(temp2)),
> ID2=as.character(temp3), t2=year(as.Date(temp4)),
Step through component.cluster
to see where these edges are disappearing:
> dist.thresh <- 0.0072
> edge.info <- read.csv("data/chato2020/naFull-subB.tn93.csv")
> headers <- unique(c(edge.info$ID1, edge.info$ID2))
> seq.info <- parse.headers(headers, sep="_", var.names = c("id", "coldate", "subtype"),
+ var.transformations = c(as.character, as.Date, as.factor))
> seq.info$subtype <- NULL
> seq.info$colyear <- year(seq.info$coldate)
> seq.info$coldate <- NULL
> which.new <- seq.info$colyear == max(seq.info$colyear)
> obj <- read.edges(seq.info, edge.info, which.new)
> filtered.edges <- obj$edge.info[obj$edge.info$Distance <= dist.thresh, ]
> which(obj$seq.info$id=="KU190552.1")
[1] 104
> filtered.edges[which(filtered.edges$ID1==104),]
ID1 ID2 Distance
43515 104 246 0.00295946
> filtered.edges[which(filtered.edges$ID2==104),]
[1] ID1 ID2 Distance
<0 rows> (or 0-length row.names)
> obj$seq.info[246,]
Header id colyear New
1: KU190594.1_2011-07-13_subB KU190594.1 2011 FALSE
> subG$e[which(subG$e$ID1=="KU190594.1"),] # the other known case in this cluster
ID1 t1 ID2 t2 Distance tMax tDiff
6604 KU190594.1 2011 KU190733.1 2013 0.00594669 2013 2
68653 KU190594.1 2011 KU190774.1 2013 0.00395159 2013 2
7334 KU190594.1 2011 KU190811.1 2013 0.00593821 2013 2
> subG$e[which(subG$e$ID2=="KU190594.1"),]
ID1 t1 ID2 t2 Distance tMax tDiff
43515 KU190552.1 2011 KU190594.1 2011 0.00295946 2011 0
> which(obj$seq.info$id=="KU190594.1")
[1] 246
> filtered.edges[which(filtered.edges$ID2==246),]
ID1 ID2 Distance
43515 104 246 0.00295946
> filtered.edges[which(filtered.edges$ID1==246),]
[1] ID1 ID2 Distance
<0 rows> (or 0-length row.names)
OK, so filtered.edges
is supposed to contain edges to new cases but it does not. There's something wrong in minimum.retrospective.edge
.
Found the bug. minimum.retrospective.edge
is including edges between new cases for its selection:
> which(seq.info$id %in% c("KU190733.1", "KU190774.1", "KU190811.1"))
[1] 525 564 620
> # extract row index for shortest edge for each new node
> min.retro.edges <- sapply(new.seqs, function(new.seq) {
+ my.edges <- c(which(obj$edge.info$ID1 == new.seq | obj$edge.info$ID2 == new.seq))
+ my.subset <- obj$edge.info[my.edges, ]
+ my.edges[which.min(my.subset$Distance)]
+ })
> mins <- obj$edge.info[min.retro.edges,]
> mins[mins$ID2 %in% c(525, 564, 620),]
ID1 ID2 Distance
209815 525 620 0.00098504
216730 564 620 0.00197086
209815.1 525 620 0.00098504
This fixes the size/growth discrepancy:
> cd10 <- cluster.data[cluster.data$SetID==10,]
> table(cd10$Size, cd10$Growth)
0 1 2 3 12
1 428 12 1 0 0
2 40 8 0 1 0
3 14 1 1 0 0
4 2 0 0 0 0
5 2 0 0 0 0
6 4 0 0 0 0
8 2 0 1 0 0
9 1 0 0 0 0
10 0 1 0 0 1
16 0 0 0 1 0
cluster growth trends are the same
plot(cutoffs, sapply(res, function(x) sum(x$g)), type='l',
ylab="Number of new cases in clusters")
lines(cutoffs, sapply(split(cluster.data$Growth, cluster.data$DistThresh), sum), col='red')
number of active clusters is the same:
plot(cutoffs, sapply(res, function(x) sum(x$g>0)), type='l',
ylab="Number of active clusters")
lines(cutoffs, sapply(
split(cluster.data$Growth, cluster.data$DistThresh),
function(x) sum(x>0)
), col='red')
but dAIC profiles are still different:
> par(mar=c(5,5,1,1))
> plot(cutoffs, gaics, type='o', pch=19, cex=0.8)
> abline(a=0, b=1, lty=2)
> lines(aic$dist.thresh, aic$AltModel-aic$NullModel, col='red', pch=19, cex=0.8,
+ type='o')
Next place to look is the regression model fits. These are already different:
> model.fits$NullModel[10]
[[1]]
Call: glm(formula = Growth ~ Size, family = "poisson", data = x)
Coefficients:
(Intercept) Size
-3.2347 0.3402
Degrees of Freedom: 520 Total (i.e. Null); 519 Residual
Null Deviance: 304.4
Residual Deviance: 216.1 AIC: 282.3
> res[[10]]$nullFit
Call: glm(formula = Growth ~ Pred, family = "poisson", data = df2)
Coefficients:
(Intercept) Pred
-3.235 5.161
Degrees of Freedom: 520 Total (i.e. Null); 519 Residual
Null Deviance: 304.4
Residual Deviance: 216.1 AIC: 282.3
Note the AIC values are the same but the coefficients are different. This is because we used to apply weights to cluster sizes:
> fit1 <- res[[10]]$nullFit
> fit2 <- model.fits$NullModel[[10]]
> table(fit2$data$Size)
1 2 3 4 5 6 8 9 10 16
441 49 16 2 2 4 3 1 2 1
> table(fit1$data$Pred)
0.0659025787965616 0.131805157593123 0.197707736389685 0.263610315186246
441 49 16 2
0.329512893982808 0.39541547277937 0.527220630372493 0.593123209169054
2 4 3 1
0.659025787965616 1.05444126074499
2 1
v1.0 GLMs are fit here: https://github.com/PoonLab/clustuneR/blob/ef294dcf1f4121f5ec389257a24faaf403da4a4f/scripts/comp_Lib.R#L238-L242
df2
is the null model, where growth is predicted from the cluster size as a proportion of the total number of known cases. Since the regression is over clusters, the total numbers are irrelevant and the same regression fit is obtained from glm(Growth ~ c)
.
However, something goes awry when we add cluster recency. In v1.0, we derived a predictor variable from the expected decay in the probability of an edge between cases separated by time $\delta t$ (see Chato et al. 2020, equations 3 and 4). In the current branch we are taking the collection year directly as a predictor variable.
The expected edge probability is not simply a transformation of collection year because it is a function of the distance threshold.
I'm going to peel off the implementation of the edge probability model into another issue, and redirect the focus of this issue to fixing the tree clustering.
####################
# TREE CLUSTERING
# reproduce with old code
# https://github.com/PoonLab/tn/tree/master#readme
setwd("~/git/tn")
source("pplacer_utils.R")
input.tree <- "data/na.nwk"
tree.log <- "data/na.log"
full.alignment <- "data/na.fasta"
out.ref <- "data/na.refpackage"
program <- "IQ-TREE"
taxitCreate(input.tree, tree.log, full.alignment, out.ref, program)
pplacer_guppy(out.ref, "data/")
source("subT_Lib.R")
date.format <- "%Y-%m-%d"
var.indices <- c(1,2,3)
sep <- '_'
extra.vars = c("Subtype")
growth.info <- "data/na_growth.tree"
extended.tree <- import.tree(input.tree, var.indices, date.format, sep, extra.vars)
extended.tree$g <- growthSim(extended.tree, growth.info)
distance.thresholds <- seq(0, 0.04, 0.001)
bootstrap.requirements <- 0.95
cores <- 4
clustering.function <- STClu
clus <- multiSTClu(extended.tree, distance.thresholds, bootstrap.requirements,
cores, clustering.function)
proposed.var <- "Time"
runID <- 0
proposed.formula <- New~Old+Time
var.translation <- list(function(x){mean(x)})
v1res <- GAICRun(clus, runID, cores, proposed.formula, proposed.var, var.translation)
plot(distance.thresholds, v1res$GAIC, type='l')
######################
# re-run with current code
require(ape)
require(lubridate)
setwd("~/git/clustuneR")
seqs <- ape::read.FASTA("~/git/tn/data/na.fasta")
source("R/sequence.setup.R")
seq.info <- parse.headers(
names(seqs), sep="_", var.names=c('accession', 'coldate', 'subtype'),
var.transformations=c(as.character, as.Date, as.factor))
max.year <- max(year(seq.info$coldate))
old.seqs <- seqs[year(seq.info$coldate) < max.year]
phy <- ape::read.tree("data/na.nwk")
source("R/tree.setup.R")
phy.extend <- extend.tree(phy, seq.info, seqs, log.file="data/na.log")
param.list <- lapply(seq(0, 0.04, 0.001), function(x) list(branch.thresh=x, boot.thresh=0.95))
source("R/tree.clustering.R")
source("R/analysis.R")
cluster.sets <- multi.cluster(phy.extend, param.list, step.cluster)
p.models = list(
"NullModel" = function(x) glm(Growth~Size, data=x, family="poisson"),
"TimeModel" = function(x) glm(Growth~Size+coldate, data=x, family="poisson")
)
p.trans = list("coldate" = mean)
v2res <- fit.analysis(cluster.sets, predictive.models = p.models,
predictor.transformations = p.trans)
v2.gaic <- sapply(v2res$TimeModel, function(x) x$aic) -
sapply(v2res$NullModel, function(x) x$aic)
lines(seq(0, 0.04, 0.001), v2.gaic, col='red')
> cd10 <- cluster.sets[cluster.sets$SetID==10, ]
> head(cd10)
ClusterID Header accession coldate subtype Descendants
1: 1 KU189996.1_2008-06-12_subA KU189996.1 2008-06-12 subA 1
2: 2 KU189997.1_2008-09-09_subA KU189997.1 2008-09-09 subA 2
3: 3 KU189998.1_2008-09-10_subA KU189998.1 2008-09-10 subA 3
4: 4 KU189999.1_2008-11-04_subA KU189999.1 2008-11-04 subA 4
5: 5 KU190000.1_2008-11-04_subA KU190000.1 2008-11-04 subA 5
6: 6 KU190001.1_2008-05-06_subA KU190001.1 2008-05-06 subA 6
Size Growth BranchThresh BootThresh SetID RangeID
1: 1 0 0.009 0.95 10 0
2: 1 0 0.009 0.95 10 0
3: 1 0 0.009 0.95 10 0
4: 1 0 0.009 0.95 10 0
5: 1 0 0.009 0.95 10 0
6: 1 1 0.009 0.95 10 0
> cd10.old <- clus[clus$SetID==10, ]
> head(cd10.old)
ID Old New Membership MaxD MinB SetID Growing Time Subtype
1: 1 1 0 KU189996.1 0.009 0.95 10 FALSE 2008-06-12 subA
2: 9 1 0 KU190020.1 0.009 0.95 10 FALSE 2011-10-26 subA
3: 10 1 0 KU190026.1 0.009 0.95 10 FALSE 2012-09-05 subA
4: 13 1 0 KU190031.1 0.009 0.95 10 FALSE 2007-01-03 subB
5: 14 1 0 KU190389.1 0.009 0.95 10 FALSE 2010-03-02 subB
6: 24 1 0 KU190147.1 0.009 0.95 10 FALSE 2008-12-02 subB
> dim(cd10)
[1] 724 12
> dim(cd10.old)
[1] 599 10
Immediately we see a discordance in the number of rows in the respective data.table
objects for a given distance threshold (0.009).
Here's the breakdown of cluster size and cluster growth:
> table(cd10.old$Old, cd10.old$New)
0 1 2 3 9
1 494 14 1 1 0
2 63 2 0 0 0
3 10 2 1 0 0
4 2 0 0 0 0
6 3 2 0 0 0
7 0 1 0 0 0
8 1 0 0 0 0
10 0 1 0 0 0
157 0 0 0 0 1
> table(cd10$Size, cd10$Growth)
0 1 2 3 11
1 599 18 1 1 0
2 66 9 0 0 0
3 14 0 1 0 0
4 4 2 0 0 0
5 2 1 0 0 0
7 2 1 0 0 0
8 2 0 0 0 0
9 0 0 0 0 1
I guess let's start at the earliest step, the annotation of growing tips in the tree using pplacer
outputs:
> head(phy.extend$growth.info)
Header Bootstrap TermDistance PendantDistance Terminal
1: KU191034.1_2013-05-01_subC 1.000000 0.01825170 2.03131e-02 TRUE
2: KU190820.1_2013-10-21_subB 0.141822 0.00307456 1.09875e-06 FALSE
3: KU190820.1_2013-10-21_subB 0.142939 0.00307457 1.45385e-06 TRUE
4: KU190820.1_2013-10-21_subB 0.143128 0.00307457 1.29545e-06 FALSE
5: KU190820.1_2013-10-21_subB 0.142177 0.00307456 9.72928e-04 FALSE
6: KU190820.1_2013-10-21_subB 0.143317 0.00307457 1.45990e-06 FALSE
NeighbourNode Cluster
1: 720 720
2: 1404 1399
3: 480 480
4: 1403 1399
5: 1405 1399
6: 1402 1399
> head(extended.tree$g)
nID TermDist PenDist oConn Bootstrap
1: KU191034.1_2013-05-01_subC 0.01825170 2.03131e-02 720 1.000000
2: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06 1404 0.141822
3: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06 480 0.142939
4: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06 1403 0.143128
5: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04 1405 0.142177
6: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06 1402 0.143317
> all(extended.tree$g$nID == phy.extend$growth.info$Header)
[1] FALSE
> all(sort(extended.tree$g$nID) == sort(phy.extend$growth.info$Header))
[1] TRUE
> new.px <- phy.extend$growth.info[order(phy.extend$growth.info$Header), ]
> old.px <- extended.tree$g[order(extended.tree$g$nID), ]
> all(old.px$nID == new.px$Header)
[1] TRUE
> all(old.px$TermDist == new.px$TermDistance)
[1] FALSE
> plot(old.px$TermDist, new.px$TermDistance)
I thought the discrepancy might be due to random ordering of multiple graft sites for a given tip, but this does not appear to be the case:
> old.px[4:7,]
nID TermDist PenDist oConn Bootstrap
1: KU190029.1_2013-09-10_subA 0.0612241 0.00412592 907 0.1701670
2: KU190029.1_2013-09-10_subA 0.0613435 0.00607536 911 0.0926687
3: KU190029.1_2013-09-10_subA 0.0620539 0.00952974 906 0.0849338
4: KU190029.1_2013-09-10_subA 0.0523765 0.02610240 11 0.2086720
> old.px[old.px$nID=="KU190029.1_2013-09-10_subA",]
nID TermDist PenDist oConn Bootstrap
1: KU190029.1_2013-09-10_subA 0.0611702 6.74464e-06 909 0.0256374
2: KU190029.1_2013-09-10_subA 0.0604680 5.83353e-06 908 0.2477350
3: KU190029.1_2013-09-10_subA 0.0612240 3.19437e-03 910 0.1701850
4: KU190029.1_2013-09-10_subA 0.0612241 4.12592e-03 907 0.1701670
5: KU190029.1_2013-09-10_subA 0.0613435 6.07536e-03 911 0.0926687
6: KU190029.1_2013-09-10_subA 0.0620539 9.52974e-03 906 0.0849338
7: KU190029.1_2013-09-10_subA 0.0523765 2.61024e-02 11 0.2086720
> new.px[new.px$Header=="KU190029.1_2013-09-10_subA",]
Header Bootstrap TermDistance PendantDistance Terminal
1: KU190029.1_2013-09-10_subA 0.0256375 0.0611702 6.74464e-06 FALSE
2: KU190029.1_2013-09-10_subA 0.2477330 0.0604680 5.83353e-06 FALSE
3: KU190029.1_2013-09-10_subA 0.1701860 0.0612240 3.19987e-03 FALSE
4: KU190029.1_2013-09-10_subA 0.1701680 0.0612285 4.12592e-03 FALSE
5: KU190029.1_2013-09-10_subA 0.0926690 0.0613435 6.07536e-03 FALSE
6: KU190029.1_2013-09-10_subA 0.0849340 0.0620539 9.52974e-03 FALSE
7: KU190029.1_2013-09-10_subA 0.2086730 0.0523774 2.60981e-02 TRUE
NeighbourNode Cluster
1: 909 NA
2: 908 NA
3: 910 NA
4: 907 NA
5: 911 NA
6: 906 NA
7: 11 NA
> which(old.px$TermDist != new.px$TermDistance)
[1] 4 7 20 21 46 57 66 83 97 104 106 128 129 136 144 149 167 170 172
[20] 174 188 194 195 196 197 198 199 200 201 202 203 204 205 206 207 229 234 235
[39] 245 249 311 320 354 371 388 407 413 414 416 425 429 437 443 449 460 465 467
[58] 481 491 512 518 529 530 537 541 552 557 562 567
> x <- old.px$TermDist[old.px$nID=="KU190029.1_2013-09-10_subA"]
> y <- new.px$TermDistance[new.px$Header=="KU190029.1_2013-09-10_subA"]
> sort(x)
[1] 0.0523765 0.0604680 0.0611702 0.0612240 0.0612241 0.0613435 0.0620539
> sort(y)
[1] 0.0523774 0.0604680 0.0611702 0.0612240 0.0612285 0.0613435 0.0620539
Going to see if I can feed the pplacer
output from one analysis directly into the other instead of calling the program twice.
There does not appear to be any difference in outputs between runs of pplacer
:
> gs1 <- growthSim(extended.tree, "~/git/tn/data/na_growth.tree")
> gs2 <- growthSim(extended.tree, "~/git/clustuneR/data/chato2022/na_growth.tree")
> dim(gs1)
[1] 569 5
> dim(gs2)
[1] 569 5
> head(gs1)
nID TermDist PenDist oConn Bootstrap
1: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06 1404 0.141822
2: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06 480 0.142939
3: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06 1403 0.143128
4: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04 1405 0.142177
5: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06 1402 0.143317
6: KU190820.1_2013-10-21_subB 0.00307457 9.78450e-04 1407 0.143314
> head(gs2)
nID TermDist PenDist oConn Bootstrap
1: KU191034.1_2013-05-01_subC 0.01825170 2.03131e-02 720 1.000000
2: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06 1404 0.141822
3: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06 480 0.142939
4: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06 1403 0.143128
5: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04 1405 0.142177
6: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06 1402 0.143317
> all(sort(gs1$TermDist) == sort(gs2$TermDist))
[1] TRUE
> all(sort(gs1$PenDist) == sort(gs2$PenDist))
[1] TRUE
> # reordering is not stochastic outcome of `growthSim`
> gs2 <- growthSim(extended.tree, "~/git/clustuneR/data/chato2022/na_growth.tree")
> head(gs2)
nID TermDist PenDist oConn Bootstrap
1: KU191034.1_2013-05-01_subC 0.01825170 2.03131e-02 720 1.000000
2: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06 1404 0.141822
3: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06 480 0.142939
4: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06 1403 0.143128
5: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04 1405 0.142177
6: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06 1402 0.143317
Here I am bypassing the call to pplacer
by clustuneR
and feeding in the output from tn
:
> growth.info.from.tn <- annotate.growth(phy, growth.info.trees)
Warning message:
In annotate.growth(phy, growth.info.trees) :
Not all newly added sequences are noted in the seq.info of the tree
> head(growth.info.from.tn)
Header Bootstrap TermDistance PendantDistance Terminal
1: KU190820.1_2013-10-21_subB 0.141822 0.00307456 1.09875e-06 FALSE
2: KU190820.1_2013-10-21_subB 0.142939 0.00307457 1.45385e-06 TRUE
3: KU190820.1_2013-10-21_subB 0.143128 0.00307457 1.29545e-06 FALSE
4: KU190820.1_2013-10-21_subB 0.142177 0.00307456 9.72928e-04 FALSE
5: KU190820.1_2013-10-21_subB 0.143317 0.00307457 1.45990e-06 FALSE
6: KU190820.1_2013-10-21_subB 0.143314 0.00307457 9.78450e-04 FALSE
NeighbourNode
1: 1404
2: 480
3: 1403
4: 1405
5: 1402
6: 1407
> all(sort(gs1$TermDist) == sort(growth.info.from.tn$TermDist))
[1] TRUE
> all(sort(old.px$TermDist) == sort(growth.info.from.tn$TermDistance))
[1] TRUE
> all(sort(new.px$TermDist) == sort(growth.info.from.tn$TermDistance))
[1] FALSE
This suggests that there's something wrong with the extend.tree
function.
Running this again:
> setwd("~/git/tn")
> source("pplacer_utils.R")
> input.tree <- "data/na.nwk"
> tree.log <- "data/na.log"
> full.alignment <- "data/na.fasta"
> out.ref <- "data/na.refpackage"
> program <- "IQ-TREE"
> date.format <- "%Y-%m-%d"
> var.indices <- c(1,2,3)
> sep <- '_'
> extra.vars = c("Subtype")
> growth.info <- "data/na_growth.tree"
>
> extended.tree <- import.tree(input.tree, var.indices, date.format, sep, extra.vars)
Warning message:
In type.convert.default(addVars) :
'as.is' should be specified by the caller; using TRUE
> extended.tree$g <- growthSim(extended.tree, growth.info)
> setwd("~/git/clustuneR")
> seqs <- ape::read.FASTA("~/git/tn/data/na.fasta")
>
> source("R/sequence.setup.R")
> seq.info <- parse.headers(
+ names(seqs), sep="_", var.names=c('accession', 'coldate', 'subtype'),
+ var.transformations=c(as.character, as.Date, as.factor))
> phy <- ape::read.tree("data/na.nwk")
> # stepping through extend.tree()
> phy <- ape::read.tree("data/na.nwk")
> # Midpoint root for consistency and resolve multichotomies
> phy <- phytools::midpoint.root(phy)
> phy <- ape::multi2di(phy)
> var.names <- colnames(seq.info)
> # label new sequences
> seq.info$New <- !(seq.info$Header %in% phy$tip.label)
> phy$seq.info <- seq.info
> phy$node.info <- annotate.nodes(phy)
> phy$path.info <- annotate.paths(phy)
> growth.info.trees <- ape::read.tree("~/git/tn/data/na_growth.tree")
> phy$growth.info <- annotate.growth(phy, growth.info.trees)
Warning message:
In annotate.growth(phy, growth.info.trees) :
Not all newly added sequences are noted in the seq.info of the tree
> all(sort(extended.tree$g$TermDist) == sort(phy$growth.info$TermDistance))
[1] TRUE
> growth.info.trees <- ape::read.tree("~/git/clustuneR/data/chato2022/na_growth.tree")
> phy$growth.info <- annotate.growth(phy, growth.info.trees)
Warning message:
In annotate.growth(phy, growth.info.trees) :
Not all newly added sequences are noted in the seq.info of the tree
> all(sort(extended.tree$g$TermDist) == sort(phy$growth.info$TermDistance))
[1] TRUE
Made a lot of progress writing unit tests and revising code in main
branch.
This is a useful bit of code for viewing the tree:
require(ggfree)
L <- tree.layout(phy)
plot(L, cex=0.6)
points(L, pch=21, bg='white', cex=2)
text(L$edges$x1, L$edges$y0, label=L$edges$child)
text(x=(L$edges$x0+L$edges$x1)/2, y=L$edges$y0+0.1,
label=round(L$edges$length, 5), cex=0.7)
text(x=(L$edges$x0+L$edges$x1)/2, y=L$edges$y0-0.1,
label=L$nodes$label[L$edges$child], cex=0.7)
Current test results for divergence in graph clustering:
# v1.0 (MountainPlot) version
source("tests/testthat/helper_iss22.R")
iFile <- "data/chato2020/tn93NAsubB.txt"
iG <- impTN93(iFile) # note have to revert to as.numeric
cutoffs <- seq(0, 0.04, length.out=50)
res1 <- gaicRun(iG, cutoffs)
gaic1 <- sapply(res1, function(x) x$gaic)
# current main branch
edgelist <- read.csv(iFile)
headers <- unique(c(edgelist$ID1, edgelist$ID2))
seq.info <- parse.headers(headers, sep="_", var.names=c("accn", "colyear"),
var.transformations=c(as.character, as.integer))
which.new <- (seq.info$colyear == 2013)
obj <- read.edges(edgelist, seq.info, which.new)
times <- obj$seq.info$colyear
param.list <- lapply(1:length(cutoffs), function(i) {
list(dist.thresh=cutoffs[i], setID=i, time.var="colyear")
})
cluster.data <- multi.cluster(obj, param.list, component.cluster)
ptrans <- list("Weight"=sum)
pmods <- list(
"NullModel"=function(x) glm(Growth~Size, data=x, family="poisson"),
"AltModel"=function(x) glm(Growth~Weight, data=x, family="poisson")
)
res2 <- fit.analysis(cluster.data, transforms=ptrans, models=pmods)
gaic2 <- get.AIC(res2, param.list)
plot(cutoffs, gaic1, type='l', col='red')
abline(h=0, lty=2)
lines(cutoffs, gaic2$AltModel-gaic2$NullModel, col='blue')
Getting a lot closer...
Dear @ArtPoon and @ConnorChato,
I'm trying to use this package, but I'm running into some troubles.
This is the output when I try putting together the script in the README.
This is the code I'm using
Best, Steven