Discordant results between v1.0 and main branch

stevenweaver commented 1 year ago

Dear @ArtPoon and @ConnorChato,

I'm trying to use this package, but I'm running into some troubles.

This is the output when I try putting together the script in the README.

New names:
* `x` -> `x...1`
* `x` -> `x...2`
* `x` -> `x...3`
Warning message:
In annotate.growth(t, growth.info.trees, mc.cores = mc.cores) :
  Not all newly added sequences are noted in the seq.info of the tree
Error in eval(predvars, data, env) : object 'growth' not found
Calls: fit.analysis -> lapply -> FUN -> lapply -> FUN
Execution halted
164.36user 21.23system 1:16.40elapsed 242%CPU (0avgtext+0avgdata 3490136maxresident)k
56inputs+0outputs (0major+2561167minor)pagefa

This is the code I'm using

require(clustuneR)
require(ape)
require(lubridate)

#setwd("~/git/clustuneR")
seqs <- ape::read.FASTA("data/na.fasta", type="DNA")
phy <- ape::read.tree("data/na.nwk")

# parse sequence headers (alternatively import from another file)
seq.info <- pull.headers(seqs, sep="_", var.names=c('accession', 'coldate', 'subtype'),
var.transformations=c(as.character, as.Date, as.factor))

max.year <- max(year(seq.info$coldate))
old.seqs <- seqs[year(seq.info$coldate) < max.year]

# use pplacer to graft new sequences onto old tree
phy.extend <- extend.tree(phy, seq.info, seqs, mc.cores=1, log.file="data/na.log")

# generate cluster sets under varying parameter settings
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) list(t=phy.extend, branch.thresh=x, boot.thresh=0.95))
cluster.sets <- multi.cluster(step.cluster, param.list) 

# configure poisson regression models
p.models = list(
   "nullmodel" = function(x){
       glm(growth~size, data=x, family="poisson")
   },
   "timemodel" = function(x){
       glm(growth~size+coldate, data=x, family="poisson")
   }
)
p.trans = list(  # average sample collection dates across nodes in each cluster
   "coldate" = function(x){mean(x)}
)

res <- fit.analysis(cluster.sets, predictive.models=p.models, 
                   predictor.transformations=p.trans)
aics <- get.aic(res)
delta.aic <- aics$timemodelaic - aics$nullmodelaic

cutoffs <- sapply(param.list, function(x) x$branch.thresh)
par(mar=c(5,5,1,1))
plot(cutoffs, delta.aic, type='l', col='cadetblue', lwd=2)
abline(h=0, lty=2)

Best, Steven

ArtPoon commented 1 year ago

Hi @stevenweaver sorry for the delayed response! We'll look into it.

ArtPoon commented 1 year ago

Ok I've reproduced the problem on my machine

ArtPoon commented 1 year ago

While I'm troubleshooting this @stevenweaver can you please try using https://github.com/PoonLab/tn ? As I recall, tn the original code that we used for the studies, and clustuneR was @ConnorChato refactoring of tn to make it more user friendly. Alternatively, you might try installing an older release of clustuneR - we may have broken something since then.

ArtPoon commented 1 year ago

I just confirmed that the example code for https://github.com/PoonLab/tn runs as expected

ArtPoon commented 1 year ago

The exception:

Error in eval(predvars, data, env) : object 'growth' not found

is thrown here: https://github.com/PoonLab/clustuneR/blob/353d3fae1357de6eff597c61713e13e9465c6738/R/analysis.R#L84-L89

It looks like we are asking for a growth variable that is not in the data frame:

> predictive.models
$nullmodel
function(x){
        glm(growth~size, data=x, family="poisson")
    }
<bytecode: 0x55d9130da9a0>

$timemodel
function(x){
        glm(growth~size+coldate, data=x, family="poisson")
    }

It looks like this might be a case sensitivity issue:

> head(model.data[SetID==1,])
                       Header Size Growth SetID RangeID coldate
1: KU190002.1_2008-06-26_subA    1      0     1       0   14056
2: KU190014.1_2010-03-05_subA    1      0     1       0   14673
3: KU190024.1_2012-03-22_subA    1      0     1       0   15421
4: KU190032.1_2007-03-23_subB    1      0     1       0   13595
5: KU190038.1_2007-11-28_subB    1      0     1       0   13845
6: KU190064.1_2007-12-03_subB    1      0     1       0   13850

ArtPoon commented 1 year ago

Problem was resolved as follows:

p.models = list(
    "nullmodel" = function(x){
        glm(Growth~Size, data=x, family="poisson")
    },
    "timemodel" = function(x){
        glm(Growth~Size+coldate, data=x, family="poisson")
    }
)

The next two lines also have case issues:

> aics <- get.aic(res)
Error in get.aic(res) : could not find function "get.aic"
> aics <- get.AIC(res)
> delta.aic <- aics$timemodelaic - aics$nullmodelaic
> delta.aic
integer(0)
> delta.aic <- aics$timemodelAIC - aics$nullmodelAIC
> head(delta.aic)
[1]  2.0000000 -0.9945144  0.3049146 -0.9269326 -0.4059748  1.3859614

ArtPoon commented 1 year ago

I was going to go correct the README file but it none of these case differences appear there. Where did you get that example script @stevenweaver ?

ArtPoon commented 1 year ago

@ConnorChato we have a different problem - the example script is not generating the same delta.AIC profile as what is shown in the current README document:

ArtPoon commented 1 year ago

This seems to be the situation:

clustuneR was an attempt to refactor and clean up the code that was used for our component clustering paper, Chato et al. 2020. At the time the repo was called "MountainPlot". The code that was actually used for the project is in https://github.com/PoonLab/tn.
We then redirected our efforts to developing a phylogenetic clustering method. This seems to have been carried out in https://github.com/PoonLab/tn, based on its commit history.
Around April 2021, the phylogenetic clustering work was then migrated over to this repository (see commit 890c92e06ffb620af162491e6220703c3beb8b3b).
We released a second paper on phylogenetic clustering (Chato et al. 2022). I relied on the code in the tn repository, documented example usage and linked the tn repo in the paper. These examples still work as expected. The confusing thing is that the README for clustuneR had ALSO been changed to focus on phylogenetic clustering, e.g., the example code that @stevenweaver was trying to reproduce.

I think we need to take the following actions:

"Freeze" the current functionality in tn with a release and tag associated with the phylogenetic clustering paper.
Locate the original component clustering code in the tn repo and make sure that we can reproduce results from the first paper.
Use clustuneR as a guide to refactoring and cleaning up code in the tn repo, making sure to keep these changes in a separate branch. We might even absorb the code in clustuneR into a separate branch within tn.
Write a unit test suite to facilitate step 3. (@SandeepThokala)
Redirect users from the clustuneR repo to tn.

ArtPoon commented 1 year ago

Okay I was able to reproduce results from the 2020 paper with the clustuneR code at tag v1.0, see issue #13

ConnorChato commented 1 year ago

Hi Art - catching up on this. Thank you for quickly working on the fix. It is looking like our documentation was maybe out of date with the code itself. I can take some time to retrace my steps and write up a side-by-side of where the code and documentation differ to aid in a README.md update if that's needed.

ArtPoon commented 1 year ago

Yeah we really need to update the README so that there are instructions for running either graph component clustering or phylogenetic clustering. As I recall a while back, tn and clustuneR were giving us discordant results on the same data - I'm not sure whether this is still the case.

ArtPoon commented 1 year ago

Working from tn , current master branch, I am not reproducing results from the 2020 paper (note sttn93.txt was retrieved from Erasmas backup at /home/cchato/Data/Seattle/:

setwd("~/git/tn")
source("comp_Lib.R")

iG <- impTN93(iFile="data2020/Seattle/sttn93.txt", reVars='_', varInd=c(1,2), 
             dateFormat="%Y", partQ=1-0.06819591)
# this partQ yields 110 new cases sampled in 2012

maxDs <- seq(0, 0.04, 0.001)

res <- GAICRun(g, maxDs=maxDs, nCores=4, plotGAIC = TRUE)

par(mar=c(5,5,1,1))
plot(maxDs, res$GAIC, xlab="Thresholds", ylab="AIC Loss", type='b')
abline(h=0, lty=2)

ArtPoon commented 1 year ago

@SandeepThokala can you please investigate what the differences are between comp_Lib.R in tn and MountainPlot.R from the v1.0 release of clustuneR?

SandeepThokala commented 1 year ago

Comparing impTN93 and gaicRun functions from both repositories.

clustuneR repository's comp_Lib.R script has impTN93 function, which is reading in the input file as a csv
```
impTN93 <- function(iFile, minNS=63){
```
```
idf <- read.csv(iFile, stringsAsFactors = F)
```

tn repository's impTN93 function is using fread for reading input file and is only used in a couple of deprecated files.

impTN93 <- function(iFile, reVars="_", varInd=c(1, 2),  addVarN=NA, dateFormat="%Y", partQ=0.95) {

idt <- fread(iFile)

clustuneR repository's gaicRun function returns

#@return: A data frame of each runs cluster information (clusterAnalyze output)

tn repository's GAICRun function returns

#@return: A data table of each runs cluster information.
#         Both null and proposed model AIC values, as well as the AIC loss ($nullAIC, $modAIC and $GAIC)
#         The max size, average size and number of singletons ($SizeMax, $MeanSize and $Singletons)
#         The total growth, largest growth and number of growing clusters ($GrowthTot, $GrowthMax, and $nGrowing)
#         The ID of the largest cluster and the cluster with the highest growth ($SizeMaxID and $GrowthMaxID)
#         The effect ratio of mean recency in growing clusters over mean recency in non-growing clusters ($xMag)

ArtPoon commented 1 year ago

To reiterate, we should be able to regenerate results from the 2020 paper using current code, e.g., https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R using MountainPlot.R from release v1.0 for reference Of course we also need to update the README to give examples for both phylogenetic and graph component clustering

ArtPoon commented 12 months ago

We should update the README with instructions for doing a component clustering analysis:

Use tn93 to generate a CSV of distances from an input FASTA file of aligned sequences
Read CSV into R and generate a graph (impTN93)
Extract graph components and identify new cases (gaicRun)
Fit Poisson regression models to the distribution of new cases among components (gaicRun)
Extract results from object returned from gaicRun and generate plot

ConnorChato commented 11 months ago

I am currently working on updating the README with the component based clustering. My most recent walkthrough of the code we have there for the Tree-Based clustering method did produce the expected results and matched the figure in the README.

ArtPoon commented 11 months ago

Okay @ConnorChato, I'll wait until you've committed your changes before doing anything drastic!

ConnorChato commented 11 months ago

Working on two things in this PR https://github.com/PoonLab/clustuneR/pull/14.

Updating the component clustering to give results more similar to tn code
Adding that process as a workable example in the README.

ArtPoon commented 10 months ago

Ok I spent a couple of days reviewing and revising code in a new branch iss12 (commit d5fd56b552283c997bd86fd029c1b15228a82a24) that I derived from the component_example branch, and managed to generate a delta-AIC plot that looks almost exactly like the MountainPlot example above (I think from the same data):

setwd('~/git/clustuneR')
source("R/sequence.setup.R")
source("R/graph.setup.R")
source("R/graph.clustering.R")
source("R/analysis.R")

edge.info <- read.csv("data/example_tn93.txt")
headers <- unique(c(edge.info$ID1, edge.info$ID2))
seq.info <- parse.headers(headers, var.names=c("ID", "colyear"))

which.new <- (seq.info$colyear == max(seq.info$colyear))

obj <- create.graph(seq.info, edge.info, which.new)

param.list <- list()
vals <- seq(0, 0.04, 0.001)
for (i in 1:length(vals)) {
  param.list[[i]] <- list(dist.thresh=vals[i])
}

cluster.data <- multi.cluster(obj, param.list, component.cluster)

ptrans <- list(colyear=mean)
pmods <- list(
  "NullModel" = function(x) {
    glm(Growth~Size, data=x, family="poisson")
    },
  "AltModel" = function(x) {
    glm(Growth ~ Size + colyear, data=x, family="poisson")
  })

model.fits <- fit.analysis(cluster.data, ptrans, pmods)
aic <- get.AIC(model.fits, param.list)

par(mar=c(5,5,1,1))
plot(aic$dist.thresh[aic$model=='NullModel'], 
     aic$AIC[aic$model=='NullModel'], type='l',
     xlab="Distance threshold", ylab="AIC")
lines(aic$dist.thresh[aic$model=='AltModel'],
      aic$AIC[aic$model=='AltModel'], col='red')

plot(aic$dist.thresh[aic$model=='NullModel'],
     aic$AIC[aic$model=='AltModel'] - aic$AIC[aic$model=='NullModel'],
     type='l', xlab="Distance threshold", ylab="Delta AIC")
abline(h=0, lty=2)

I changed some of the internal data structures to my liking (especially cutting out all the tidyverse code).

ArtPoon commented 10 months ago

Next steps are to revise the tree clustering code to use the same data structures, and write some dang unit tests!

ConnorChato commented 10 months ago

Unit tests are a great idea for this. I think some can be harvested from the sanity checking I was doing in this round of fixes. @ArtPoon was the input data/example_tn93.txt for the above plot built from the example Alberta Data?

ArtPoon commented 10 months ago

@SandeepThokala can you please help me write unit tests for the following R scripts in the iss12 branch?

sequence-setup.R
graph-setup.R
graph-clustering.R

ArtPoon commented 9 months ago

I retrieved original data files (TN93 CSV outputs) from Erasmas. The following script reproduces the delta-AIC profile for the Tennessee data set:

require(dplyr)

source("MountainPlot/comp_Lib.R")

# run through v1.0 (MountainPlot)
iFile <- "data/chato2020/tn93NAsubB.txt"

iG <- impTN93(iFile)  # this takes a while

steps <- head(hist(subset(iG$e, Distance<0.05)$Distance, plot=FALSE)$breaks, -5)
cutoffs <- seq(0 , max(steps), max(steps)/50) 

#A set of several graphs created at different cutoffs
gs <- lapply(cutoffs, function(d) {dFilt(iG, d)})

#Generate cluster data for each subGraph in gs
res <- lapply(gs, function(subG) {compAnalyze(subG)})
names(res) <- cutoffs

gaics <- sapply(res, function(x) {x$gaic})
modAIC <- sapply(res, function(x) {x$ageFit$aic})
nullAIC <- sapply(res, function(x) {x$nullFit$aic})

plot(cutoffs, gaics, type='o', pch=19, cex=0.8)
abline(a=0, b=1, lty=2)

Running the same data through the current clustuneR code yields a different profile:

edge.info <- read.csv("data/chato2020/tn93NAsubB.txt")

source("R/sequence.setup.R")
headers <- unique(c(edge.info$ID1, edge.info$ID2))
length(headers)

seq.info <- parse.headers(headers, sep="_", var.names = c("id", "colyear"),
                          var.transformations = c(as.character, as.integer))

which.new <- seq.info$colyear == max(seq.info$colyear)
sum(which.new)  # n=153 new cases in 2015

source("R/graph.setup.R")
obj <- read.edges(seq.info, edge.info, which.new)

source("R/graph.clustering.R")
source("R/analysis.R")

param.list <- list()
vals <- seq(0, 0.04, 8e-4)
for (i in 1:length(vals)) {
  param.list[[i]] <- list(dist.thresh=vals[i])
}

cluster.data <- multi.cluster(obj, param.list, component.cluster)

# total growth (number of new cases in clusters)
plot(sapply(split(cluster.data$Growth, cluster.data$SetID), sum), type='s')

# number of growing clusters
plot(sapply(split(cluster.data$Growth>0, cluster.data$SetID), sum), 
     pch=19, cex=0.8)

ptrans <- list(colyear=mean)
pmods <- list(
  "NullModel" = function(x) {
    glm(Growth ~ Size, data=x, family="poisson")
  },
  "AltModel" = function(x) {
    glm(Growth ~ Size + colyear, data=x, family="poisson")
  })

model.fits <- fit.analysis(cluster.data, ptrans, pmods)
aic <- get.AIC(model.fits, param.list)

# add to existing plot
lines(aic$dist.thresh, aic$AltModel-aic$NullModel, type='o', pch=19, cex=0.8, col='red')
abline(h=0, lty=2)

ArtPoon commented 9 months ago

I think a good place to start investigating the cause is to drill down on the distribution of new cases among clusters for a given distance threshold. subG is an object created by code in v1.0 comp_Lib.R, while cluster.data is a data.table object produced by the current main branch on the same data.

subG is a list containing the following values:

v: data frame of vertices (ID, Time, Weight, Cluster)
e: data frame of edges given cutoff
f: data frame, proportion of edges from time point t1 to t2 below distance threshold
g: integer, number of new cases (growth) per cluster
c: integer, number of known cases per cluster
gaic: numeric, stores the difference in AIC
ageMod: glm, fit of bipartite edge density as binomial outcome
ageFit: glm, alternate Poisson model fit
nullFit: glm, baseline Poisson model fit

This reveals that the distribution of new cases among clusters (cluster growth) is different:

> subG <- res[[10]]
> cd10 <- cluster.data[cluster.data$SetID==10,]
> dim(cd10)
[1] 524  10
> table(cd10$Size)

  1   2   3   4   5   6   8   9  10  11 
442  51  16   2   2   4   3   1   2   1 
> table(cd10$Growth)

  0   1   2   3  12 
498  22   2   1   1 
> table(subG$c)

  1   2   3   4   5   6   8   9  10  11 
442  51  16   2   2   4   3   1   2   1 
> table(subG$g)

  0   1   2   3  12 
496  22   3   2   1
> table(subG$c, subG$g)

       0   1   2   3  12
  1  429  12   1   0   0
  2   42   8   0   1   0
  3   14   1   1   0   0
  4    2   0   0   0   0
  5    2   0   0   0   0
  6    4   0   0   0   0
  8    2   0   1   0   0
  9    1   0   0   0   0
  10   0   1   0   0   1
  11   0   0   0   1   0
> table(cd10$Size, cd10$Growth)

       0   1   2   3  12
  1  430  12   0   0   0
  2   43   8   0   0   0
  3   14   1   1   0   0
  4    2   0   0   0   0
  5    2   0   0   0   0
  6    4   0   0   0   0
  8    2   0   1   0   0
  9    1   0   0   0   0
  10   0   1   0   0   1
  11   0   0   0   1   0

ArtPoon commented 9 months ago

This is pretty bad, the number of new cases joining clusters is discordant - @ConnorChato noted that this might be the problem in the PR:

> sum(cd10$Growth)
[1] 41
> sum(subG$g)
[1] 46

ArtPoon commented 9 months ago

Something strange going on with TN93. I noticed that some of the accession numbers for the NA data were truncated in the files I had retrieved from Erasmas, so I went back to the naFull.fasta alignment and regenerated a TN93 CSV file. This is what I get when I plot the sorted distances (y) against the sorted original (x):

Basically they are identical for values below 0.09 or so, and then veer off wildly. Funny thing is that both Erasmas and Orolo are running the same version of TN93 (v1.0.6)

ArtPoon commented 9 months ago

Using new TN93 data gives similar but slightly altered results:

ArtPoon commented 9 months ago

Repeating above with revised TN93 data:

> subG <- res[[10]]
> cd10 <- cluster.data[cluster.data$SetID==10,]
> dim(cd10)
[1] 521  10
> table(cd10$Size)

  1   2   3   4   5   6   8   9  10  16 
441  49  16   2   2   4   3   1   2   1 
> table(cd10$Growth)

  0   1   2   3  12 
495  22   2   1   1 
> table(subG$c)

  1   2   3   4   5   6   8   9  10  16 
441  49  16   2   2   4   3   1   2   1 
> table(subG$g)

  0   1   2   3  12 
493  22   3   2   1 
> table(cd10$Size, cd10$Growth)

       0   1   2   3  12
  1  429  12   0   0   0
  2   41   8   0   0   0
  3   14   1   1   0   0
  4    2   0   0   0   0
  5    2   0   0   0   0
  6    4   0   0   0   0
  8    2   0   1   0   0
  9    1   0   0   0   0
  10   0   1   0   0   1
  16   0   0   0   1   0
> table(subG$c, subG$g)

       0   1   2   3  12
  1  428  12   1   0   0
  2   40   8   0   1   0
  3   14   1   1   0   0
  4    2   0   0   0   0
  5    2   0   0   0   0
  6    4   0   0   0   0
  8    2   0   1   0   0
  9    1   0   0   0   0
  10   0   1   0   0   1
  16   0   0   0   1   0

ArtPoon commented 9 months ago

Exporting edge list for rendering with Graphviz:

> foo <- edge.info[edge.info$Distance < cutoffs[10],]
> dim(foo)
[1] 502   3
> head(foo)
                            ID1                        ID2    Distance
4    KU190152.1_2008-12-09_subB KU190153.1_2008-12-09_subB 0.001970860
269  KU190360.1_2009-12-22_subB KU190378.1_2010-02-09_subB 0.006938390
369  KU190256.1_2009-03-17_subB KU190259.1_2009-03-24_subB 0.002965970
838  KU190528.1_2011-08-11_subB KU190541.1_2011-09-08_subB 0.000000000
2055 KU190495.1_2011-04-07_subB KU190535.1_2011-08-22_subB 0.003944520
3350 KU190495.1_2011-04-07_subB KU190582.1_2011-05-11_subB 0.000985765
> temp <- sapply(foo$ID1, function(x) strsplit(x, '-')[[1]][1])
> head(temp)
KU190152.1_2008-12-09_subB KU190360.1_2009-12-22_subB KU190256.1_2009-03-17_subB 
         "KU190152.1_2008"          "KU190360.1_2009"          "KU190256.1_2009" 
KU190528.1_2011-08-11_subB KU190495.1_2011-04-07_subB KU190495.1_2011-04-07_subB 
         "KU190528.1_2011"          "KU190495.1_2011"          "KU190495.1_2011" 
> foo$ID1 <- sapply(foo$ID1, function(x) strsplit(x, '-')[[1]][1])
> foo$ID2 <- sapply(foo$ID2, function(x) strsplit(x, '-')[[1]][1])
> write.csv(foo, "iss12.edges.csv", row.names=F)

ArtPoon commented 9 months ago

import csv

max_year = '2013'
reader = csv.DictReader(open("iss12.edges.csv"))

outfile = open("iss12.dot", 'w')
outfile.write("graph iss12 {\n\toutputorder=\"edgesfirst\";\n\tnode [shape=rect style=filled];")

all_ids = set()

for row in reader:
  # write edges
  id1, year1 = row['ID1'].split('_')
  if id1 not in all_ids:
    all_ids.add(id1)
    outfile.write(f"\t\"{id1}\" [fillcolor=\"{'red' if year1==max_year else 'white'}\"];\n")

  id2, year2 = row['ID2'].split('_')
  if id2 not in all_ids:
    all_ids.add(id2)
    outfile.write(f"\t\"{id2}\" [fillcolor=\"{'red' if year2==max_year else 'white'}\"];\n")

  dist = float(row['Distance'])
  outfile.write(f"\t\"{id1}\"--\"{id2}\" [len={0.7+200*dist}];\n")

outfile.write("}\n")
outfile.close()

ArtPoon commented 9 months ago

Investigating the cluster of 2 that should grow by 3 first:

> which(subG$c==2 & subG$g==3)
384 
384 
> subG$v[subG$v$Cluster==384, ]
               ID Time     Weight Cluster
42107  KU190552.1 2011 0.09721059     384
106959 KU190594.1 2011 0.09721059     384
207857 KU190733.1 2013 0.05985794     384
215766 KU190774.1 2013 0.05985794     384
224750 KU190811.1 2013 0.05985794     384
> which(grepl("KU190552", cd10$id))
[1] 96
> cd10[96,]
   ClusterID                                                Header
1:        96 KU190552.1_2011-12-01_subB,KU190594.1_2011-07-13_subB
                      id   colyear Cluster Size Growth DistThresh SetID RangeID
1: KU190552.1,KU190594.1 2011,2011      96    2      0     0.0072    10       0

This cluster is definitely in the data - excerpt from DOT graph generated above (filtering edges on distance threshold of 0.0072):

ArtPoon commented 9 months ago

Note I had to modify v1.0 comp_Lib.R a bit to handle the new headers:

3c3
< 
---
> require(lubridate)
13a14
>   
20,21c21,22
<   el <- data.frame(ID1=as.character(temp1), t1=as.numeric(temp2), 
<                    ID2=as.character(temp3), t2=as.numeric(temp4), 
---
>   el <- data.frame(ID1=as.character(temp1), t1=year(as.Date(temp2)), 
>                    ID2=as.character(temp3), t2=year(as.Date(temp4)),

ArtPoon commented 9 months ago

Step through component.cluster to see where these edges are disappearing:

> dist.thresh <- 0.0072
> edge.info <- read.csv("data/chato2020/naFull-subB.tn93.csv")
> headers <- unique(c(edge.info$ID1, edge.info$ID2))
> seq.info <- parse.headers(headers, sep="_", var.names = c("id", "coldate", "subtype"),
+                           var.transformations = c(as.character, as.Date, as.factor))
> seq.info$subtype <- NULL
> seq.info$colyear <- year(seq.info$coldate)
> seq.info$coldate <- NULL
> which.new <- seq.info$colyear == max(seq.info$colyear)
> obj <- read.edges(seq.info, edge.info, which.new)
>   filtered.edges <- obj$edge.info[obj$edge.info$Distance <= dist.thresh, ]
> which(obj$seq.info$id=="KU190552.1")
[1] 104
> filtered.edges[which(filtered.edges$ID1==104),]
      ID1 ID2   Distance
43515 104 246 0.00295946
> filtered.edges[which(filtered.edges$ID2==104),]
[1] ID1      ID2      Distance
<0 rows> (or 0-length row.names)
> obj$seq.info[246,]
                       Header         id colyear   New
1: KU190594.1_2011-07-13_subB KU190594.1    2011 FALSE
> subG$e[which(subG$e$ID1=="KU190594.1"),]  # the other known case in this cluster
             ID1   t1        ID2   t2   Distance tMax tDiff
6604  KU190594.1 2011 KU190733.1 2013 0.00594669 2013     2
68653 KU190594.1 2011 KU190774.1 2013 0.00395159 2013     2
7334  KU190594.1 2011 KU190811.1 2013 0.00593821 2013     2
> subG$e[which(subG$e$ID2=="KU190594.1"),]
             ID1   t1        ID2   t2   Distance tMax tDiff
43515 KU190552.1 2011 KU190594.1 2011 0.00295946 2011     0
> which(obj$seq.info$id=="KU190594.1")
[1] 246
> filtered.edges[which(filtered.edges$ID2==246),]
      ID1 ID2   Distance
43515 104 246 0.00295946
> filtered.edges[which(filtered.edges$ID1==246),]
[1] ID1      ID2      Distance
<0 rows> (or 0-length row.names)

OK, so filtered.edges is supposed to contain edges to new cases but it does not. There's something wrong in minimum.retrospective.edge.

ArtPoon commented 9 months ago

Found the bug. minimum.retrospective.edge is including edges between new cases for its selection:

> which(seq.info$id %in% c("KU190733.1", "KU190774.1", "KU190811.1"))
[1] 525 564 620
>   # extract row index for shortest edge for each new node
>   min.retro.edges <- sapply(new.seqs, function(new.seq) {
+     my.edges <- c(which(obj$edge.info$ID1 == new.seq | obj$edge.info$ID2 == new.seq))
+     my.subset <- obj$edge.info[my.edges, ]
+     my.edges[which.min(my.subset$Distance)]
+   })
> mins <- obj$edge.info[min.retro.edges,]
> mins[mins$ID2 %in% c(525, 564, 620),]
         ID1 ID2   Distance
209815   525 620 0.00098504
216730   564 620 0.00197086
209815.1 525 620 0.00098504

ArtPoon commented 9 months ago

This fixes the size/growth discrepancy:

> cd10 <- cluster.data[cluster.data$SetID==10,]
> table(cd10$Size, cd10$Growth)

       0   1   2   3  12
  1  428  12   1   0   0
  2   40   8   0   1   0
  3   14   1   1   0   0
  4    2   0   0   0   0
  5    2   0   0   0   0
  6    4   0   0   0   0
  8    2   0   1   0   0
  9    1   0   0   0   0
  10   0   1   0   0   1
  16   0   0   0   1   0

ArtPoon commented 9 months ago

cluster growth trends are the same

plot(cutoffs, sapply(res, function(x) sum(x$g)), type='l',
     ylab="Number of new cases in clusters")
lines(cutoffs, sapply(split(cluster.data$Growth, cluster.data$DistThresh), sum), col='red')

number of active clusters is the same:

plot(cutoffs, sapply(res, function(x) sum(x$g>0)), type='l',
     ylab="Number of active clusters")
lines(cutoffs, sapply(
  split(cluster.data$Growth, cluster.data$DistThresh), 
  function(x) sum(x>0)
  ), col='red')

but dAIC profiles are still different:

> par(mar=c(5,5,1,1))
> plot(cutoffs, gaics, type='o', pch=19, cex=0.8)
> abline(a=0, b=1, lty=2)
> lines(aic$dist.thresh, aic$AltModel-aic$NullModel, col='red', pch=19, cex=0.8,
+       type='o')

ArtPoon commented 9 months ago

Next place to look is the regression model fits. These are already different:

> model.fits$NullModel[10]
[[1]]

Call:  glm(formula = Growth ~ Size, family = "poisson", data = x)

Coefficients:
(Intercept)         Size  
    -3.2347       0.3402  

Degrees of Freedom: 520 Total (i.e. Null);  519 Residual
Null Deviance:      304.4 
Residual Deviance: 216.1    AIC: 282.3

> res[[10]]$nullFit

Call:  glm(formula = Growth ~ Pred, family = "poisson", data = df2)

Coefficients:
(Intercept)         Pred  
     -3.235        5.161  

Degrees of Freedom: 520 Total (i.e. Null);  519 Residual
Null Deviance:      304.4 
Residual Deviance: 216.1    AIC: 282.3

Note the AIC values are the same but the coefficients are different. This is because we used to apply weights to cluster sizes:

> fit1 <- res[[10]]$nullFit
> fit2 <- model.fits$NullModel[[10]]
> table(fit2$data$Size)

  1   2   3   4   5   6   8   9  10  16 
441  49  16   2   2   4   3   1   2   1 
> table(fit1$data$Pred)

0.0659025787965616  0.131805157593123  0.197707736389685  0.263610315186246 
               441                 49                 16                  2 
 0.329512893982808   0.39541547277937  0.527220630372493  0.593123209169054 
                 2                  4                  3                  1 
 0.659025787965616   1.05444126074499 
                 2                  1

ArtPoon commented 9 months ago

v1.0 GLMs are fit here: https://github.com/PoonLab/clustuneR/blob/ef294dcf1f4121f5ec389257a24faaf403da4a4f/scripts/comp_Lib.R#L238-L242

df2 is the null model, where growth is predicted from the cluster size as a proportion of the total number of known cases. Since the regression is over clusters, the total numbers are irrelevant and the same regression fit is obtained from glm(Growth ~ c).

However, something goes awry when we add cluster recency. In v1.0, we derived a predictor variable from the expected decay in the probability of an edge between cases separated by time $\delta t$ (see Chato et al. 2020, equations 3 and 4). In the current branch we are taking the collection year directly as a predictor variable.

The expected edge probability is not simply a transformation of collection year because it is a function of the distance threshold.

ArtPoon commented 9 months ago

I'm going to peel off the implementation of the edge probability model into another issue, and redirect the focus of this issue to fixing the tree clustering.

ArtPoon commented 9 months ago

####################
# TREE CLUSTERING

# reproduce with old code
# https://github.com/PoonLab/tn/tree/master#readme
setwd("~/git/tn")
source("pplacer_utils.R")
input.tree <- "data/na.nwk"       
tree.log <- "data/na.log"
full.alignment <- "data/na.fasta"
out.ref <- "data/na.refpackage"
program <- "IQ-TREE"

taxitCreate(input.tree, tree.log, full.alignment, out.ref, program)
pplacer_guppy(out.ref, "data/")

source("subT_Lib.R")

date.format <- "%Y-%m-%d"
var.indices <- c(1,2,3)
sep <- '_'
extra.vars = c("Subtype")
growth.info <- "data/na_growth.tree"

extended.tree <- import.tree(input.tree, var.indices, date.format, sep, extra.vars)
extended.tree$g <- growthSim(extended.tree, growth.info)

distance.thresholds <- seq(0, 0.04, 0.001)
bootstrap.requirements <- 0.95
cores <- 4
clustering.function <- STClu

clus <- multiSTClu(extended.tree, distance.thresholds, bootstrap.requirements, 
                   cores, clustering.function)

proposed.var <- "Time"
runID <- 0
proposed.formula <- New~Old+Time
var.translation <- list(function(x){mean(x)})

v1res <- GAICRun(clus, runID, cores, proposed.formula, proposed.var, var.translation)
plot(distance.thresholds, v1res$GAIC, type='l')

###################### 

# re-run with current code
require(ape)
require(lubridate)

setwd("~/git/clustuneR")
seqs <- ape::read.FASTA("~/git/tn/data/na.fasta")

source("R/sequence.setup.R")
seq.info <- parse.headers(
  names(seqs), sep="_", var.names=c('accession', 'coldate', 'subtype'),
  var.transformations=c(as.character, as.Date, as.factor))

max.year <- max(year(seq.info$coldate))
old.seqs <- seqs[year(seq.info$coldate) < max.year]

phy <- ape::read.tree("data/na.nwk")
source("R/tree.setup.R")
phy.extend <- extend.tree(phy, seq.info, seqs, log.file="data/na.log")

param.list <- lapply(seq(0, 0.04, 0.001), function(x) list(branch.thresh=x, boot.thresh=0.95))
source("R/tree.clustering.R")
source("R/analysis.R")
cluster.sets <- multi.cluster(phy.extend, param.list, step.cluster) 

p.models = list(
  "NullModel" = function(x) glm(Growth~Size, data=x, family="poisson"),
  "TimeModel" = function(x) glm(Growth~Size+coldate, data=x, family="poisson")
)
p.trans = list("coldate" = mean)

v2res <- fit.analysis(cluster.sets, predictive.models = p.models, 
                      predictor.transformations = p.trans)
v2.gaic <- sapply(v2res$TimeModel, function(x) x$aic) - 
  sapply(v2res$NullModel, function(x) x$aic)
lines(seq(0, 0.04, 0.001), v2.gaic, col='red')

ArtPoon commented 9 months ago

> cd10 <- cluster.sets[cluster.sets$SetID==10, ]
> head(cd10)
   ClusterID                     Header  accession    coldate subtype Descendants
1:         1 KU189996.1_2008-06-12_subA KU189996.1 2008-06-12    subA           1
2:         2 KU189997.1_2008-09-09_subA KU189997.1 2008-09-09    subA           2
3:         3 KU189998.1_2008-09-10_subA KU189998.1 2008-09-10    subA           3
4:         4 KU189999.1_2008-11-04_subA KU189999.1 2008-11-04    subA           4
5:         5 KU190000.1_2008-11-04_subA KU190000.1 2008-11-04    subA           5
6:         6 KU190001.1_2008-05-06_subA KU190001.1 2008-05-06    subA           6
   Size Growth BranchThresh BootThresh SetID RangeID
1:    1      0        0.009       0.95    10       0
2:    1      0        0.009       0.95    10       0
3:    1      0        0.009       0.95    10       0
4:    1      0        0.009       0.95    10       0
5:    1      0        0.009       0.95    10       0
6:    1      1        0.009       0.95    10       0
> cd10.old <- clus[clus$SetID==10, ]
> head(cd10.old)
   ID Old New Membership  MaxD MinB SetID Growing       Time Subtype
1:  1   1   0 KU189996.1 0.009 0.95    10   FALSE 2008-06-12    subA
2:  9   1   0 KU190020.1 0.009 0.95    10   FALSE 2011-10-26    subA
3: 10   1   0 KU190026.1 0.009 0.95    10   FALSE 2012-09-05    subA
4: 13   1   0 KU190031.1 0.009 0.95    10   FALSE 2007-01-03    subB
5: 14   1   0 KU190389.1 0.009 0.95    10   FALSE 2010-03-02    subB
6: 24   1   0 KU190147.1 0.009 0.95    10   FALSE 2008-12-02    subB
> dim(cd10)
[1] 724  12
> dim(cd10.old)
[1] 599  10

Immediately we see a discordance in the number of rows in the respective data.table objects for a given distance threshold (0.009).

Here's the breakdown of cluster size and cluster growth:

> table(cd10.old$Old, cd10.old$New)

        0   1   2   3   9
  1   494  14   1   1   0
  2    63   2   0   0   0
  3    10   2   1   0   0
  4     2   0   0   0   0
  6     3   2   0   0   0
  7     0   1   0   0   0
  8     1   0   0   0   0
  10    0   1   0   0   0
  157   0   0   0   0   1
> table(cd10$Size, cd10$Growth)

      0   1   2   3  11
  1 599  18   1   1   0
  2  66   9   0   0   0
  3  14   0   1   0   0
  4   4   2   0   0   0
  5   2   1   0   0   0
  7   2   1   0   0   0
  8   2   0   0   0   0
  9   0   0   0   0   1

ArtPoon commented 9 months ago

I guess let's start at the earliest step, the annotation of growing tips in the tree using pplacer outputs:

> head(phy.extend$growth.info)
                       Header Bootstrap TermDistance PendantDistance Terminal
1: KU191034.1_2013-05-01_subC  1.000000   0.01825170     2.03131e-02     TRUE
2: KU190820.1_2013-10-21_subB  0.141822   0.00307456     1.09875e-06    FALSE
3: KU190820.1_2013-10-21_subB  0.142939   0.00307457     1.45385e-06     TRUE
4: KU190820.1_2013-10-21_subB  0.143128   0.00307457     1.29545e-06    FALSE
5: KU190820.1_2013-10-21_subB  0.142177   0.00307456     9.72928e-04    FALSE
6: KU190820.1_2013-10-21_subB  0.143317   0.00307457     1.45990e-06    FALSE
   NeighbourNode Cluster
1:           720     720
2:          1404    1399
3:           480     480
4:          1403    1399
5:          1405    1399
6:          1402    1399
> head(extended.tree$g)
                          nID   TermDist     PenDist oConn Bootstrap
1: KU191034.1_2013-05-01_subC 0.01825170 2.03131e-02   720  1.000000
2: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06  1404  0.141822
3: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06   480  0.142939
4: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06  1403  0.143128
5: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04  1405  0.142177
6: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06  1402  0.143317
> all(extended.tree$g$nID == phy.extend$growth.info$Header)
[1] FALSE
> all(sort(extended.tree$g$nID) == sort(phy.extend$growth.info$Header))
[1] TRUE
> new.px <- phy.extend$growth.info[order(phy.extend$growth.info$Header), ]
> old.px <- extended.tree$g[order(extended.tree$g$nID), ]
> all(old.px$nID == new.px$Header)
[1] TRUE
> all(old.px$TermDist == new.px$TermDistance)
[1] FALSE
> plot(old.px$TermDist, new.px$TermDistance)

ArtPoon commented 9 months ago

I thought the discrepancy might be due to random ordering of multiple graft sites for a given tip, but this does not appear to be the case:

> old.px[4:7,]
                          nID  TermDist    PenDist oConn Bootstrap
1: KU190029.1_2013-09-10_subA 0.0612241 0.00412592   907 0.1701670
2: KU190029.1_2013-09-10_subA 0.0613435 0.00607536   911 0.0926687
3: KU190029.1_2013-09-10_subA 0.0620539 0.00952974   906 0.0849338
4: KU190029.1_2013-09-10_subA 0.0523765 0.02610240    11 0.2086720
> old.px[old.px$nID=="KU190029.1_2013-09-10_subA",]
                          nID  TermDist     PenDist oConn Bootstrap
1: KU190029.1_2013-09-10_subA 0.0611702 6.74464e-06   909 0.0256374
2: KU190029.1_2013-09-10_subA 0.0604680 5.83353e-06   908 0.2477350
3: KU190029.1_2013-09-10_subA 0.0612240 3.19437e-03   910 0.1701850
4: KU190029.1_2013-09-10_subA 0.0612241 4.12592e-03   907 0.1701670
5: KU190029.1_2013-09-10_subA 0.0613435 6.07536e-03   911 0.0926687
6: KU190029.1_2013-09-10_subA 0.0620539 9.52974e-03   906 0.0849338
7: KU190029.1_2013-09-10_subA 0.0523765 2.61024e-02    11 0.2086720
> new.px[new.px$Header=="KU190029.1_2013-09-10_subA",]
                       Header Bootstrap TermDistance PendantDistance Terminal
1: KU190029.1_2013-09-10_subA 0.0256375    0.0611702     6.74464e-06    FALSE
2: KU190029.1_2013-09-10_subA 0.2477330    0.0604680     5.83353e-06    FALSE
3: KU190029.1_2013-09-10_subA 0.1701860    0.0612240     3.19987e-03    FALSE
4: KU190029.1_2013-09-10_subA 0.1701680    0.0612285     4.12592e-03    FALSE
5: KU190029.1_2013-09-10_subA 0.0926690    0.0613435     6.07536e-03    FALSE
6: KU190029.1_2013-09-10_subA 0.0849340    0.0620539     9.52974e-03    FALSE
7: KU190029.1_2013-09-10_subA 0.2086730    0.0523774     2.60981e-02     TRUE
   NeighbourNode Cluster
1:           909      NA
2:           908      NA
3:           910      NA
4:           907      NA
5:           911      NA
6:           906      NA
7:            11      NA
> which(old.px$TermDist != new.px$TermDistance)
 [1]   4   7  20  21  46  57  66  83  97 104 106 128 129 136 144 149 167 170 172
[20] 174 188 194 195 196 197 198 199 200 201 202 203 204 205 206 207 229 234 235
[39] 245 249 311 320 354 371 388 407 413 414 416 425 429 437 443 449 460 465 467
[58] 481 491 512 518 529 530 537 541 552 557 562 567
> x <- old.px$TermDist[old.px$nID=="KU190029.1_2013-09-10_subA"]
> y <- new.px$TermDistance[new.px$Header=="KU190029.1_2013-09-10_subA"]
> sort(x)
[1] 0.0523765 0.0604680 0.0611702 0.0612240 0.0612241 0.0613435 0.0620539
> sort(y)
[1] 0.0523774 0.0604680 0.0611702 0.0612240 0.0612285 0.0613435 0.0620539

Going to see if I can feed the pplacer output from one analysis directly into the other instead of calling the program twice.

ArtPoon commented 9 months ago

There does not appear to be any difference in outputs between runs of pplacer:

> gs1 <- growthSim(extended.tree, "~/git/tn/data/na_growth.tree")
> gs2 <- growthSim(extended.tree, "~/git/clustuneR/data/chato2022/na_growth.tree")
> dim(gs1)
[1] 569   5
> dim(gs2)
[1] 569   5
> head(gs1)
                          nID   TermDist     PenDist oConn Bootstrap
1: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06  1404  0.141822
2: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06   480  0.142939
3: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06  1403  0.143128
4: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04  1405  0.142177
5: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06  1402  0.143317
6: KU190820.1_2013-10-21_subB 0.00307457 9.78450e-04  1407  0.143314
> head(gs2)
                          nID   TermDist     PenDist oConn Bootstrap
1: KU191034.1_2013-05-01_subC 0.01825170 2.03131e-02   720  1.000000
2: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06  1404  0.141822
3: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06   480  0.142939
4: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06  1403  0.143128
5: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04  1405  0.142177
6: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06  1402  0.143317
> all(sort(gs1$TermDist) == sort(gs2$TermDist))
[1] TRUE
> all(sort(gs1$PenDist) == sort(gs2$PenDist))
[1] TRUE
> # reordering is not stochastic outcome of `growthSim`
> gs2 <- growthSim(extended.tree, "~/git/clustuneR/data/chato2022/na_growth.tree")
> head(gs2)
                          nID   TermDist     PenDist oConn Bootstrap
1: KU191034.1_2013-05-01_subC 0.01825170 2.03131e-02   720  1.000000
2: KU190820.1_2013-10-21_subB 0.00307456 1.09875e-06  1404  0.141822
3: KU190820.1_2013-10-21_subB 0.00307457 1.45385e-06   480  0.142939
4: KU190820.1_2013-10-21_subB 0.00307457 1.29545e-06  1403  0.143128
5: KU190820.1_2013-10-21_subB 0.00307456 9.72928e-04  1405  0.142177
6: KU190820.1_2013-10-21_subB 0.00307457 1.45990e-06  1402  0.143317

ArtPoon commented 9 months ago

Here I am bypassing the call to pplacer by clustuneR and feeding in the output from tn:

> growth.info.from.tn <- annotate.growth(phy, growth.info.trees)
Warning message:
In annotate.growth(phy, growth.info.trees) :
  Not all newly added sequences are noted in the seq.info of the tree
> head(growth.info.from.tn)
                       Header Bootstrap TermDistance PendantDistance Terminal
1: KU190820.1_2013-10-21_subB  0.141822   0.00307456     1.09875e-06    FALSE
2: KU190820.1_2013-10-21_subB  0.142939   0.00307457     1.45385e-06     TRUE
3: KU190820.1_2013-10-21_subB  0.143128   0.00307457     1.29545e-06    FALSE
4: KU190820.1_2013-10-21_subB  0.142177   0.00307456     9.72928e-04    FALSE
5: KU190820.1_2013-10-21_subB  0.143317   0.00307457     1.45990e-06    FALSE
6: KU190820.1_2013-10-21_subB  0.143314   0.00307457     9.78450e-04    FALSE
   NeighbourNode
1:          1404
2:           480
3:          1403
4:          1405
5:          1402
6:          1407
> all(sort(gs1$TermDist) == sort(growth.info.from.tn$TermDist))
[1] TRUE
> all(sort(old.px$TermDist) == sort(growth.info.from.tn$TermDistance))
[1] TRUE
> all(sort(new.px$TermDist) == sort(growth.info.from.tn$TermDistance))
[1] FALSE

This suggests that there's something wrong with the extend.tree function.

ArtPoon commented 9 months ago

Running this again:

> setwd("~/git/tn")
> source("pplacer_utils.R")
> input.tree <- "data/na.nwk"       
> tree.log <- "data/na.log"
> full.alignment <- "data/na.fasta"
> out.ref <- "data/na.refpackage"
> program <- "IQ-TREE"
> date.format <- "%Y-%m-%d"
> var.indices <- c(1,2,3)
> sep <- '_'
> extra.vars = c("Subtype")
> growth.info <- "data/na_growth.tree"
> 
> extended.tree <- import.tree(input.tree, var.indices, date.format, sep, extra.vars)
Warning message:
In type.convert.default(addVars) :
  'as.is' should be specified by the caller; using TRUE
> extended.tree$g <- growthSim(extended.tree, growth.info)
> setwd("~/git/clustuneR")
> seqs <- ape::read.FASTA("~/git/tn/data/na.fasta")
> 
> source("R/sequence.setup.R")
> seq.info <- parse.headers(
+     names(seqs), sep="_", var.names=c('accession', 'coldate', 'subtype'),
+     var.transformations=c(as.character, as.Date, as.factor))
> phy <- ape::read.tree("data/na.nwk")
> # stepping through extend.tree()
> phy <- ape::read.tree("data/na.nwk")
>   # Midpoint root for consistency and resolve multichotomies
>   phy <- phytools::midpoint.root(phy)
>   phy <- ape::multi2di(phy)
>   var.names <- colnames(seq.info)
>   # label new sequences
>   seq.info$New <- !(seq.info$Header %in% phy$tip.label)
>   phy$seq.info <- seq.info
>   phy$node.info <- annotate.nodes(phy)
>   phy$path.info <- annotate.paths(phy)
> growth.info.trees <- ape::read.tree("~/git/tn/data/na_growth.tree")
>     phy$growth.info <- annotate.growth(phy, growth.info.trees)
Warning message:
In annotate.growth(phy, growth.info.trees) :
  Not all newly added sequences are noted in the seq.info of the tree
> all(sort(extended.tree$g$TermDist) == sort(phy$growth.info$TermDistance))
[1] TRUE
> growth.info.trees <- ape::read.tree("~/git/clustuneR/data/chato2022/na_growth.tree")
>     phy$growth.info <- annotate.growth(phy, growth.info.trees)
Warning message:
In annotate.growth(phy, growth.info.trees) :
  Not all newly added sequences are noted in the seq.info of the tree
> all(sort(extended.tree$g$TermDist) == sort(phy$growth.info$TermDistance))
[1] TRUE

ArtPoon commented 9 months ago

Made a lot of progress writing unit tests and revising code in main branch. This is a useful bit of code for viewing the tree:

require(ggfree)
L <- tree.layout(phy)
plot(L, cex=0.6)
points(L, pch=21, bg='white', cex=2)
text(L$edges$x1, L$edges$y0, label=L$edges$child)
text(x=(L$edges$x0+L$edges$x1)/2, y=L$edges$y0+0.1, 
     label=round(L$edges$length, 5), cex=0.7)
text(x=(L$edges$x0+L$edges$x1)/2, y=L$edges$y0-0.1, 
     label=L$nodes$label[L$edges$child], cex=0.7)

ArtPoon commented 9 months ago

Current test results for divergence in graph clustering:

# v1.0 (MountainPlot) version
source("tests/testthat/helper_iss22.R")
iFile <- "data/chato2020/tn93NAsubB.txt"
iG <- impTN93(iFile)  # note have to revert to as.numeric
cutoffs <- seq(0, 0.04, length.out=50) 
res1 <- gaicRun(iG, cutoffs)
gaic1 <- sapply(res1, function(x) x$gaic)

# current main branch
edgelist <- read.csv(iFile)
headers <- unique(c(edgelist$ID1, edgelist$ID2))
seq.info <- parse.headers(headers, sep="_", var.names=c("accn", "colyear"),
                          var.transformations=c(as.character, as.integer))
which.new <- (seq.info$colyear == 2013)
obj <- read.edges(edgelist, seq.info, which.new)
times <- obj$seq.info$colyear

param.list <- lapply(1:length(cutoffs), function(i) {
  list(dist.thresh=cutoffs[i], setID=i, time.var="colyear")
  })

cluster.data <- multi.cluster(obj, param.list, component.cluster)

ptrans <- list("Weight"=sum)
pmods <- list(
  "NullModel"=function(x) glm(Growth~Size, data=x, family="poisson"),
  "AltModel"=function(x) glm(Growth~Weight, data=x, family="poisson")
)
res2 <- fit.analysis(cluster.data, transforms=ptrans, models=pmods)
gaic2 <- get.AIC(res2, param.list)

plot(cutoffs, gaic1, type='l', col='red')
abline(h=0, lty=2)
lines(cutoffs, gaic2$AltModel-gaic2$NullModel, col='blue')

Getting a lot closer...

PoonLab / clustuneR

Discordant results between v1.0 and main branch #12