Closed ConnorChato closed 9 months ago
Currently this does not match the results obtained with previous component clustering work. My current suspicion is that the re-written component-cluster function needs some attention.
Total growth does not increase monotonically with component clustering threshold
(from example)
# generate cluster sets under varying parameter settings
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) {list(g=g, dist.thresh=x)})
cluster.sets <- multi.cluster(component.cluster, param.list)
growth_total <- sapply(1:40, function(i){sum((cluster.sets[SetID == i])$Growth)})
all(growth_total == cummax(growth_total)) ## FALSE - SHOULD BE TRUE
@ConnorChato can I ask @SandeepThokala to assist with this? i.e., by comparing the MountainPlot.R script from version v1.0 to the current component clustering script?
Yes that would be wonderful - just got back from time away so there is a bit of a project queue.
I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the compClu() function from https://github.com/PoonLab/tn/blob/master/comp_Lib.R or the MountainPlot.R code.
@SandeepThokala can you please focus on this?
I think at the time we were developing this, there was a focus on the tree-based clustering that left some of these algorithms a bit less tested. I am certain the problem is compClu() though, as it doesn't make sense for total cluster growth to be decreasing with an increasing threshold.
@SandeepThokala we're going to have to walk through code changes starting from MountainPlot.R to graph.clustering.R (and associated scripts) to see where outputs change for the same test data.
@SandeepThokala can you please record your work through the commit history in this PR
@SandeepThokala determined that there was a major shift in the code base between commits fde8762 and 890c92e. It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.
Hi @ArtPoon @SandeepThokala - I have a couple of starting fixes I did to correct the component clustering algorithm used to determine components in the new code here. There were a few different problems in the way it was written, so I can provide a bit of specificity from the investigation I've done in my off time.
Here is an overcommented code chunk from the function that I've determined is misbehaving (component.cluster() from R/graph.clustering.R). Hopefully it can be helpful in some way.
# A clustering algorithm to determine disconnected components.
# Uses a table of sequence metadata (g$seq.info) and a matrix of edges (filtered.edges).
# If filtered.edges[i, j] is TRUE then there is an edge between node i and node j
# (which will both have a row in g$seq.info).

# Assign a "Cluster" to each node. This is a unique number.
# To start, each node is in its own, unique cluster id
g$seq.info[, "Cluster" := 1:nrow(g$seq.info)]

# Initialize a variable that tracks the previous cluster of each node.
# This is used as a halting condition.
previous.cluster <- rep(0, nrow(g$seq.info))

# (Tragically) uses a while loop - I am hoping there is a clever way to vectorize
# this, but I haven't found it yet.
# We will keep updating cluster memberships until things stop changing
while (any(g$seq.info$Cluster != previous.cluster)) {
  # Update previous cluster info to current cluster info
  previous.cluster <- g$seq.info$Cluster
  # Reassign each node's cluster
  g$seq.info[, Cluster := sapply(1:nrow(g$seq.info), function(i) {
    # Look at the clusters of the nodes that node "i" is connected to.
    # Node i will never be connected to itself.
    x <- g$seq.info[which(filtered.edges[i, ]), Cluster]
    if (length(x) == 0) {
      return(i)  # If node i is not connected to any other nodes, it stays in its current cluster
    } else {
      return(min(x))  # If connected, node i's cluster becomes one of its neighbours' clusters
    }
  })]
}
When figuring out WHICH neighbour's cluster node i should adopt, it takes the lowest cluster id among the neighbours. I'm not exactly sure how best to resolve this, but that logic can lead to infinite loops where two neighbours trade their cluster ids back and forth.
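The back-and-forth trade described above can be seen in a two-node toy case. The following is a sketch with made-up data, not the package's actual data structures; including the node's own id in the minimum is one way to make the update monotone, since ids can then only decrease and the loop must terminate.

```python
import numpy as np

# Two connected nodes start in clusters 1 and 2. If each node takes the
# minimum cluster id of its *neighbours only*, they swap ids forever:
#   step 0: [1, 2] -> step 1: [2, 1] -> step 2: [1, 2] -> ...
adj = np.array([[False, True],
                [True,  False]])  # one edge between node 0 and node 1
cluster = np.array([1, 2])

# Buggy update: min over neighbours only -> the ids swap places
buggy = np.array([cluster[adj[i]].min() for i in range(2)])

# Fixed update: min over the node itself AND its neighbours -> both
# nodes settle on the smaller id, and no further update changes anything
fixed = np.array([min(cluster[i], cluster[adj[i]].min()) for i in range(2)])
```

Here `buggy` comes out as `[2, 1]` (a swap that would undo itself on the next iteration), while `fixed` comes out as `[1, 1]` (a stable assignment).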
Actually in re-describing that issue, I may have found a solution that generates some expected results.
(From Northern Alberta data)
Since we are using a slightly different method of tn93 calculation in this code to simplify dependencies, I would still expect some slight differences between this and the original results, but I hope this is fairly similar to what you two find (@SandeepThokala @ArtPoon). Apologies if this duplicates or undoes any of your current work. There are likely much more elegant (and fast) solutions to this same issue that could replace this method anyway; I just wanted to finish up my revisit to this function.
The old function impTN93 creates a list of minimum retrospective edges minE for every vertex v by following the steps below. The new function create.graph, in contrast, calls minimum.retrospective.edge, which only gets the closest retrospective edges of new vertices (not of all "not old" vertices).
https://github.com/PoonLab/clustuneR/blob/353d3fae1357de6eff597c61713e13e9465c6738/R/graph.setup.R#L54-L62
Hi @SandeepThokala. The only point where it's truly necessary to find minimum retrospective edges is when looking at new cases linking to old cases. For the previous version, I calculated the minimum retrospective edge from every node because we were interested in sequentially re-defining the newest year, so any old year may eventually get called "New" and at that point we'd want its minimum retrospective edge.
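The minimum-retrospective-edge logic described above can be sketched in pandas on a hypothetical three-sequence edge list (ids, years, and distances are invented for illustration, and the helper name is made up, not part of either code base):

```python
import pandas as pd

# Toy edge list in the same shape as the parsed tn93 output: each row is
# an edge carrying the times of its two endpoints and a genetic distance.
edges = pd.DataFrame({
    "ID1": ["a", "a", "b"], "t1": [2020, 2020, 2021],
    "ID2": ["b", "c", "c"], "t2": [2021, 2019, 2019],
    "Distance": [0.01, 0.02, 0.005],
})
edges["tMax"] = edges[["t1", "t2"]].max(axis=1)
edges["tDiff"] = (edges["t1"] - edges["t2"]).abs()

def min_retrospective_edge(edges, v_id, v_time):
    """Closest edge linking vertex v back to a strictly older vertex."""
    inc = edges[(edges["ID1"] == v_id) | (edges["ID2"] == v_id)]
    retro = inc[(inc["tMax"] == v_time) & (inc["tDiff"] > 0)]
    return retro.loc[retro["Distance"].idxmin()]

# "b" (2021) links back to "a" (2020, d=0.01) and "c" (2019, d=0.005);
# its minimum retrospective edge is the closer one, to "c".
e = min_retrospective_edge(edges, "b", 2021)
```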
It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.
I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt:
tn93 -o data/test_tn93.txt data/na.fasta
But the result was only 1308 edges long, whereas example_tn93.txt has 1306536 edges.
tn93 has a threshold option -t which I believe defaults to 0.015. Typically when we run tn93 we would use an ultra-permissive threshold to look at the maximum number of edges (something like tn93 -t 1 -o data/test_tn93.txt data/na.fasta).
Uploading my understanding of the functions.
The input file (example_tn93.txt) contains a huge list of comma separated edge info (ID1, ID2, distance).
impTN93 splits each sequence ID on the "_" character, extracting ID and time:
idf <- read.csv(iFile, stringsAsFactors = F)
temp1 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[1]])
temp2 <- sapply(idf$ID1, function(x) (strsplit(x,'_')[[1]])[[2]])
temp3 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[1]])
temp4 <- sapply(idf$ID2, function(x) (strsplit(x,'_')[[1]])[[2]])
It builds the edge list g$e, obtaining the maximum time and time difference for each edge, resulting in a data frame with 7 columns: ID1, t1, ID2, t2, Distance, tMax, tDiff. It builds the vertex list g$v from the ID and time columns of the edge list, resulting in a data frame with 2 columns: ID, Time.
el <- data.frame(ID1=as.character(temp1), t1=as.numeric(temp2),
ID2=as.character(temp3), t2=as.numeric(temp4),
Distance = as.numeric(idf$Distance),
stringsAsFactors= F)
el$tMax <- pmax(el$t1,el$t2)
el$tDiff <- abs(el$t1-el$t2)
vl <- unique(data.frame(ID = c(el$ID1, el$ID2), Time = c(el$t1, el$t2), stringsAsFactors=F))
minE is then built as follows:
-- Gets a list of "not old" vertices.
-- Gets the closest retrospective edge for every "not old" vertex v.
-- Binds the rows of the closest retrospective edges of all "not old" vertices.
subV <- subset(g$v, Time > min(Time))
minE <- bind_rows(lapply(1:nrow(subV), function(i) {
  v <- subV[i, ]
  incE <- subset(g$e, (ID1 %in% v$ID) | (ID2 %in% v$ID))
  retE <- subset(incE, (tMax == v$Time) & (tDiff > 0))
  retE[which(retE$Distance == min(retE$Distance))[[1]], ]
}))
The newest year's edges are then removed from g$e and only the new minimum retrospective edges are added back:
g$e <- subset(g$e, tMax!=max(tMax))
g$e <- rbind(g$e, subset(minE, tMax==max(tMax)))
The remaining (older) minimum retrospective edges are stored in g$f:
g$f <- subset(minE, tMax < max(tMax))
and the graph g is returned.
compAnalyze operates on a subgraph subG of graph g, taking the largest observed distance as the threshold dMax:
dMax <- max(subG$e$Distance)
It counts the retrospective edges in subG$f per year (tTab):
tTab <- table(subG$f$tMax)
and counts the vertices per year (vTab):
vTab <- table(subG$v$Time)
For each year in vTab, it counts the number of incident edges (eTab):
eTab <- sapply(as.numeric(names(vTab)), function(t) {
  nrow(subset(subG$e, (t1 == t | t2 == t)))
})
names(eTab) <- names(vTab)
It then creates a list of densities ageD, for each year t in tTab:
-- Gets the subset of retrospective edges of that year.
-- Groups that subset by the difference in years between the edge's vertices.
-- For each group, gets the number of edges within the distance threshold dMax (Positive).
-- For each group's year difference, gets the ratio of the old year's edge count in eTab to its vertex count in vTab (oeDens).
-- Creates a data frame with columns Positive, vTotal, oeDens, tDiff.
-- Binds the rows of all data frames.
ageD <- bind_rows(lapply(as.numeric(names(tTab)), function(t) {
  temp <- subset(subG$f, tMax == t)
  dfs <- split(temp, temp$tDiff)
  Positive <- sapply(dfs, function(df) { length(which(df$Distance <= dMax)) })
  vTotal <- rep((vTab[[as.character(t)]]), length(dfs))
  tDiff <- as.numeric(names(Positive))
  oeDens <- sapply(tDiff, function(tD) {
    oTime <- t - tD
    return(eTab[as.character(oTime)] / vTab[as.character(oTime)])
  })
  res <- data.frame(Positive = as.numeric(Positive), vTotal = vTab[[as.character(t)]],
                    oeDens = oeDens, tDiff)
  return(res)
}))
A binomial GLM is then fit to the age data, and its predictions become vertex weights:
mod <- glm(cbind(Positive, vTotal) ~ tDiff+oeDens, data=ageD, family='binomial')
subG$v$Weight <- predict(mod, type='response',
data.frame(tDiff=max(subG$v$Time)-subG$v$Time,
oeDens=as.numeric(eTab[as.character(subG$v$Time)]/vTab[as.character(subG$v$Time)])))
Finally, cluster growth is simulated and two Poisson fits are compared (the age-weighted prediction versus a size-proportional null), with their AIC difference stored as subG$gaic:
subG <- simGrow(subG)
cPred <- subset(subG$v, Time<max(Time))[,c("Weight", "Cluster")]
df1 <- data.frame(Growth = as.numeric(subG$g), Pred = sapply(names(subG$c), function(x) { sum(subset(cPred, Cluster==as.numeric(x))$Weight) }))
df2 <- data.frame(Growth = as.numeric(subG$g), Pred = as.numeric(subG$c) * (sum(as.numeric(subG$g))/sum(as.numeric(subG$c))))
fit1 <- glm(Growth ~ Pred, data = df1, family = "poisson")
fit2 <- glm(Growth ~ Pred, data = df2, family = "poisson")
subG$gaic <- fit1$aic-fit2$aic
subG$ageMod <- mod
subG$ageFit <- fit1
subG$nullFit <- fit2
subG$f <- ageD
subG
@SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but this conversion script is taking a long time to process. Can you post this script here so we can troubleshoot it?
Conversion script
import numpy as np
import pandas as pd
from tqdm import tqdm

ex = "./data/example_tn93.txt"

def get_distance(id1, id2):
    distance = df.loc[(df['ID1'] == id1) & (df['ID2'] == id2), 'Distance']
    if not(len(distance)):
        distance = df.loc[(df['ID2'] == id1) & (df['ID1'] == id2), 'Distance']
    if not(len(distance)):
        distance = np.nan
    if isinstance(distance, pd.core.series.Series):
        return distance.iloc[0]
    else:
        return distance

df = pd.read_csv(ex)
unique_ids = set(df['ID1'].unique())
unique_ids = unique_ids.union(set(df['ID2'].unique()))
unique_ids = np.array(list(unique_ids))
num_ids = len(unique_ids)

matrix = np.zeros((num_ids, num_ids))
matrix = np.reshape(np.append(unique_ids, matrix), (num_ids + 1, num_ids))

count = 0
for row_index, row in tqdm(enumerate(unique_ids, 1)):
    for col_index, col in enumerate(unique_ids[count:], count):
        if col != row:
            matrix[row_index, col_index] = get_distance(row, col)
        else:
            matrix[row_index, col_index] = 0
    count += 1

matrix = matrix + matrix.T
np.savetxt('example_matrix.csv', matrix, delimiter=',')
This first loop of component.cluster in R/graph.clustering.R is currently a while loop. I had hoped that it would generally be limited in the number of steps it takes, but yes, it does seem to be really slow. Finding a way to vectorize this stage would likely be a big step forward.
After running the conversion script for almost 60 hours, example_matrix.csv is generated with 1617 rows and 1617 columns.
Reading in both old and new example data files.
require(clustuneR)
require(ape)
require(lubridate)
setwd(".")
# Reading in new example file ("data/na.fasta") and generating edge info using "TN93" model
seqs <- ape::read.FASTA("data/na.fasta", type="DNA")
new.df <- ape::dist.dna(seqs, pairwise.deletion = T, as.matrix = T, model = "TN93")
# Reading in edge info generated from old example file ("data/example_tn93.txt") using conversion script
ex.csv <- "./data/example_matrix.csv"
old.df <- read.csv(ex.csv)
rownames(old.df) <- colnames(old.df)
Comparing dimensions of both dataframes.
> dim(old.df)
[1] 1617 1617
> dim(new.df)
[1] 1054 1054
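The dimension mismatch (1617 vs 1054) suggests the two inputs cover different sequence sets. A quick way to localize the discrepancy (sketch with invented ids, not the real headers) is a set comparison of the two header lists:

```python
# Hypothetical id sets standing in for the headers of example_matrix.csv
# and of the na.fasta alignment.
old_ids = {"a_2020", "b_2021", "c_2019"}
new_ids = {"a_2020", "b_2021", "d_2018"}

only_old = sorted(old_ids - new_ids)  # present only in the converted matrix
only_new = sorted(new_ids - old_ids)  # present only in the fasta
shared = sorted(old_ids & new_ids)    # usable for a direct comparison
```

Inspecting `only_old` and `only_new` on the real data would show whether example_tn93.txt was built from a larger alignment than na.fasta, or whether the header parsing dropped sequences.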
The conversion script above created the edge.info list from example_tn93.txt.
import pandas as pd
import numpy as np
from datetime import datetime as dt

old_txt = './data/example_tn93.txt'

def get_seq_info(old_txt):
    df = pd.read_csv(old_txt)
    unique_ids = set(df['ID1'].unique())
    unique_ids = unique_ids.union(set(df['ID2'].unique()))
    unique_ids = np.array(list(unique_ids))
    seq_info = {'accession': list(map(lambda x: x.split('_')[0], unique_ids)),
                'coldate': list(map(lambda x: dt.strftime(dt.strptime(x.split('_')[1], '%Y'),
                                                          '%Y-%m-%d'),
                                    unique_ids)),
                'subtype': list(map(lambda x: np.random.choice(['subA', 'subB', 'subC']),
                                    unique_ids))}  # Assigning random subtypes
    seq_info = pd.DataFrame(seq_info)
    seq_info['Header'] = seq_info.apply(
        lambda x: f"{x['accession']}_{x['coldate']}_{x['subtype']}", axis=1)
    seq_info.to_csv('./data/seq_info.csv', index=False)

get_seq_info(old_txt)
Used another conversion script to create and save the seq.info list from example_tn93.txt.
# replacing usage of pull.headers
seq.info <- read.csv("./data/seq_info.csv")
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)
# replacing usage of ape::dist.dna for calculating genetic distance using TN93 model
edge.info <- read.csv("./data/edge_info.csv")
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)
This resulted in an error:
Error in create.graph(seq.info, edge.info, which.new) :
The pairwise distance matrix does not contain all recognized headers
from alignment
This input validation in the graph.setup.R script caused the error:
if (!all(colnames(edge.info) %in% seq.info$Header)) {
stop("The pairwise distance matrix does not contain all recognized headers
from alignment")
}
seq.info <- read.csv("./data/seq_info.csv")
seqs <- list()
for (header in seq.info$Header) {
  seqs[[header]] <- header
}
seq.info <- pull.headers(seqs, sep="_", var.names=c('accession', 'coldate', 'subtype'),
var.transformations=c(as.character, as.Date, as.factor))
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)
edge.info <- read.csv("./data/edge_info.csv")
colnames(edge.info) <- lapply(colnames(edge.info), function(x) {
  seq.info$Header[seq.info$accession == strsplit(x, "_")[[1]][1]]
})
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)
Finally able to create the graph data structure using the above script.
I've refactored a lot of the code and merged my work from the iss12 branch into this one, but I don't think we're out of the woods yet. I am still having problems reproducing the graph clustering results with data files other than the mysterious example_tn93.txt one.
I'm going to go ahead and merge this PR. The master branch is broken anyhow, so I am going to commit my ongoing work directly to that branch.
Demonstrate a working example of graph-based cluster analysis using na.fasta example data and component.cluster function