PoonLab / clustuneR

Implementing clustering algorithms on genetic data and finding optimal parameters through the performance of predictive growth models.
GNU General Public License v3.0
0 stars 0 forks source link

Readme update with alternative graph example #14

Closed ConnorChato closed 9 months ago

ConnorChato commented 11 months ago

Demonstrate a working example of graph-based cluster analysis using na.fasta example data and component.cluster function

ConnorChato commented 11 months ago

Currently this does not match the results obtained with previous component clustering work. My current suspicion is that the re-written component-cluster function needs some attention.

ConnorChato commented 11 months ago

Total growth does not increase monotonically with component clustering threshold

(from example)

# generate cluster sets under varying parameter settings
param.list <- lapply(seq(0.001, 0.04, 0.001), function(x) {list(g=g, dist.thresh=x)})
cluster.sets <- multi.cluster(component.cluster, param.list) 

growth_total <- sapply(1:40, function(i){sum((cluster.sets[SetID == i])$Growth)}) 
all(growth_total == cummax(growth_total)) ## FALSE - SHOULD BE TRUE
ArtPoon commented 11 months ago

@ConnorChato can I ask @SandeepThokala to assist with this? i.e., by comparing the MountainPlot.R script from version v1.0 to the current component clustering script?

ConnorChato commented 11 months ago

Yes that would be wonderful - just got back from time away so there is a bit of a project queue.

ConnorChato commented 11 months ago

I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the compClu() function from https://github.com/PoonLab/tn/blob/master/comp_Lib.R or the MountainPlot.R code.

ArtPoon commented 11 months ago

I've isolated the issue down to the clustering algorithm itself - I suspect that https://github.com/PoonLab/clustuneR/blob/master/R/graph.clustering.R would not produce the same cluster set as the compClu() function from https://github.com/PoonLab/tn/blob/master/comp_Lib.R or the MountainPlot.R code.

@SandeepThokala can you please focus on this?

ConnorChato commented 11 months ago

I think at the time we were developing this, there was a focus on the tree-based clustering that left some of these algorithms a bit less tested. I am certain the problem is compClu() though, as it doesn't make sense for total cluster growth to be decreasing with an increasing threshold.

ArtPoon commented 11 months ago

@SandeepThokala we're going to have to walk through code changes starting from MountainPlot.R to graph.clustering.R (and associated scripts) to see where outputs change for the same test data.

ArtPoon commented 11 months ago

@SandeepThokala can you please record your work through the commit history in this PR

ArtPoon commented 11 months ago

@SandeepThokala determined that there was a major shift in the code base between commits fde8762 and 890c92e. It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

ConnorChato commented 11 months ago

Hi @ArtPoon @SandeepThokala - I have a couple of starting fixes I did to correct the component clustering algorithm used to determine components in the new code here. There were a few different problems in the way it was written, so I can provide a bit of specificity from the investigation I've done in my off time.

ConnorChato commented 11 months ago

Here is an overcommented code chunk from the function that I've determined is misbehaving (component.cluster() from R/graph.clustering.R. Hopefully it can be helpful in some way

  # A clustering algorithm to determine disconnected components 
  # Uses a table of sequence metadata (g$seq.info) and a matrix of edges (filtered.edges).
  # If filtered.edges[i, j] is TRUE then there is an edge between node i and node j (which will both have a row in g$seq.info).

  # Assign a "Cluster" to each node. This is a unique number.
  # To start, each node is in its own, unique cluster id
  g$seq.info[, "Cluster" := 1:nrow(g$seq.info)]

  # Initialize a variable that tracks the previous cluster of each node. This is used as a halting condition.
  previous.cluster <- rep(0, nrow(g$seq.info))

  # (Tragically) Uses a while loop - I am hoping there is a clever way to vectorize this, but I haven't found it yet.
  # We will keep updating cluster memberships until things stop changing
  while (any(g$seq.info$Cluster != previous.cluster)) {

    # Update previous cluster info to current cluster info
    previous.cluster <- g$seq.info$Cluster

    # Reassign each node's cluster
    g$seq.info[, Cluster := sapply(1:nrow(g$seq.info), function(i) {

     # Look at the clusters of the nodes that node "i" is connected to
     # Node i will never be connected to itself.
      x <- g$seq.info[which(filtered.edges[i, ]), Cluster]
      if (length(x) == 0) {
        return(i) # If node i is not connected to any other nodes, it stays in it's current cluster
      } else {
        return(min(x)) # If connected to other nodes, node i's cluster becomes one of it's neighbour's clusters 
      }
    })]
  }

The issue

When figuring out WHICH neighbour's cluster node i should use, it decides on the cluster id that is the lowest number. I'm not exactly sure how best to resolve, but this logic can lead to infinite loops where neighbours trade their cluster id back and forth.

ConnorChato commented 11 months ago

Actually in re-describing that issue, I may have found a solution that generates some expected results.

(From Northern Alberta data) image

Since we are using a slightly different method of tn93 calculation in this code to simplify dependancies, I would still expect some slight differences between this and the original results, but I hope this is fairly similar to what you two find (@SandeepThokala @ArtPoon). Apologies if this duplicates or undoes any of your current work. There are likely much more elegant (and fast) solutions to this same issue that could replace this method anyway, I just wanted to finish up my revisit to this function.

SandeepThokala commented 11 months ago

The old function impTN93 creates a list of minimum retrospective edges minE by following the steps below:

Where as the new function create.graph calls minimum.retrosepctive.edge which just gets closest retrospective edges of only new vertices (not for all "not old"). https://github.com/PoonLab/clustuneR/blob/353d3fae1357de6eff597c61713e13e9465c6738/R/graph.setup.R#L54-L62

ConnorChato commented 11 months ago

Hi @SandeepThokala. The only point where it's truly necessary to find minimum retrospective edges is when looking at new cases linking to old cases. For the previous version, I calculated the minimum retrospective edge from every node because we were interested in sequentially re-defining the newest year, so any old year may eventually get called "New" and at that point we'd want it's minimum retrospective edge.

SandeepThokala commented 11 months ago

It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt tn93 -o data/test_tn93.txt data/na.fasta But the result was only 1308 edges long, where as example_tn93.txt has 1306536 edges.

ConnorChato commented 11 months ago

It does not seem possible to run the example in the latter commit. My suggestion is to move forward from 890c92e until it is possible to run the example, and then isolate the problem by (if possible) converting between the old and new data structures and comparing them to check whether the data input and pre-processing steps are equivalent.

I tried to run tn93 on data/na.fasta to see if we're using the same input as example_tn93.txt tn93 -o data/test_tn93.txt data/na.fasta But the result was only 1308 edges long, where as example_tn93.txt has 1306536 edges.

tn93 has a threshold option -t which I believe is 0.015. Typically when we run tn93 we would use an ultra permissive threshold to look at the maximum number of edges (something like tn93 -t 1 -o data/test_tn93.txt data/na.fasta)

SandeepThokala commented 11 months ago

Uploading my understanding of the functions.

The input file (example_tn93.txt) contains a huge list of comma separated edge info (ID1, ID2, distance).

impTN93

SandeepThokala commented 11 months ago

compAnalyze

ArtPoon commented 10 months ago

@SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but this conversion script it is taking a long time to process. Can you post this script here so we can troubleshoot it?

SandeepThokala commented 10 months ago

Conversion script

import numpy as np
import pandas as pd
from tqdm import tqdm

ex = "./data/example_tn93.txt"

def get_distance(id1, id2):
  distance = df.loc[(df['ID1'] == id1) & (df['ID2'] == id2), 'Distance']
  if not(len(distance)):
    distance = df.loc[(df['ID2'] == id1) & (df['ID1'] == id2), 'Distance']
    if not(len(distance)):
      distance = np.nan

  if isinstance(distance, pd.core.series.Series):
    return distance.iloc[0]
  else:
    return distance

df = pd.read_csv(ex)

unique_ids = set(df['ID1'].unique())
unique_ids = unique_ids.union(set(df['ID2'].unique()))
unique_ids = np.array(list(unique_ids))

num_ids = len(unique_ids)
matrix = np.zeros((num_ids, num_ids))
matrix = np.reshape(np.append(unique_ids, matrix), (num_ids + 1, num_ids))

count = 0
for row_index, row in tqdm(enumerate(unique_ids, 1)):
    for col_index, col in enumerate(unique_ids[count:], count):
        if col != row:
            matrix[row_index, col_index] = get_distance(row, col)
        else:
            matrix[row_index, col_index] = 0

    count += 1

matrix = matrix + matrix.T

np.savetxt('example_matrix.csv', matrix, delimiter=',')
ConnorChato commented 10 months ago

@SandeepThokala reports that he was able to convert the example data file into the data structure used by the most recent code, but this conversion script it is taking a long time to process. Can you post this script here so we can troubleshoot it?

This first loop of component.cluster in R/graph.clustering.R is currently a while loop. I had hoped that it would generally limited in the number of steps that it takes, but yes it does seem to be really slow. Finding a way to vectorize this stage would likely be a big step forward.

SandeepThokala commented 10 months ago

After running the conversion script for almost 60 hours, example_matrix.csv is generated with 1617 rows and 1617 columns.

Reading in both old and new example data files.

require(clustuneR)
require(ape)
require(lubridate)
setwd(".")

# Reading in new example file ("data/na.fasta") and generating edge info using "TN93" model
seqs <- ape::read.FASTA("data/na.fasta", type="DNA")
new.df <- ape::dist.dna(seqs, pairwise.deletion = T, as.matrix = T, model = "TN93")

# Reading in edge info generated from old example file ("data/example_tn93.txt") using conversion script
ex.csv <- "./data/example_matrix.csv"

old.df <- read.csv(ex.csv)
rownames(old.df) <- colnames(old.df)

Comparing dimensions of both dataframes.

> dim(old.df)
[1] 1617 1617
> dim(new.df)
[1] 1054 1054
SandeepThokala commented 10 months ago

Creating intermediate data structures used in clustuneR

The conversion script above created edge.info list from example_tn93.txt

import pandas as pd
import numpy as np
from datetime import datetime as dt

old_txt = './data/example_tn93.txt'

def get_seq_info(old_txt):
    df = pd.read_csv(old_txt)
    unique_ids = set(df['ID1'].unique())
    unique_ids = unique_ids.union(set(df['ID2'].unique()))
    unique_ids = np.array(list(unique_ids))

    seq_info = {'accession': list(map(lambda x: x.split('_')[0],
                                      unique_ids)),
                'coldate': list(map(lambda x: dt.strftime(dt.strptime(x.split('_')[1], '%Y'),
                                                          '%Y-%m-%d'),
                                    unique_ids)),
                'subtype': list(map(lambda x: np.random.choice(['subA', 'subB', 'subC']),
                                    unique_ids))} # Assigning random subtypes

    seq_info = pd.DataFrame(seq_info)
    seq_info['Header'] = seq_info.apply(lambda x: f"{x['accession']}_{x['coldate']}_{x['subtype']}", axis=1)

    seq_info.to_csv('./data/seq_info.csv', index=False)

get_seq_info(old_txt)

Used another conversion script to create and save seq.info list from example_tn93.txt

SandeepThokala commented 10 months ago

Parsing intermediate data structures for graph creatiion

# replacing usage of pull.headers
seq.info <- read.csv("./data/seq_info.csv")
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)
# replacing usage of ape::dist.dna for calculating genetic distance using TN93 model
edge.info <- read.csv("./data/edge_info.csv")
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)

Resulted in error

Error in create.graph(seq.info, edge.info, which.new) : 
  The pairwise distance matrix does not contain all recognized headers 
         from alignment

This input validation in graph.setup.R script caused the error

  if (!all(colnames(edge.info) %in% seq.info$Header)) {
    stop("The pairwise distance matrix does not contain all recognized headers 
         from alignment")
  }
SandeepThokala commented 10 months ago

Solving the errors by making use of pull.headers function

seq.info <- read.csv("./data/seq_info.csv")

seqs <- list()
for (header in seq.info$Header) {
  seqs[[header]] <- header
}

seq.info <- pull.headers(seqs, sep="_", var.names=c('accession', 'coldate', 'subtype'),
                         var.transformations=c(as.character, as.Date, as.factor))
max.year <- max(year(seq.info$coldate))
which.new <- which(year(seq.info$coldate) == max.year)

edge.info <- read.csv("./data/edge_info.csv")
colnames(edge.info) <- lapply(colnames(edge.info), function(x) {
     seq.info$Header[seq.info$accession == strsplit(x, "_")[[1]][1]]
})
rownames(edge.info) <- colnames(edge.info)
g <- create.graph(seq.info, edge.info, which.new)

Finally able to create graph data structure using above script

ArtPoon commented 9 months ago

I've refactored a lot of the code and merged my work from iss12 branch into this one, but I don't think we're out of the woods yet. I am still having problems reproducing the graph clustering results with data files other than the mysterious example_tn93.txt one.

ArtPoon commented 9 months ago

I'm going to go ahead and merge this PR. master branch is broken anyhow, so I am going to commit my ongoing work directly to that branch