JinmiaoChenLab / Rphenograph

Rphenograph: R implementation of the PhenoGraph algorithm
47 stars 26 forks source link

Rphenograph data output issue? #18

Open flomato opened 2 years ago

flomato commented 2 years ago

Trying to run the Rphenograph function; it seems to remove one entry.

Run Rphenograph starts: -Input data of 366047 rows and 8 columns -k is set to 10

But the output matrix has 366046 rows.

Wondering if someone may have insights to this please?

Thanks!

SamGG commented 2 years ago

Hi, I think I already saw this problem. I think it is solved in my current fork https://github.com/i-cyto/Rphenograph or in Jan's fork https://github.com/stuchly/Rphenoannoy. Could you test one or the other? This will be quicker than pointing the exact solution. IMHO k=10 is too low, 30 or 50 are better in my hands. Best.

DillonHammill commented 2 years ago

@flomato and @SamGG, the problem occurs quite frequently and it the result of there being neighbourhoods in the kNN graph that do not overlap with any other neighbourhoods (i.e. jaccard index = 0).

As a result there are nodes that don't make their way into the edgelist for graph construction and are therefore excluded from clustering too.

The solution is to set an outlier label for excluded nodes and make sure that the order of the labels matches that of the nodes in the graph.

See python docs for details: https://pypi.org/project/PhenoGraph/

library(Rphenograph)
library(igraph)

res <- Rphenograph(iris, k = 5)
labels <- rep(0, nrow(iris))
labels[as.numeric(V(res[[1]])$name))] <- res[[2]]$membership
SamGG commented 2 years ago

@DillonHammill I think I solved this in my code https://github.com/i-cyto/Rphenograph/blob/86ceedeb2c41ecb98b1b2a7c980d91f66fa62e6c/R/phenograph.R#L154. I saw it a few times, but didn't remember the dataset. Do you have a dataset to check this problem?

DillonHammill commented 2 years ago

Try running iris dataset with k=3.

res <- Rphenograph(iris[, 1:4], k = 3)
length(res[[2]]$membership)
nrow(iris)

The line you point to doesn't actually do anything. links is an edgelist with columns from, to and weight - so that code is actually excluding from (first column) with values less than or equal to zero. Changing it to links <- links[links[,3]>0, ] won't help either as those edges don't exist in the edgelist matrix returned by jaccard_coef().

As you can see in the source code for jaccard_coef(), edges with jaccard_index = 0 are excluded from the edgelist: https://github.com/JinmiaoChenLab/Rphenograph/blob/0298487f0ee13aac55eb77d19992f6bd878ba2fc/src/jaccard_coeff.cpp#L26

You could probably just make those nodes go from and to themselves with zero weight and perhaps then they will be included in the graph and clustering output. Alternatively, the approach I mentioned above works in a similar way to that proposed in the python version of phenograph.

SamGG commented 2 years ago

This issue has been reported and solved by EK Becht https://github.com/JinmiaoChenLab/Rphenograph/pull/8 .I used it in my code at https://github.com/i-cyto/Rphenograph/blob/86ceedeb2c41ecb98b1b2a7c980d91f66fa62e6c/src/jaccard_coeff.cpp#L217-L236 I think the following code might show that it works. Of course, I use my fork of Rphenograph. I think Ian's code is also safe. Thanks for your devs, Samuel

data("iris")

library(RANN)
library(igraph)
#> Warning: package 'igraph' was built under R version 4.0.5
#> 
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#> 
#>     decompose, spectrum
#> The following object is masked from 'package:base':
#> 
#>     union

dt = iris[,-5]
K = 3
set.seed(0)

# compute nearest neighbors
neighborMatrix <- nn2(dt, k=K+1)[[1]][,-1]

# some data points failed to be reported by original Jaccard index calculation
links <- Rphenograph:::jaccard_coeff(neighborMatrix)
links <- links[links[,1]>0, ]
length(sort(unique(c(links[,1], links[,2]))))
#> [1] 144

# this is not the case in the improved implementations
links <- Rphenograph:::jaccard_coeff_4(neighborMatrix, K)
links <- links[links[,1]>0, ]
length(sort(unique(c(links[,1], links[,2]))))
#> [1] 150

relations <- as.data.frame(links)
colnames(relations)<- c("from","to","weight")
g <- graph.data.frame(relations, directed=FALSE)

# Louvain and Leiden clustering
comm_louv <- cluster_louvain(g)
comm_leid <- cluster_leiden(g, resolution_parameter = .01)
# this resolution leads to a similar amount of clusters
length(unique(membership(comm_leid)))
#> [1] 65
length(unique(membership(comm_louv)))
#> [1] 61

# both report the correct count of membership
length(membership(comm_leid))
#> [1] 150
length(membership(comm_louv))
#> [1] 150

# compare both, for fun
compare(comm_louv, comm_leid, method = "adj")
#> [1] 0.8104326
# range values for the criterium
compare(comm_louv, comm_louv, method = "adj")
#> [1] 1
compare(membership(comm_louv), membership(comm_louv), method = "adj")
#> [1] 1
compare(membership(comm_louv)[sample.int(nrow(dt))], membership(comm_louv), method = "adj")
#> [1] 0.01677979

Created on 2022-09-12 by the reprex package (v2.0.1)

DillonHammill commented 2 years ago

Thanks @SamGG, any reason why this hasn't been fixed on this repo? Are you able to get your changes merged?

SamGG commented 2 years ago

Double no. No answer at https://github.com/JinmiaoChenLab/Rphenograph/issues/12 and look at PRs... If you feel lucky, try to contact the PI.