Open flomato opened 2 years ago
Hi, I think I already saw this problem. I think it is solved in my current fork https://github.com/i-cyto/Rphenograph or in Jan's fork https://github.com/stuchly/Rphenoannoy. Could you test one or the other? This will be quicker than pointing the exact solution. IMHO k=10 is too low, 30 or 50 are better in my hands. Best.
@flomato and @SamGG, the problem occurs quite frequently and it the result of there being neighbourhoods in the kNN graph that do not overlap with any other neighbourhoods (i.e. jaccard index = 0).
As a result there are nodes that don't make their way into the edgelist for graph construction and are therefore excluded from clustering too.
The solution is to set an outlier label for excluded nodes and make sure that the order of the labels matches that of the nodes in the graph.
See python docs for details: https://pypi.org/project/PhenoGraph/
library(Rphenograph)
library(igraph)
res <- Rphenograph(iris, k = 5)
labels <- rep(0, nrow(iris))
labels[as.numeric(V(res[[1]])$name))] <- res[[2]]$membership
@DillonHammill I think I solved this in my code https://github.com/i-cyto/Rphenograph/blob/86ceedeb2c41ecb98b1b2a7c980d91f66fa62e6c/R/phenograph.R#L154. I saw it a few times, but didn't remember the dataset. Do you have a dataset to check this problem?
Try running iris dataset with k=3
.
res <- Rphenograph(iris[, 1:4], k = 3)
length(res[[2]]$membership)
nrow(iris)
The line you point to doesn't actually do anything. links
is an edgelist with columns from
, to
and weight
- so that code is actually excluding from
(first column) with values less than or equal to zero. Changing it to links <- links[links[,3]>0, ]
won't help either as those edges don't exist in the edgelist matrix returned by jaccard_coef()
.
As you can see in the source code for jaccard_coef()
, edges with jaccard_index = 0
are excluded from the edgelist:
https://github.com/JinmiaoChenLab/Rphenograph/blob/0298487f0ee13aac55eb77d19992f6bd878ba2fc/src/jaccard_coeff.cpp#L26
You could probably just make those nodes go from
and to
themselves with zero weight and perhaps then they will be included in the graph and clustering output. Alternatively, the approach I mentioned above works in a similar way to that proposed in the python version of phenograph.
This issue has been reported and solved by EK Becht https://github.com/JinmiaoChenLab/Rphenograph/pull/8 .I used it in my code at https://github.com/i-cyto/Rphenograph/blob/86ceedeb2c41ecb98b1b2a7c980d91f66fa62e6c/src/jaccard_coeff.cpp#L217-L236 I think the following code might show that it works. Of course, I use my fork of Rphenograph. I think Ian's code is also safe. Thanks for your devs, Samuel
data("iris")
library(RANN)
library(igraph)
#> Warning: package 'igraph' was built under R version 4.0.5
#>
#> Attaching package: 'igraph'
#> The following objects are masked from 'package:stats':
#>
#> decompose, spectrum
#> The following object is masked from 'package:base':
#>
#> union
dt = iris[,-5]
K = 3
set.seed(0)
# compute nearest neighbors
neighborMatrix <- nn2(dt, k=K+1)[[1]][,-1]
# some data points failed to be reported by original Jaccard index calculation
links <- Rphenograph:::jaccard_coeff(neighborMatrix)
links <- links[links[,1]>0, ]
length(sort(unique(c(links[,1], links[,2]))))
#> [1] 144
# this is not the case in the improved implementations
links <- Rphenograph:::jaccard_coeff_4(neighborMatrix, K)
links <- links[links[,1]>0, ]
length(sort(unique(c(links[,1], links[,2]))))
#> [1] 150
relations <- as.data.frame(links)
colnames(relations)<- c("from","to","weight")
g <- graph.data.frame(relations, directed=FALSE)
# Louvain and Leiden clustering
comm_louv <- cluster_louvain(g)
comm_leid <- cluster_leiden(g, resolution_parameter = .01)
# this resolution leads to a similar amount of clusters
length(unique(membership(comm_leid)))
#> [1] 65
length(unique(membership(comm_louv)))
#> [1] 61
# both report the correct count of membership
length(membership(comm_leid))
#> [1] 150
length(membership(comm_louv))
#> [1] 150
# compare both, for fun
compare(comm_louv, comm_leid, method = "adj")
#> [1] 0.8104326
# range values for the criterium
compare(comm_louv, comm_louv, method = "adj")
#> [1] 1
compare(membership(comm_louv), membership(comm_louv), method = "adj")
#> [1] 1
compare(membership(comm_louv)[sample.int(nrow(dt))], membership(comm_louv), method = "adj")
#> [1] 0.01677979
Created on 2022-09-12 by the reprex package (v2.0.1)
Thanks @SamGG, any reason why this hasn't been fixed on this repo? Are you able to get your changes merged?
Double no. No answer at https://github.com/JinmiaoChenLab/Rphenograph/issues/12 and look at PRs... If you feel lucky, try to contact the PI.
Trying to run the Rphenograph function; it seems to remove one entry.
Run Rphenograph starts: -Input data of 366047 rows and 8 columns -k is set to 10
But the output matrix has 366046 rows.
Wondering if someone may have insights to this please?
Thanks!