gagolews / genieclust

Genie: Fast and Robust Hierarchical Clustering with Noise Point Detection - in Python and R
https://genieclust.gagolewski.com

mst() problem and emst_mlpack producing a Fatal error in R. #76

Closed estevezdo closed 1 year ago

estevezdo commented 1 year ago

I am having issues applying gclust() to my data. I get the following error: Error in .mst.default(d, distance, M, cast_float32, verbose) : genieclust: Assertion std::isfinite(Dnn[bestj]) failed in ./c_mst.h:489

When I set verbose = TRUE, these are the details I get: [genieclust] Computing the MST. [genieclust] Computing the MST... 99%Error in .mst.default(d, distance, M, cast_float32, verbose) : genieclust: Assertion std::isfinite(Dnn[bestj]) failed in ./c_mst.h:489

I am attaching the matrix I am using (the first column holds the row names of my matrix).

data.csv

gagolews commented 1 year ago

Well, this is a very "interesting" dataset, because it also crashes MLPACK O_O

> X <- read.csv("https://github.com/gagolews/genieclust/files/10005060/data.csv")
X <- as.matrix(X[, -1])
mlpack::emst(X)$output
Segmentation fault (core dumped)

I will keep trying to find out what's wrong with it. Meanwhile, the following seems to work:

h <- genieclust::gclust(dist(X))
print(h)

gagolews commented 1 year ago

(note to self: genieclust::mst.dist is correct)

set.seed(123)
X <- read.csv("https://github.com/gagolews/genieclust/files/10005060/data.csv")
X <- as.matrix(X[, -1])

stopifnot(abs(
    genieclust::gclust(dist(X), gini_threshold=1.0)$height 
    - 
    fastcluster::hclust.vector(X, "single")$height
) < 1e-12)

## OK

gagolews commented 1 year ago

Mystery solved!

X contains missing values (NA), and it should not.

> arrayInd(which(is.na(X)), dim(X))
      [,1] [,2]
 [1,]   58    2
 [2,]   58    5
 [3,]   59    5
 [4,]   58    6
 [5,]   59    7
 [6,]   58    9
 [7,]   58   10
 [8,]   62   10
 [9,]   58   12
[10,]   62   12
[11,]   71   12
[12,]   58   14
[13,]   62   14
[14,]   49   16
[15,]   58   16
[16,]   62   16
[17,]   58   17
[18,]   58   20
[19,]   59   20

I will patch the method so that it throws an error if there are missing values in the data.
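
Until such a check is in place, missing values can be detected and handled before clustering. The following is only a minimal sketch: the toy matrix and the names `na_pos` and `X_clean` are made up for illustration, and dropping incomplete rows is just one possible choice (imputation may suit some data better). The `gclust()` call runs only if genieclust is installed.

```r
# Toy data standing in for the matrix from this issue (values are made up).
set.seed(123)
X <- matrix(rnorm(40), nrow = 10, ncol = 4)
X[3, 2] <- NA   # inject missing values on purpose
X[7, 4] <- NA

# Locate the missing entries as (row, column) pairs, as done above.
na_pos <- arrayInd(which(is.na(X)), dim(X))
print(na_pos)

# One possible remedy (an assumption, not the package's behaviour):
# keep only the rows that have no missing values.
X_clean <- X[complete.cases(X), , drop = FALSE]
stopifnot(!anyNA(X_clean))

# Cluster only if genieclust is available.
if (requireNamespace("genieclust", quietly = TRUE)) {
    h <- genieclust::gclust(X_clean)
    print(h)
}
```

Note that removing rows changes which observations end up in the dendrogram, so the row names of `X_clean` should be used when matching cluster labels back to the original data.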

estevezdo commented 1 year ago

Thanks. This was very useful. BTW, it might be helpful to feature integration with popular heatmap packages such as ComplexHeatmap. The only thing that needs explaining is that most of these packages cluster both rows and columns, so the column clustering has to be computed on the transposed matrix. For example, in the ComplexHeatmap package the two options need to be set as follows, where M is some matrix: cluster_rows = gclust(M) and cluster_columns = gclust(t(M)).

Example syntax: Heatmap(M, cluster_rows = gclust(M), cluster_columns = gclust(t(M)))

gagolews commented 1 year ago

Nice use case, thanks!

Still, as a fan of minimalism, I'd rather refrain from introducing such functionality separately - it can be easily obtained manually (at the cost of an additional call to the built-in transpose, as you've kindly shown above).