elbamos / largeVis

An implementation of the largeVis algorithm for visualizing large, high-dimensional datasets, for R

hdbscan-non-numeric argument to binary operator #44

Closed feng-1985 closed 7 years ago

feng-1985 commented 7 years ago

library(largeVis)
set.seed(123)
ts_matrix_elec <- elect_data %>% scale() %>% t()
visObject <- largeVis(ts_matrix_elec, n_trees = 50, K = 10)
plot(t(visObject$coords))

clusters <- hdbscan(visObject, verbose = FALSE) # failed
Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - : non-numeric argument to binary operator

gplot(clusters, t(visObject$coords))

What happened? Do you have any suggestions?

elbamos commented 7 years ago

Hey can you make elect_data available and I'll look? thanks.

feng-1985 commented 7 years ago

Thank you! How can I make the data available to you?

elbamos commented 7 years ago

If you could Dropbox it to me today, I'm planning a bug fix update soon.

feng-1985 commented 7 years ago

https://www.dropbox.com/s/0v41q45yvn9ahzh/test_gm_data.csv?dl=0 I uploaded the data to Dropbox, thank you!

It's a monthly time series from January 2012 to December 2015 covering 11524 individuals. I want to cluster these individuals based on the time dimension.

elbamos commented 7 years ago

I can't reproduce it. Can you try the version currently in branch hotfix/twobugs and confirm if the issue is now resolved?

Loading required package: Matrix
> library(readr)
> elect_data <- read_csv("~/Downloads/test_gm_data.csv")
Parsed with column specification:
cols(
  .default = col_double()
)
See spec(...) for full column specifications.
> str(elect_data)
<snip>
> library(largeVis)
Loading required package: Rcpp
> library(magrittr)
> ts_matrix <- elect_data %>% scale() %>% t()
> visObj <- largeVis(ts_matrix, n_trees = 50, K = 10, verbose = TRUE)
Searching for neighbors.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Warning message:
In largeVis(ts_matrix, n_trees = 50, K = 10, verbose = TRUE) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
> plot(t(visObj$coords))
> clusters <- hdbscan(visObj, verbose = TRUE)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
> gplot(clusters, t(visObj$coords))
Warning message:
Removed 1337 rows containing missing values (geom_segment). 
> 
feng-1985 commented 7 years ago

How do I use that version?

elbamos commented 7 years ago

devtools::install_github("elbamos/largeVis", ref = "hotfix/twobugs")

feng-1985 commented 7 years ago

It's too slow. Is there a faster way to download it?

elbamos commented 7 years ago

No.  It takes a fraction of a second from here.  

feng-1985 commented 7 years ago

Yes, it works!

But why is there "NA" in the plot? I can't upload the image. Do you see that in your plot?

elbamos commented 7 years ago

Yes. Points will have cluster NA if the algorithm does not put them in a cluster. You can review the documentation on the algorithm for detail if you'd like.
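To see how many points ended up without a cluster, one can count the NA labels. The following is a minimal base R sketch that uses a made-up label vector in place of real hdbscan output. The name `labels` and its values are hypothetical, and how the labels are stored in the object returned by hdbscan is not shown in this thread.

```r
# Hypothetical cluster labels; NA marks points not assigned to any cluster.
labels <- factor(c(1, 1, 2, NA, 2, NA, 1))

# Count points per cluster, including the unassigned ones.
label_counts <- table(labels, useNA = "ifany")
print(label_counts)

# Number and share of unassigned points.
n_unassigned <- sum(is.na(labels))
share_unassigned <- mean(is.na(labels))
cat("unassigned points:", n_unassigned, "\n")
cat("share unassigned:", share_unassigned, "\n")
```

A large share of NA labels may simply mean that many points sit in sparse regions, so it is worth checking this number before reading much into the plot.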

I'm going to close this now - feel free to reopen if anything comes up.

feng-1985 commented 7 years ago

Thank you!

elbamos commented 7 years ago

@bifeng There was a bug in the version of largeVis that you tested a week ago. The bug caused the hdbscan algorithm to fail to combine clusters that should be combined. If you try the version that I've just pushed, it should produce better results on your dataset.

clin045 commented 7 years ago

I am also encountering the same problem.

> load('C:/lab/normdata.Rdata')

> library(largeVis)

> library(ggplot2)

> norm <- scale(norm)

> l <- largeVis(norm,verbose=T)
Searching for neighbors.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|

> clusters <- largeVis::hdbscan(l,verbose=T)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs -  : 
  non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
elbamos commented 7 years ago

Can you make your data available to me and I'll take a look tonight? I was sure I fixed this.

clin045 commented 7 years ago

I'd rather not publicly post the data. Can I email it to you?

elbamos commented 7 years ago

Sure, or email me a Dropbox link. My email is in the git.

elbamos commented 7 years ago

I couldn't reproduce it. Are you sure you're using a current version?

[Two screenshots attached: 2017-06-30 at 1:12:48 pm and 1:12:57 pm]

clin045 commented 7 years ago

I installed it with

devtools::install_github("elbamos/largeVis", ref = "hotfix/twobugs")

Is this correct?

elbamos commented 7 years ago

No - the hotfix was rolled in ages ago. Just install from master. Leave out the "ref" parameter.

clin045 commented 7 years ago

I've reinstalled from master and it's still throwing the same error.

> h <- hdbscan(vis, verbose=T)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs -  : 
  non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
elbamos commented 7 years ago

That's very odd. Can you send me the log or a screenshot of a complete session? Start from an empty environment, load largeVis, check the version, and try the commands in just the way I did them?
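A version check along these lines could be run at the top of a fresh session. This is only a sketch: `installed_version` is a hypothetical helper written for illustration, not part of largeVis, and it relies only on base R's `requireNamespace` and `utils::packageVersion`.

```r
# Hypothetical helper: return the installed version of a package as a
# string, or NA if the package is not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# In a fresh R session (no saved workspace), report what is installed
# before rerunning the commands from the thread.
print(installed_version("largeVis"))
print(R.version.string)

# To reinstall from master, as suggested earlier in the thread:
# devtools::install_github("elbamos/largeVis")
```

`sessionInfo()` gives the same information in more detail, including the versions of attached dependencies.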

clin045 commented 7 years ago
> load('C:/lab/normdata.Rdata')

> library(largeVis)

> library(ggplot2)

> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)

Matrix products: default

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] ggplot2_2.2.1  largeVis_0.2.2 Matrix_1.2-9  

loaded via a namespace (and not attached):
 [1] colorspace_1.3-2 scales_0.4.1     compiler_3.4.0   lazyeval_0.2.0   plyr_1.8.4      
 [6] tools_3.4.0      gtable_0.2.0     tibble_1.3.3     Rcpp_0.12.11     grid_3.4.0      
[11] rlang_0.1.1      munsell_0.4.3    lattice_0.20-35 

> norm <- scale(norm)

> vis <- largeVis(norm,verbose=T)
Searching for neighbors.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|

> plot(t(vis$coords))

> h <- hdbscan(vis, verbose=T)
0%   10   20   30   40   50   60   70   80   90   100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs -  : 
  non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
  The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
elbamos commented 7 years ago

Thanks, I was able to reproduce this. 

The error is coming up because hdbscan can’t cluster this data.  

This is because it has a huge number of duplicate points in it. 

When the number of duplicates grows, then both largeVis and hdbscan become undefined since they depend on finding each point’s n-nearest neighbors. 
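A quick way to check whether a dataset is affected is to count exact duplicates and compare the largest group of identical points with the neighbor count. The sketch below uses toy data and base R only. It assumes points are stored as rows (transpose first if they are columns), and the value of `K` is a hypothetical choice for this example.

```r
# Toy data: 6 points (rows) in 2 dimensions, with repeated rows.
x <- rbind(c(0, 0), c(0, 0), c(0, 0), c(1, 2), c(1, 2), c(3, 4))

# Rows that repeat an earlier row.
n_dup <- sum(duplicated(x))

# Size of each group of identical rows, keyed by the row contents.
# Note: paste() formats numbers to about 15 significant digits.
keys <- apply(x, 1, paste, collapse = "\r")
group_sizes <- table(keys)
largest_group <- max(group_sizes)

# If a group is larger than the neighbor count K, every neighbor of a
# point in that group can be an exact copy at distance zero.
K <- 2  # hypothetical neighbor count for this toy example
n_big_groups <- sum(group_sizes > K)

cat("duplicate rows:", n_dup, "\n")
cat("largest identical group:", largest_group, "\n")
cat("groups larger than K:", n_big_groups, "\n")
```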

You can try to force a clustering by adjusting minPts and K (down).  But really I think the question you want to ask is whether you want to de-dupe this data before you try to cluster and visualize it and, if you don’t want to de-dupe it, whether using a nearest-neighbor algorithm makes sense? 
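If de-duplicating is acceptable, one possible approach is to cluster only the distinct points and then copy each label back to its duplicates. This is a base R sketch on toy data, assuming points are rows. `labels_unique` is a hypothetical stand-in for a clustering result and does not come from hdbscan.

```r
# Toy data: points are rows; some rows are exact copies.
x <- rbind(c(0, 0), c(0, 0), c(1, 2), c(1, 2), c(3, 4))

# Keep the first occurrence of each distinct row.
keep <- !duplicated(x)
x_unique <- x[keep, , drop = FALSE]

# Map every original row to its representative in x_unique, so labels
# computed on x_unique can be copied back to all original rows.
keys <- apply(x, 1, paste, collapse = "\r")
map <- match(keys, keys[keep])

# Hypothetical labels for the unique rows, standing in for a clustering
# result; expand them back to the full data set.
labels_unique <- c("a", "b", "c")
labels_all <- labels_unique[map]
print(labels_all)
```

Whether dropping duplicates is appropriate depends on the data: if repeated observations carry meaning (for example, as weights), removing them changes the density that a nearest-neighbor method sees.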

I will include a check for this in the next largeVis version.

Thanks again for reporting!

On July 7, 2017 at 2:50:09 PM, Christopher Lin (notifications@github.com) wrote:

h <- hdbscan(vis, verbose=T)