Closed feng-1985 closed 7 years ago
Hey, can you make elect_data available and I'll take a look? Thanks.
Thank you! How can I make the data available to you?
If you could Dropbox it to me today, I'm planning a bug fix update soon.
https://www.dropbox.com/s/0v41q45yvn9ahzh/test_gm_data.csv?dl=0 I uploaded the data to Dropbox, thank you!
It is monthly time series data from January 2012 to December 2015 for 11524 individuals. I want to cluster these individuals based on the time dimension.
I can't reproduce it. Can you try the version currently in branch hotfix/twobugs and confirm whether the issue is now resolved?
Loading required package: Matrix
> library(readr)
> elect_data <- read_csv("~/Downloads/test_gm_data.csv")
Parsed with column specification:
cols(
.default = col_double()
)
See spec(...) for full column specifications.
> str(elect_data)
<snip>
> library(largeVis)
Loading required package: Rcpp
> library(magrittr)
> ts_matrix <- elect_data %>% scale() %>% t()
> visObj <- largeVis(ts_matrix, n_trees = 50, K = 10, verbose = TRUE)
Searching for neighbors.
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Warning message:
In largeVis(ts_matrix, n_trees = 50, K = 10, verbose = TRUE) :
The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
> plot(t(visObj$coords))
> clusters <- hdbscan(visObj, verbose = TRUE)
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
> gplot(clusters, t(visObj$coords))
Warning message:
Removed 1337 rows containing missing values (geom_segment).
>
How do I use that version?
devtools::install_github("elbamos/largeVis", ref = "hotfix/twobugs")
It's too slow. Is there a faster way to download it?
No. It takes a fraction of a second from here.
Yes, it works!
But why is there an "NA" in the plot? I can't upload the image. Do you see that in your plot?
Yes. Points will have cluster NA if the algorithm does not put them in a cluster. You can review the documentation on the algorithm for detail if you'd like.
I'm going to close this now - feel free to reopen if anything comes up.
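To make the NA labels mentioned above concrete, here is a minimal base R sketch. It assumes the cluster assignments are available as a factor in which unclustered points are NA; the `labels` vector is made-up illustration data, not output from the dataset in this thread.

```r
# Hypothetical cluster labels standing in for real hdbscan assignments.
# NA marks points that were not placed in any cluster.
labels <- factor(c(1, 1, 2, NA, 2, NA, 1))

# Count points per cluster, including the unclustered ones
print(table(labels, useNA = "ifany"))

# Share of points left unclustered
noise_fraction <- mean(is.na(labels))

# Recode NA as an explicit level so a legend shows "noise" instead of NA
labels_plot <- addNA(labels)
levels(labels_plot)[is.na(levels(labels_plot))] <- "noise"
```

A large `noise_fraction` is not an error in itself; it only says that many points did not meet the density requirement for any cluster.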
Thank you!
@bifeng There was a bug in the version of largeVis that you tested a week ago. The bug caused the hdbscan algorithm to fail to combine clusters that should be combined. If you try the version that I've just pushed, it should produce better results on your dataset.
I am also encountering the same problem.
> load('C:/lab/normdata.Rdata')
> library(largeVis)
> library(ggplot2)
> norm <- scale(norm)
> l <- largeVis(norm,verbose=T)
Searching for neighbors.
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
> clusters <- largeVis::hdbscan(l,verbose=T)
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - :
non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
Can you make your data available to me and I'll take a look tonight? I was sure I fixed this.
I'd rather not publicly post the data. Can I email it to you?
Sure, or email me a Dropbox link. My email is in the git
I couldn't reproduce it. Are you sure you're using a current version?
I installed it with
devtools::install_github("elbamos/largeVis", ref = "hotfix/twobugs")
Is this correct?
No - the hotfix was rolled in ages ago. Just install from master. Leave out the "ref" parameter.
I've reinstalled from master and it's still throwing the same error.
> h <- hdbscan(vis, verbose=T)
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - :
non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
That's very odd. Can you send me the log or a screenshot of a complete session? Start from an empty environment, load largeVis, check the version, and try the commands in just the way I did them?
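Since the question here is which build is actually installed, a small check along these lines could go at the top of a fresh session. This is only a sketch: `installed_version` is a hypothetical helper written for illustration, not part of largeVis, and it uses base R only.

```r
# Hypothetical helper: return the installed version of a package as a string,
# or NA if the package is not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

# Report the version before running anything else
print(installed_version("largeVis"))
```

Pasting the printed version together with the output of sessionInfo() shows whether a reinstall actually replaced the older build.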
> load('C:/lab/normdata.Rdata')
> library(largeVis)
> library(ggplot2)
> sessionInfo()
R version 3.4.0 (2017-04-21)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows >= 8 x64 (build 9200)
Matrix products: default
locale:
[1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C
[5] LC_TIME=English_United States.1252
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] ggplot2_2.2.1 largeVis_0.2.2 Matrix_1.2-9
loaded via a namespace (and not attached):
[1] colorspace_1.3-2 scales_0.4.1 compiler_3.4.0 lazyeval_0.2.0 plyr_1.8.4
[6] tools_3.4.0 gtable_0.2.0 tibble_1.3.3 Rcpp_0.12.11 grid_3.4.0
[11] rlang_0.1.1 munsell_0.4.3 lattice_0.20-35
> norm <- scale(norm)
> vis <- largeVis(norm,verbose=T)
Searching for neighbors.
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Calculating edge weights...
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
Estimating embeddings.
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
> plot(t(vis$coords))
> h <- hdbscan(vis, verbose=T)
0% 10 20 30 40 50 60 70 80 90 100%
|----|----|----|----|----|----|----|----|----|----|
**************************************************|
********Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - :
non-numeric argument to binary operator
In addition: Warning message:
In largeVis(norm, verbose = T) :
The Distances between some neighbors are large enough to cause the calculation of p_{j|i} to overflow. Scaling the distance vector.
Thanks, I was able to reproduce this.
The error is coming up because hdbscan can’t cluster this data.
This is because it has a huge number of duplicate points in it.
When the number of duplicates grows large, both largeVis and hdbscan stop being well defined, since they depend on finding each point’s K nearest neighbors and duplicate points sit at distance zero from each other.
You can try to force a clustering by lowering minPts and K. But really I think the question to ask is whether you want to de-dupe this data before you try to cluster and visualize it. If you don’t want to de-dupe it, does using a nearest-neighbor algorithm make sense?
I will include a check for this in the next largeVis version.
Thanks again for reporting!
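The duplicate check suggested above can be sketched in base R before calling largeVis. The toy matrix below is made up for illustration, and it assumes one observation per row; if your observations are in columns, pass MARGIN = 2 to duplicated(). Note that this only catches exact duplicates.

```r
# Toy data standing in for the real matrix: one observation per row (assumption)
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)
x <- rbind(x, x[1, ], x[1, ], x[2, ])  # append three exact duplicate rows

# duplicated() on a matrix flags rows that repeat an earlier row
is_dup <- duplicated(x)
n_dup <- sum(is_dup)       # number of duplicate rows
dup_share <- mean(is_dup)  # share of rows that are duplicates

# Keep only the first occurrence of each distinct row
x_unique <- x[!is_dup, , drop = FALSE]

print(c(duplicates = n_dup, remaining = nrow(x_unique)))
```

If `dup_share` is high, a single point and its copies can fill a whole neighbor list, which is the situation described above.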
library(largeVis)
set.seed(123)
ts_matrix_elec <- elect_data %>% scale() %>% t()
visObject <- largeVis(ts_matrix_elec, n_trees = 50, K = 10)
plot(t(visObject$coords))
clusters <- hdbscan(visObject, verbose = FALSE) # failed
Error in stats::aggregate(probs, by = list(clusters), FUN = "max")$probs - :
non-numeric argument to binary operator
gplot(clusters, t(visObject$coords))
What happened? Do you have any suggestions?