elbamos / largeVis

An implementation of the largeVis algorithm for visualizing large, high-dimensional datasets, for R

randomProjectionTreeSearch fails on division by zero with SparseMatrix #35

Closed amodig closed 7 years ago

amodig commented 7 years ago

Hello,

I'm getting an error that I cannot trace back. First, I create a sparse data matrix:

data <- read_csv("my_data.csv", col_names = FALSE)
data <- as.matrix(data)
data <- t(data)
data <- Matrix(data, sparse = TRUE)

When I give the sparse matrix to randomProjectionTreeSearch, the R session ends with the following message:

terminate called after throwing an instance of 'std::logic_error'
  what():  element-wise division: division by zero

My data set has 50k samples and 5 dimensions, so I was still able to run it without a sparse matrix by following the guideline in the vignette, but this could become a memory issue in the future (my workstation has only 16 GB of memory).

Thank you for the excellent package!
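
For data shaped like this (a few scaled features, many samples), it may be worth checking whether a sparse representation saves any memory at all. The following is a minimal sketch on synthetic data, not the reporter's file: the `rnorm` matrix is an assumed stand-in for a scaled dataset with essentially no exact zeros, and the variable names are illustrative only.

```r
library(Matrix)

set.seed(1)
# Synthetic stand-in (assumption): 5 features x 50,000 samples of scaled values
dense <- matrix(rnorm(5 * 50000), nrow = 5)
sparse <- Matrix(dense, sparse = TRUE)

# Fraction of entries that are non-zero (1 means the matrix is fully dense)
density <- nnzero(sparse) / prod(dim(sparse))

# Compare the in-memory size of the two representations
dense_bytes <- as.numeric(object.size(dense))
sparse_bytes <- as.numeric(object.size(sparse))

cat("density:", density, "\n")
cat("dense:", dense_bytes, "bytes; sparse:", sparse_bytes, "bytes\n")
```

When almost every entry is non-zero, the compressed column format stores a row index alongside each value, so the sparse object tends to be larger than the plain matrix. In that situation the dense input is probably the better choice regardless of this bug.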

elbamos commented 7 years ago

Thanks for your report.

Can I ask you to do a few things? First, try the version that was accepted by CRAN today, which is the same as the last 0.1.10rc on here, and let us know if that works. Second, if you're using cosine as your distance measure, can you try with Euclidean? Third, can you post the code and dataset?

(In reply to Arttu Modig's message of Nov 14, 2016, 12:47 PM.)

amodig commented 7 years ago

It also failed with the CRAN version. I'm using Euclidean distances.

The data is here (available for a while): https://dl.dropboxusercontent.com/u/12007410/debug/data_scaled.csv

I made a quick abridged version of my full code (including all the libraries I use).

library(tidyverse)
library(Matrix)
library(largeVis)
library(hexbin)
library(tictoc)

data <- read_csv("data_scaled.csv", col_names = FALSE)
data <- as.matrix(data)
cat("Samples:", dim(data)[1], ", features:", dim(data)[2], "\n")

# duplicates can cause the algorithm to fail
cat("Removing duplicates...\n")
dupes <- duplicated(data)  # logical mask; safe even when there are no duplicates
data <- data[!dupes, , drop = FALSE]
cat("Removed", sum(dupes), "duplicates.\n")

# transpose and sparsify (input features are rows and examples are columns)
data <- t(data)
data <- Matrix(data, sparse = TRUE)

cat("Computing largeVis...\n")
max_iter <- 5
k <- 50

tic(paste0("K = ", k))
cat(paste0("LargeVis with k = ", k, ", max_iter = ", max_iter, "\n"))
cat("Random projection tree search...\n")
neighbors <- randomProjectionTreeSearch(data, max_iter = max_iter, K = k)  # fails on division by zero with sparse matrix
cat("Building edge matrix...\n")
edges <- buildEdgeMatrix(data = data, neighbors = neighbors)
# save edge information
# cat("Saving...\n")
# saveRDS(neighbors, paste0("neighbors_k", k, "i", max_iter, ".rds"))
# saveRDS(edges, paste0("edges_k", k, "i", max_iter, ".rds"))
# free memory
rm(neighbors)
gc()
wij <- buildWijMatrix(edges)
rm(edges)
gc()
cat("Project KNNs...\n")
coords <- projectKNNs(wij)
toc()

cat("Write and plot...\n")
# write.table(t(coords), paste0("coords_k", k, "i", max_iter, ".csv"), sep=",", row.names = FALSE, col.names = FALSE)
df <- data.frame(x = t(coords)[,1], y = t(coords)[,2])
ggplot(df, aes(x=x, y=y)) + geom_point(alpha = 0.2, size = 1) + ggtitle(paste0("max_iter = ", max_iter, ", K = ", k))
# ggsave(paste0("coords_k", k, "i", max_iter, ".png"))
cat("Done.\n")

elbamos commented 7 years ago

OK, I'm able to reproduce this. Thanks for reporting; I'll take a look and try to fix it this weekend.

elbamos commented 7 years ago

@Dalar I have a fix up for testing. It's here in branch 'hotfix/sparsedivzero'. I've tested it with your dataset and code. Could you give it a try as well and confirm that the issue is resolved? Thanks...

amodig commented 7 years ago

@elbamos Yes, the code ran fine. Thanks for solving this so fast!

elbamos commented 7 years ago

Great! Please use the one from that branch; it'll probably be a few weeks before the change is rolled into CRAN.
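
For anyone following along, installing from that branch could look roughly like the sketch below. It assumes the `remotes` package is installed and that the `hotfix/sparsedivzero` branch still exists in the `elbamos/largeVis` repository; neither is confirmed by this thread.

```r
# install.packages("remotes")  # uncomment if remotes is not installed yet

# Install largeVis from the hotfix branch instead of CRAN
# (assumption: the branch is still available on GitHub)
remotes::install_github("elbamos/largeVis", ref = "hotfix/sparsedivzero")
```

Once the fix reaches CRAN, a regular `install.packages("largeVis")` should be enough.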

(In reply to Arttu Modig's message of Nov 21, 2016, 6:12 AM.)