DataSlingers / clustRviz

Compute Convex (Bi)Clustering Solutions via Algorithmic Regularization
https://DataSlingers.github.io/clustRviz/
GNU General Public License v3.0

CARP crash #54

Closed alexpghayes closed 5 years ago

alexpghayes commented 5 years ago

Running into the following:

> carp_fit <- CARP(tf, labels = track_features$track_uri)
"X" should be a matrix, not a data.frame. Converting with as.matrix(). (Called from CARP)
Pre-computing weights and edge sets
Error in if (num_connected_old == num_connected) { : 
  missing value where TRUE/FALSE needed

Also, it's taking ~10 minutes just to get to this error on a 2600 by 11 matrix -- is that expected?

Can send data for a reprex if necessary.

michaelweylandt commented 5 years ago

Thanks. Will check out Tuesday or Wednesday.

Without profiling, I think that most of the slowness comes from running dist on your data, since that check (for a connected weight graph) occurs before the actual CARP code starts.

Data to reproduce would be useful since I haven’t seen this before. Counting the number of connected components shouldn’t be producing NA....

You can also send directly to me if it’s sensitive.

M


alexpghayes commented 5 years ago

Data at: https://drive.google.com/file/d/1bBV2ECfmXhoaS6kcav1giSCG7QnjGZh8/view?usp=drivesdk

library(dplyr)
library(clustRviz)

# This link is slightly different from the one above so that it works for direct downloads
track_features <- readr::read_csv("https://drive.google.com/uc?export=download&id=1OfewOmkgVpzgkr1_xvW2wepnD3wYdeDr")

tf <- track_features %>%
  select_if(is.numeric)

carp_fit <- CARP(tf, labels = track_features$track_uri)

michaelweylandt commented 5 years ago

Thanks for sending -- I'm playing with it now and finding a few things on the slowness.

I still haven't found the source of the error itself (because of the slowness), but so far:

i) The internal is_connected_adjacency_mat function uses an algorithm that scales badly: essentially, it checks all paths of length 1, then all paths of length 2, etc., to make sure everything is connected. There's some linear algebra to make this faster than it might seem, but it's obviously still not great. (See the first sketch after this list.)

I can speed this up by reusing a more efficient algorithm already in the C++ code. This takes the check time from around 30s to 0.2s at each iteration.

ii) Our default weighting scheme works by picking the smallest k-nearest-neighbor graph that is still fully connected. (See https://dataslingers.github.io/clustRviz/articles/ClustRVizWeights.html)

For this data, it looks like we need a very high value of k (still running, but at least > 250), which means we have lots of weights/edges to handle, and that makes everything else slow.

This seems to be caused by a few outliers that are not among the nearest neighbors of anything else. (Take a look at pairs((tf %*% prcomp(tf)$rotation)[, 1:3]) to see what I mean.)

iii) We find the smallest value of k by starting at k = 1, then trying k = 2, etc., until we find something that works. This is not efficient when the desired k is large. I can think about a bisection heuristic which might speed things up. (See the second sketch below.)
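
For reference, the faster check in i) amounts to a breadth-first search, which touches each edge once and uses no matrix multiplications. Below is a minimal R sketch of both ideas, assuming adj is a plain symmetric 0/1 adjacency matrix; the function names are illustrative, and the actual clustRviz implementation lives in its C++ code.

# Slow idea: grow the reachability set by matrix multiplication.
# Each squaring costs O(n^3), and raw path counts (without the > 0
# thresholding) can overflow.
is_connected_by_powers <- function(adj) {
  n <- nrow(adj)
  reach <- 1 * ((adj + diag(n)) > 0)          # reachable in <= 1 step
  repeat {
    reach_new <- 1 * ((reach %*% reach) > 0)  # squaring doubles the path length
    if (all(reach_new > 0)) return(TRUE)
    if (identical(reach_new, reach)) return(FALSE)  # fixed point, still disconnected
    reach <- reach_new
  }
}

# Fast idea: breadth-first search from vertex 1 -- O(n + #edges), no multiplications
is_connected_bfs <- function(adj) {
  n <- nrow(adj)
  visited <- logical(n)
  queue <- 1L
  visited[1L] <- TRUE
  while (length(queue) > 0L) {
    v <- queue[1L]
    queue <- queue[-1L]
    nbrs <- which(adj[v, ] != 0 & !visited)
    visited[nbrs] <- TRUE
    queue <- c(queue, nbrs)
  }
  all(visited)
}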
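
For iii), a doubling-plus-bisection search is valid because adding neighbors only ever adds edges, so connectivity is monotone in k. Another sketch, reusing is_connected_bfs from above; knn_adjacency and smallest_connected_k are hypothetical helpers, not clustRviz functions.

# Hypothetical helper: symmetric k-nearest-neighbor adjacency from a distance matrix
knn_adjacency <- function(dmat, k) {
  n <- nrow(dmat)
  adj <- matrix(0, n, n)
  for (i in seq_len(n)) {
    nbrs <- order(dmat[i, ])[2:(k + 1)]  # skip position 1 (the point itself)
    adj[i, nbrs] <- 1
  }
  1 * ((adj + t(adj)) > 0)               # symmetrize
}

smallest_connected_k <- function(dmat) {
  n <- nrow(dmat)
  lo <- 1L  # invariant: every k below lo gives a disconnected graph
  hi <- 1L
  # Phase 1: double k until the k-NN graph is connected
  while (!is_connected_bfs(knn_adjacency(dmat, hi))) {
    lo <- hi + 1L
    hi <- min(2L * hi, n - 1L)  # k = n - 1 is the complete graph, always connected
  }
  # Phase 2: bisect on [lo, hi], using monotonicity of connectivity in k
  while (lo < hi) {
    mid <- (lo + hi) %/% 2L
    if (is_connected_bfs(knn_adjacency(dmat, mid))) hi <- mid else lo <- mid + 1L
  }
  hi
}

This needs O(log k) connectivity checks instead of k of them, which matters when the smallest workable k is in the hundreds, as it appears to be here.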

michaelweylandt commented 5 years ago

I might have spoken a bit too soon re: ii above.

I think a small k would give a connected graph, but the weight (which decays rapidly with the distance from the outlier to the bulk) is getting rounded to zero somewhere. Still diving into it...

alexpghayes commented 5 years ago

I think there are some duplicated data points, in case that's a concern.

alexpghayes commented 5 years ago

I did try running things with the duplicates removed, but it was still crashing.

michaelweylandt commented 5 years ago

Ok -- I've found the problem that leads to the immediate error: it's essentially a numeric overflow getting turned into an NA, which then breaks a conditional. That's an easy fix, since my new check for connectedness doesn't use multiplications.
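
As a toy illustration of that failure mode (not the actual package code): integer overflow in R produces NA, and an NA inside if () throws exactly the error reported at the top of this issue.

x <- .Machine$integer.max  # 2147483647L, the largest R integer
y <- x + 1L                # integer overflow: y is NA (with a warning)
if (y == x) {              # NA propagates through ==, so this line throws:
  # Error in if (y == x) { : missing value where TRUE/FALSE needed
}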

Still working on the apparent need for a large k.

michaelweylandt commented 5 years ago

Ok -- I've found the problem:

library(dplyr)
library(readr)
library(matrixStats)
track_features <- readr::read_csv("https://drive.google.com/uc?export=download&id=1OfewOmkgVpzgkr1_xvW2wepnD3wYdeDr")
tf <- track_features %>% select_if(is.numeric) %>% as.matrix

dtf <- as.matrix(dist(tf))
diag(dtf) <- Inf # Essentially ignore the diagonal, b/c weights to self are meaningless
big_distance <- max(colMins(dtf)) # largest nearest-neighbor distance: the most isolated point's gap to the rest

stopifnot(exp(-big_distance) == 0)

Essentially, (at least) one observation is so far away from every other point that, under our standard weighting scheme of w_ij = exp(-d_ij^2), all of its weights underflow to zero and it cannot be clustered.
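
The cutoff is a property of double precision arithmetic rather than of clustRviz: exp(-x) underflows to exactly zero once x exceeds about 745, so any pairwise distance beyond sqrt(745) ~ 27.3 yields a weight of exactly zero.

exp(-745)        # ~ 4.9e-324, the smallest representable positive double
exp(-746)        # exactly 0: underflow
exp(-28^2) == 0  # TRUE -- a distance of 28 already zeroes the weight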

I can definitely catch this earlier and give a clearer error message, but I'm not sure about the best way to handle it if we do want to cluster this data. Saying "your data is too well separated to cluster" seems like a pretty pathetic cop-out...

In this case, standardizing the columns of X seems to fix it, but I'm not sure whether that's correct here (given the domain) or a generally robust answer.
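
Concretely, that workaround would look like the following (whether standardization is appropriate for these audio features being the open question):

tf_std <- scale(tf)  # center each column and rescale to unit variance
carp_fit <- CARP(tf_std, labels = track_features$track_uri)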

michaelweylandt commented 5 years ago

Partial fix (for the speed stuff, not the zero weight issue) in #55.

michaelweylandt commented 5 years ago

I've added a check for totally disconnected vertices as part of PR #55 -- it's not a full solution, since it only moves the error earlier (it avoids checking all possible graph sparsification levels), but it should be a bit more user-friendly.

michaelweylandt commented 5 years ago

Partial fix merged in #55. Opening a new issue for a more permanent solution.