asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

negative length vectors are not allowed - dtwclust with L2 distance works with 46341 TS but not 46342 TS #44

Closed gjmulder closed 4 years ago

gjmulder commented 4 years ago
[1] 46342
Error in do.call(".External", c(list(CFUN, x, y, pairwise, if (!is.function(method)) get(method) else method),  : 
  negative length vectors are not allowed
Calls: tsclust ... <Anonymous> -> .proxy_external -> do.call -> .External
Execution halted

Reproduced with R version 3.4.4 and R version 3.6.1, using dtwclust both from CRAN and version 5.5.5 from Github.

The following code will reproduce the issue. Note that it takes considerable memory (> 16GB) and time to run the 46341 working test case, while 46342 throws the error immediately.

library(dtwclust)
for (num_ts in 46341:46342) {
  print(num_ts)
  cl_k_nrep <- tsclust(
    lapply(1:num_ts, function(x) 0),
    k = 2,
    distance = "l2"
  )
}
asardaes commented 4 years ago

Since you're using the default centroid (PAM), the calculated cross-distance matrix has num_ts ^ 2 elements, which is already larger than .Machine$integer.max for num_ts = 46341. This means the matrix has to be indexed internally with something like a long int, and apparently the proxy package cannot handle something related to that ("L2" distance is provided by proxy).
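The overflow threshold mentioned above can be checked directly in base R. A quick sketch (assuming the standard build where .Machine$integer.max is 2^31 - 1; the squaring is done in double precision so the check itself cannot overflow):

```r
# Number of elements in an n x n cross-distance matrix, as a double
elems <- function(n) as.numeric(n)^2

elems(46340) <= .Machine$integer.max  # still fits ordinary integer indexing
elems(46341) >  .Machine$integer.max  # exceeds it, requiring long-vector indexing
```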

I believe the distances included with dtwclust don't have this specific limitation; in fact, I tested this and it worked:

library(dtwclust)
series <- lapply(1:46342, function(ignored) { rnorm(5L) })
dm <- proxy::dist(series, series, method = "lbk", window.size = 1L)

This means that the problem you're having is due to the proxy package. Nevertheless, I wonder if it's worth adding an implementation of Manhattan/Euclidean distance within dtwclust, both to leverage multi-threading and to avoid this problem.

gjmulder commented 4 years ago

Changing from PAM to median seemed to fix things. It turns out I had produced a somewhat inefficient method for estimating sqrt(.Machine$integer.max). Thanks for the quick response and solution!
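For reference, that threshold can be computed directly rather than discovered empirically. A one-line sketch in base R:

```r
# Largest n for which an n x n matrix has at most .Machine$integer.max elements
n_limit <- floor(sqrt(.Machine$integer.max))
n_limit  # 46340 on builds where integer.max is 2^31 - 1
```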

I started my investigation using the defaults; as you will have noticed, I have rather a lot of TS. I am using your supplied reinterpolate function to ensure my TS all have the same length.

I suspect any form of DTW is going to take considerable time to run, even with the 32 cores and 64GB of memory I have available. Do you have any suggestions on what methods to use or avoid in terms of computational complexity and memory utilization?

asardaes commented 4 years ago

If z-normalization isn't an issue, k-Shape might be a good starting point. If you definitely need time warping, do take a look at dtw_lb with DBA or even mean for centroid calculation.

If your use case can use foreach parallelization, you'd probably benefit from a few R processes with many threads per process, say 4 processes with 8 threads each.
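A minimal sketch of that kind of setup, assuming the doParallel backend is installed (the 4-process figure is from the suggestion above, and the k-Shape parameters are illustrative):

```r
library(dtwclust)
library(doParallel)

# A few worker processes; dtwclust picks up any registered foreach backend
workers <- makeCluster(4L)
registerDoParallel(workers)

# k-Shape: z-normalized series, shape-based distance, shape extraction
series <- lapply(1:100, function(i) rnorm(50L))
cl <- tsclust(series, k = 2L,
              preproc = zscore, distance = "sbd", centroid = "shape")

stopCluster(workers)
```

The per-process thread count would be controlled separately, e.g. via the multi-threading options dtwclust exposes in its documentation.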

Make sure to give the vignette at least one full read. It's pretty long, but it should help you identify at least which things you definitely shouldn't use.

gjmulder commented 4 years ago

Thanks for your help. Great package for experimentation and exploration.