Closed gjmulder closed 4 years ago
Since you're using the default centroid (PAM), the calculated cross-distance matrix has num_ts ^ 2
elements, which already is larger than .Machine$integer.max
for num_ts = 46341
. This means that the matrix has to be indexed internally with something like long int
, and apparently the proxy
package cannot handle something related to that ("L2"
distance is provided by proxy
).
I believe the distances included with dtwclust
wouldn't have this specific limitation, in fact I tested this and it worked:
library(dtwclust)
series <- lapply(1:46342, function(ignored) { rnorm(5L) })
dm <- proxy::dist(series, series, method = "lbk", window.size = 1L)
This means that the problem you're having is due to the proxy
package. Nevertheless, I wonder if it's worth adding an implementation of Manhattan/Euclidean distance within dtwclust
to both leverage multi-threading and avoid this problem.
Changing from PAM to median seemed to fix things, thanks. I produced a somewhat inefficient method for estimating sqrt(.Machine$integer.max)
. Thanks for the quick response and solution!
I have started my investigation using the defaults, as you will observe that I have rather a lot of TS. I am using your supplied reinterpolate function to ensure my TS all have the same length.
I suspect any form of DTW is going to take considerable time to run, even with 32 cores and 64GB of memory I have available. Do you have any suggestions on what methods to use / avoid in terms of computational complexity and memory util?
If z-normalization isn't an issue, k-Shape might be a good starting point. If you definitely need time warping, do take a look at dtw_lb
with DBA or even mean for centroid calculation.
If your use case can use foreach
parallelization, you'd probably benefit from few R processes with many threads per process, say 4 processes with 8 threads each.
Make sure to give the vignette at least one full read. It's pretty long, but it should help you identify at least which things you definitely shouldn't use.
Thanks for your help. Great package for experimentation and exploration.
Reproduced with R version 3.4.4 and R version 3.6.1, using dtwclust both from CRAN and version 5.5.5 from Github.
The following code will reproduce the issue. Note that it takes considerable memory (> 16GB) and time to run the 46341 working test case, while 46342 throws the error immediately.