asardaes / dtwclust

R Package for Time Series Clustering Along with Optimizations for DTW
https://cran.r-project.org/package=dtwclust
GNU General Public License v3.0

Fuzzy medoids clustering produces probabilities that don't sum to 1 #50

Closed PrachiLoliencar closed 2 years ago

PrachiLoliencar commented 3 years ago

Hi,

I have been using tsclust with the following call:

hdl <- tsclust(datalist, type = "fuzzy", k = 2, centroid = "fcmdd",
               control = fuzzy_control(fuzziness = 2, iter.max = 100L, delta = 0.001,
                                       packages = character(0L), symmetric = FALSE,
                                       version = 2L, distmat = NULL))

I ran this with similar data and got sensible results, but this particular call produced the following probabilities (taken from @fclust). Note that I have 74 observations; I have truncated the table because the remaining rows are also 0.5 / 0.5. The last row does not sum to 1. What could cause such strange results? I also ran this with fuzziness parameter 1.3 and got identical results.

Thanks!

Cluster Prob 1 Prob 2
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 0.5 0.5
1 1 1
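
For reference, this is roughly how I checked the row sums (just a sketch):

probs <- hdl@fclust                        # fuzzy membership matrix
which(abs(rowSums(probs) - 1) > 1e-6)      # rows whose probabilities don't sum to 1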
asardaes commented 3 years ago

Hello, I'd presume you have repeated time series in your data, and they were chosen as medoids. Say series A and B in your data are identical, and both are chosen as separate medoids:

For series X: distance to A = m, distance to B = m
For series A: distance to A = 0, distance to B = 0

Therefore, the probability of X belonging to cluster A or cluster B is the same, but for series A itself (or B), the fuzzy update formula returns NaN because the zero distance from A to its own medoid ends up divided by itself, and dtwclust simply substitutes any NaN with 1.
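
To illustrate with the standard fuzzy c-means membership update (a simplified sketch, not the package's exact internals):

# Memberships of each series given its distances d to the cluster centroids:
# u_j = 1 / sum_k (d_j / d_k)^(2 / (m - 1)), with fuzziness m.
fuzzy_memberships <- function(dist_to_centroids, m = 2) {
    e <- 2 / (m - 1)
    t(apply(dist_to_centroids, 1L, function(d) {
        1 / sapply(d, function(dj) sum((dj / d)^e))
    }))
}

# Series X is equidistant from the two (identical) medoids; series A *is* one of them
d <- rbind(X = c(5, 5), A = c(0, 0))
fuzzy_memberships(d)
#   [,1] [,2]
# X  0.5  0.5
# A  NaN  NaN   <- these NaN values are what gets replaced with 1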

PrachiLoliencar commented 3 years ago

Hi Alexis,

Thanks so much for your kind help, and apologies for the late response. I have been trying to figure out the issue with the time series, but I am confused because they are all of different lengths (so no two series can be identical). Could it be that some distances are close to 0? Calculating the distance matrix by itself was taking too long (both with your package and with tsclust in Python), so I am assuming there is some issue or there are singularities in my data.

However, speaking of the similar data that gave sensible results, I was curious to try the L2 distance with reinterpolation, and I still get strange results. For one, the call

tsclust(datlist, type = "fuzzy", k = 4, distance = "L2", centroid = "fcmdd",
        control = fuzzy_control(fuzziness = 1.3, iter.max = 100L, delta = 0.001,
                                packages = character(0L), symmetric = TRUE,
                                version = 2L, distmat = NULL))

produces the error:

Error in do.call(".External", c(list(CFUN, x, y, pairwise, if (!is.function(method)) get(method) else method), :
  not a scalar return value
In addition: Warning message:
In sqrt(crossprod(x - y)) : NaNs produced

I tried using the sqrt function on the positive definite matrix with rows [2, -1] and [-1, 2], and it produces NaNs where the negative entries are (i.e. it does not give a matrix square root factorization). Regardless, I computed the distance matrix separately and supplied it to the distmat argument:

tsclust(datlist, type = "fuzzy", k = 4, distance = "L2", centroid = "fcmdd",
        control = fuzzy_control(fuzziness = 1.3, iter.max = 100L, delta = 0.001,
                                packages = character(0L), symmetric = TRUE,
                                version = 2L, distmat = distmat))

and I got the following result, again with lines omitted for brevity. For some reason, one of the points was made the centroid of two different clusters. In this case, the distmat has large values off the diagonal. I am assuming, then, that the optimization perhaps produces only three unique centroids?

Cluster Prob 1 Prob 2 Prob 3 Prob 4
1 0.456952 0.04909 0.456952 0.037007
1 0.464018 0.041997 0.464018 0.029967
1 0.407307 0.094758 0.407307 0.090628
1 1 0 1 0
1 0.457591 0.050517 0.457591 0.034302
1 0.349449 0.119016 0.349449 0.182085
1 0.435793 0.076686 0.435793 0.051727
4 0 0 0 1
1 0.340437 0.17217 0.340437 0.146956
1 0.392626 0.144563 0.392626 0.070185
1 0.41 0.103521 0.41 0.076479
1 0.427475 0.084141 0.427475 0.060909
1 0.380208 0.117806 0.380208 0.121777
1 0.483235 0.018754 0.483235 0.014777
2 0 1 0 0
1 0.354937 0.149603 0.354937 0.140523
asardaes commented 3 years ago

That wouldn't surprise me. I remember at least one or two occasions when I used partitional clustering for some k and the iterations kept being unstable because only k - 1 distinct centroids were selected, so another one had to be constantly reinitialized at random. That's just practical experience, though; I'm not sure about the theory behind it. What do you see if you set trace = TRUE? And if you try partitional clustering with centroid = "pam"? Do you see the convergence criterion behaving erratically?
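
For example, something along these lines (just a sketch, reusing the names from your snippets):

pc <- tsclust(datlist, type = "partitional", k = 4L,
              distance = "L2", centroid = "pam",
              seed = 1L, trace = TRUE,
              control = partitional_control(iter.max = 100L))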

The fact that the distance matrix can't be calculated directly already makes me think there's something weird. What do you see in the distance matrix you calculated? Are there many near-zero entries?
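
Something like this on the matrix you already computed would show that (again just a sketch, with distmat as in your call above):

dm <- as.matrix(distmat)
sum(dm < 1e-6) - nrow(dm)                               # near-zero entries besides the diagonal
which(dm < 1e-6 & row(dm) != col(dm), arr.ind = TRUE)   # which pairs are (nearly) identical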

PrachiLoliencar commented 3 years ago

Hi Alexis,

I think you're right about the unstable results! I repeated the experiment twice, with fuzzy clustering and with PAM, but with only 30 data points to save time, and the results vary.

For fuzzy clustering I got:

Precomputing distance matrix...
Iteration 1: Changes / Distsum = 30 / 21055137
Iteration 2: Changes / Distsum = 0 / 21055137
Elapsed time is 3723.03 seconds.

I am unsure of what this means though (could not find Distsum in the documentation).

Clusters from the two runs:

Run 1: 1 1 1 1 1 1 3 1 1 4 2 4 4 1 1 1 1 1 1 1 1 1 1 3 1 3 1 1 1 1
Run 2: 2 1 3 3 4 2 4 3 3 2 3 3 3 3 3 4 3 3 3 4 3 4 2 4 3 4 3 4 2 4

Partitional clustering with PAM was a little worse, as it sometimes gives me clusters with only one element:

Precomputing distance matrix...
Iteration 1: Changes / Distsum = 30 / 21075566
Iteration 2: Changes / Distsum = 3 / 19590664
Iteration 3: Changes / Distsum = 1 / 19517853
Iteration 4: Changes / Distsum = 0 / 19517853
Elapsed time is 4743.28 seconds.

Precomputing distance matrix...
Iteration 1: Changes / Distsum = 30 / 24257620
Iteration 2: Changes / Distsum = 10 / 18620416
Iteration 3: Changes / Distsum = 0 / 18620416
Elapsed time is 15672.65 seconds.

and the respective clusinfo calls:

Run 1:
  size   av_dist
1   26  649270.5
2    2 1318410.1
3    1       0.0
4    1       0.0

Run 2:
  size   av_dist
1    1       0.0
2   22  589413.0
3    2 1255931.0
4    5  628293.6

Strangely, the amount of time required seems to vary as well, for the same data. By the way, how do you reinitialize to prevent ending up with fewer centroids?

The distance matrix I computed had large entries (in the thousands) off the diagonal. I used the usual Euclidean norm between pairs of data points. I was curious whether the code has a bug, since the square root function cannot seem to handle negative values in matrices (even if the matrix is positive definite), thereby producing errors. Then again, it seemed to handle things just fine with less data.
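
For concreteness, a minimal version of that computation looks roughly like this (just a sketch; datlist as before, reinterpolated to a common length first):

len <- max(lengths(datlist))
series <- reinterpolate(datlist, new.length = len)            # equal-length series
mat <- do.call(rbind, series)                                 # one series per row
distmat <- as.matrix(proxy::dist(mat, method = "Euclidean"))  # symmetric distance matrix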

Thanks again for your help, and for your package!

asardaes commented 3 years ago

A couple of things. I'm not sure your observations count as unstable; when I had that problem, the Distsum value sometimes increased instead of decreasing. In that scenario, dtwclust automatically initializes a new cluster if needed. In fact, I have a somewhat contrived test for that:

library(dtwclust)

# Reinterpolate all series to a common length
data_reinterpolated <- reinterpolate(CharTraj, new.length = max(lengths(CharTraj)))

RNGversion("3.5.0")

pc_cr <- tsclust(data_reinterpolated, k = 20L,
                 distance = "lbk", centroid = "pam",
                 seed = 31231L, trace = TRUE,
                 control = partitional_control(iter.max = 10L, version = 1L),
                 args = tsclust_args(dist = list(window.size = 19L)))

Source code (note that this is only for partitional clustering).

I'm a bit confused about your statement regarding a "positive definite" matrix; perhaps there's a misunderstanding due to terminology. When you use distance = "L2" (or "L1"), the calculation still treats each time series as a vector, so even if we say "Euclidean norm", it still computes a Euclidean distance between vectors. That, however, shouldn't have issues with negative values.
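
For example (illustrative only, with two arbitrary equal-length vectors):

x <- c(0, 1, 2, 3)
y <- c(1, 1, 4, 3)
proxy::dist(rbind(x, y), method = "L2")  # proxy's "L2" is the plain Euclidean distance
sqrt(sum((x - y)^2))                     # same value (~2.236); the intermediate sum of squares is never negative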