Closed seanlaw closed 2 years ago
In order to accomplish this, we'll first need a way to re-compute the Matrix Profile Subspace (see this issue)
The work on MDL and subspaces is now complete!
@SaVoAMP said:
Since k can be selected automatically now, shouldn't we generalize the functions stumpy.motifs and stumpy.matches to the multidimensional case for our motif discovery tutorial?
@SaVoAMP Would you be open to tackling this first (and I can assist)? So, this would be adding multi-dimensional support to `stumpy.motifs` and `stumpy.match`, while the tutorial in #518 would come after this issue is resolved.
Sure, I could try!
So you want that the already existing functions can be generalized to the multidimensional case somehow? Or do you want something like `stumpy.Mmotifs` and `stumpy.Mmatch` that only works together with `mstump`?
And how should I proceed then? Should I open a new Python script or write into the predefined functions and add a variable like `multi_dim` that is `False` per default, but `True` if I want to use `stumpy.motifs` and `stumpy.match` together with `mstump` for the multidimensional motif discovery?
So you want that the already existing functions can be generalized to the multidimensional case somehow? Or do you want something like `stumpy.Mmotifs` and `stumpy.Mmatch` that only works together with `mstump`?
So, instead of forcing ourselves to perfect things upfront, I think it would be more pragmatic to simply work on writing analogous `stumpy.Mmotifs` and `stumpy.Mmatch` functions that resemble the logic of `stumpy.motifs` and `stumpy.match` as closely as possible. I think that once we see those functions then we can move toward integrating them together if it makes sense. However, if it doesn't make sense (e.g., if it would make it too complicated to jam everything together) then we could modify `stumpy.motifs` a tiny bit by adding a special decorator to it that simply detects the dimensions of the matrix profile and automatically hands over the work to `stumpy.Mmotifs`. However, let's worry about this part later! First, let's write `stumpy.Mmotifs` and `stumpy.Mmatch` functions that mimic `stumpy.motifs` and `stumpy.match`, respectively.
And how should I proceed then? Should I open a new Python script or write into the predefined functions and add a variable like `multi_dim` that is `False` per default, but `True` if I want to use `stumpy.motifs` and `stumpy.match` together with `mstump` for the multidimensional motif discovery?
I think you can modify motifs.py and add two new functions there called `def Mmotifs()` and `def Mmatch()` for now. In fact, in the best case scenario (I don't know if this is even possible), maybe `Mmotifs` does the `mdl` work but just ends up calling `motifs()` once `Mmotifs` determines the subspace to use. Again, it's hard to know ahead of time what is "better" so we need to be okay with taking the first few rounds just to learn "good vs. bad" and treat them as "throw away". Then, as wiser people, we can try it again. How does that sound to you?
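The idea above could be sketched roughly like this (a toy sketch with stand-in names: `one_d_motifs`, `mmotifs_sketch`, and the tiny `P`/`mdls`/`subspaces` inputs are illustrative only, not STUMPY's actual API):

```python
import numpy as np

def one_d_motifs(P_1d):
    # Stand-in for the existing 1-D stumpy.motifs: return the best motif index
    return int(np.argmin(P_1d))

def mmotifs_sketch(P, mdls, subspaces):
    k = int(np.argmin(mdls))        # the MDL step picks the number of dimensions
    subspace = subspaces[k]         # the dimensions that carry the motif
    P_k = P[subspace].mean(axis=0)  # reduce to a 1-D profile over that subspace
    return one_d_motifs(P_k), subspace

# Toy inputs: a 3-dimensional matrix profile with 3 subsequence positions
P = np.array([[4.0, 1.0, 3.0], [5.0, 2.0, 6.0], [9.0, 8.0, 7.0]])
mdls = [30.0, 12.0, 25.0]
subspaces = [[0], [0, 1], [0, 1, 2]]
motif_idx, subspace = mmotifs_sketch(P, mdls, subspaces)
```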
First, let's write a stumpy.Mmotifs and stumpy.Mmatch function that mimics stumpy.motifs and stumpy.match, respectively.
Alright, I will start today with the `stumpy.Mmotifs` function and will then continue to work on it whenever I have some time besides physics! After my last exam in February at the latest, I will have enough time to work on it.
Sounds good @SaVoAMP! Thank you for letting me know
Note that you'll need to clone the latest development version of STUMPY to use it.
Stupid question: "Where can I find the development branch?" 😄 I only see the main branch there.
I have already forked it like the tutorial suggested (I am quite new to git).
Stupid question: "Where can I find the development branch?"
Not a stupid question at all (and likely something we should clarify in the contribution guide)! So, the "development version" is what currently sits at https://github.com/TDAmeritrade/stumpy. So, this is the latest, most up-to-date version that is actively being developed. You've forked this development version (which is a copy of the entire repository; this fork sits in your personal space) and so you will then clone your own fork (i.e., you copy from your fork to your local computer). Then, start working on your local clone (it will start off on the "main" branch): you'll want to create a new working branch (maybe call it "multi_motif_match"), switch to that new branch, and then do work on that new branch.
Let me know if you have more questions or want to check your understanding. If you'd like, you might consider starting by creating a new branch for updating the contributor guide (with any missing or unclear explanation) and submitting a quick PR just to see how things work.
Ahh, I think I have already done it, but didn't recognize that `stumpy.mdl` is located in mstump.py. I was searching for it under stumpy and had therefore thought that I had done something wrong.
Thank you!
Today I finished the first short version of `Mmotifs` using MDL.
I don't know why, but I get different results when I first find `k` manually and compute the subspace with `stumpy.subspace` afterwards, and then find `k` and the subspace with `stumpy.mdl`.
If I manually compute `k` as shown, I would suggest that `k=9`. But MDL returns `k=6`.
If I still choose `k=6` and compute the subspace afterwards, I also get different subspaces compared to the subspace resulting after calling `stumpy.motifs`. Therefore I also get completely different motifs and nearest neighbors. I think one of the two subspace computations might cause the trouble. Any ideas on how to narrow it down?
Of course I might have missed something in my code, but I didn't do anything else with the subspaces and therefore I think they should be the same if I choose the same `k`. If you want to take a look at the code, I have just pushed it to my branch of the fork.
I don't know why, but I get different results when I try to find k manually and compute the subspace with stumpy.subspace afterwards and then I tried to find k and the subspace with 'stumpy.mdl'.
So, I did make a change to `stumpy.subspace` so that, given the same `subseq_idx` and `nn_idx`, it should match the subspace output for `stumpy.mdl` (though, you would need to select the `k`th dimension from the MDL subspace output). For your awareness, the MDL computed subspace first takes each pair of multi-dimensional subsequences and applies a discretization function to them that converts the z-normalized floating point numbers to integers (by binning the floats and assigning them to some 8-bit range of values). Then, it computes the corresponding subspace from this discretized subsequence before using it to compute the bit size.
I've only tested it on the Tutorial example and so it is perfectly reasonable for there to be mistakes that I was not able to catch!
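The discretization step described above might look roughly like this (a hedged sketch; the bin count and exact formula here are assumptions for illustration, not STUMPY's verbatim implementation):

```python
import numpy as np

def discretize(subseq, n_bit=8):
    # Map a z-normalized subsequence onto an n_bit integer range by binning,
    # i.e., floats -> 8-bit integers, as described above
    return np.round(
        (subseq - subseq.min()) / (subseq.max() - subseq.min()) * (2**n_bit - 1)
    ).astype(np.int64)

disc = discretize(np.array([0.0, 0.5, 1.0]))  # → [0, 128, 255]
```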
If I manually compute k as shown I would suggest that k=9. But MDL returns k=6.
Can you plot the corresponding bit size vs `k` plot instead? Presumably, the minimum point is at `k = 6`?
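Such a plot could be produced along these lines (the `mdls` values here are made up purely for illustration; `stumpy.mdl` would supply the real bit sizes):

```python
import numpy as np
import matplotlib

matplotlib.use("Agg")  # non-interactive backend
import matplotlib.pyplot as plt

# Made-up bit sizes, one per candidate number of dimensions (zero-based k)
mdls = np.array([810.0, 742.0, 705.0, 688.0, 679.0, 675.0, 681.0, 690.0, 702.0, 715.0])
k = int(np.argmin(mdls))  # the minimum of the bit size vs. k curve selects k

plt.plot(np.arange(len(mdls)), mdls, marker="o")
plt.axvline(k, linestyle="--")
plt.xlabel("k (number of dimensions, zero-based)")
plt.ylabel("bit size")
plt.savefig("mdl_bit_size.png")
```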
If I still choose k=6 and compute the subspace afterwards, I also get different subspaces compared to the subspace resulting after calling stumpy.motifs. Therefore I also get completely different motifs and nearest neighbors. I think one of the two subspace computations might cause the trouble. Any ideas on how to narrow it down?
Can you share the data that you are using along with the commands that cause the discrepancies? It is hard for me to follow the explanation. From what I can extract, it isn't only a matter of choosing `k` but it is also paired up with `subseq_idx` and `nn_idx`. Maybe let's set aside `stumpy.motifs` for now until we can be certain that `stumpy.mdl` and `stumpy.subspace` are doing what we expect.
Of course I might have missed something in my code, but I didn't do anything else with the subspaces and therefore I think they should be the same if I choose the same k. If you want to take a look at the code, I have just pushed it on my branch of the fork.
I will try to take a look. Thank you for bringing it to my attention
@SaVoAMP
Should this be `m = T.shape[1] - P.shape[1] + 1` (instead of `T.shape[0]`) since `T.shape[0]` is the total number of dimensions and not related to the subsequence length?
According to the quote from the original paper:
There are two steps in each iteration: 1) apply the MDL-based method to find the k-dimensional motif with the minimum bit size and 2) remove the found k-dimensional motif by replacing the matrix profile values of the found motif (and its trivial match) to infinity. If we apply the two steps above to the time series shown in Fig. 5, the 3-dimensional motif would be discovered in the first iteration, and the 2-dimensional motif would be discovered in the second iteration.
It feels like `core.apply_exclusion_zone(P, motif_idx, excl_zone, np.inf)` might be doing too much. This would exclude all dimensions and not only a subset of the `k` dimensions. However, currently, `apply_exclusion_zone` doesn't know or understand how to operate on a subset of dimensions. We might need to add an additional parameter like `S` to the end to help specify the subset of dimensions to exclude.
After you've updated the `candidate_idx` at the end of the for-loop, should we also update `nn_idx` with:
candidate_idx = np.argsort(P, axis=1)[:, 0]
nn_idx = I[np.arange(len(candidate_idx)), candidate_idx]
If so, it's probably not a great idea to be changing `nn_idx = np.argmin(D[k, :])` as it can get confusing. It might be better to choose a different variable name.
So, I did make a change to stumpy.subspace so, given the same subseq_idx and nn_idx, it should match the subspace output for stumpy.mdl (though, you would need to select the kth dimension from the MDL subspace output).
Ahh ok, I didn't notice, and therefore I was using the `stumpy.subspace` function of the latest release and not of the development version and compared it to the MDL approach of the development version.
When using both in the development version I obtain the same subspaces. Does that mean that there was a mistake in the version that is currently available and I get wrong outputs there?
Should this be m = T.shape[1] - P.shape[1] + 1 (instead of T.shape[0]) since T.shape[0] is the total number of dimensions and not related to the subsequence length?
I was a bit confused because T had shape (n, d) but P had shape (d, n-m+1) and I wanted to get the length of both and not the number of dimensions. Maybe we need some kind of case query here? But I don't know how we could be sure which is the dimension and which the length. Of course in most cases the length will be greater than the number of dimensions, but I don't know if we can generalize that.
When using both in the development version I obtain the same subspaces. Does that mean that there was a mistake in the version that is currently available and I get wrong outputs there?
I wouldn't necessarily call it a mistake. It's just that `stumpy.mdl` requires an extra step of applying a discretization function to the subsequences before computing the subspace and so, for consistency, I am now enforcing that same discretization in `stumpy.subspace`. However, the discretization can change the selected subspace. I believe that it may be possible to recover the original (un-discretized) subspace by passing in a fake discretization function to `stumpy.subspace` that does nothing:
def fake_disc_func(subseq):
    return subseq

stumpy.subspace(T, m, subseq_idx, nn_idx, k, discretize_func=fake_disc_func)
Let me know if that makes sense
It feels like core.apply_exclusion_zone(P, motif_idx, excl_zone, np.inf) might be doing too much. This would exclude all dimensions and not only a subset of the k dimensions. However, currently, apply_exclusion_zone doesn't know or understand how to operate on a subset of dimensions. We might need to add an additional parameter like S to the end to help specify the subset of dimensions to exclude.
Isn't it possible to write `P[k, :]` instead of `P` then? Like we have done it with the distance profile in order to find the nearest neighbors? But this won't be an option since the multidimensional matrix profile is not composed of stacked 1D matrix profiles?
I was a bit confused because T had shape (n, d) but P had shape (d, n-m+1) and I wanted to get the length of both and not the number of dimensions. Maybe we need some kind of case query here? But I don't know how we could be sure which is the dimension and which the length. Of course in most cases the length will be greater than the number of dimensions, but I don't know if we can generalize that.
In this case, is `T` a `numpy` array or a `pandas` dataframe? If it is the latter then we are missing an important step in `stumpy.motifs` that will convert all dataframes to their corresponding arrays (in the correct orientation) since dataframes have their dimensions swapped from numpy arrays. In STUMPY, we always assume that we are dealing with numpy arrays internally for all computations. Dataframes are accepted but they are always converted first.
Typically, we do this as soon as we enter the function:
T, M_T, Σ_T = core.preprocess(T, m)
The `core.preprocess` function not only computes the mean/stddev but it also does a few more "nice" things:
T = T.copy()
T = transpose_dataframe(T)
T = np.asarray(T)
check_dtype(T)
check_window_size(m, max_size=T.shape[-1])
T[np.isinf(T)] = np.nan
M_T, Σ_T = compute_mean_std(T, m)
T[np.isnan(T)] = 0
return T, M_T, Σ_T
- If `T` is a dataframe then it will transpose it so that its orientation is the same as a numpy array (but does nothing if it is not a dataframe)
- It calls `np.asarray` to ensure that any dataframe will be converted to a numpy array
- It checks that the dtype is `np.float64`

And a few more things. Once those checks are in place then we are in a pretty good place. So, I recommend doing:
T, M_T, Σ_T = core.preprocess(T, m)
At the top of your function and then you can use:
m = T.shape[-1] - P.shape[-1] + 1
Note that I have switched to `[-1]` so that we choose the last dimension instead of the "second" dimension as a precaution.
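A tiny illustration of the orientation issue (toy shapes; in STUMPY, `core.preprocess` performs this conversion for you):

```python
import numpy as np
import pandas as pd

n, d = 100, 3
df = pd.DataFrame(np.random.rand(n, d))  # a dataframe stores one series per COLUMN: shape (n, d)
T = np.asarray(df.T)                     # numpy orientation: one series per ROW: shape (d, n)

m = 20
l = T.shape[-1] - m + 1                  # number of subsequences per dimension
```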
Ahh all right! Thank you! Yes, T is a dataframe. I will change it in the code!
Wait a second. How can we write `T, M_T, Σ_T = core.preprocess(T, m)` if we are computing `m` afterwards and not before?
But I could write these lines of code instead before writing `m = T.shape[-1] - P.shape[-1] + 1`, or?
T = T.copy()
T = core.transpose_dataframe(T)
T = np.asarray(T)
core.check_dtype(T)
Wait a second. How can we write T, M_T, Σ_T = core.preprocess(T, m) if we are computing m afterwards and not before? But I could write this lines of code instead before writing m = T.shape[-1] - P.shape[-1] + 1,or?
Yes, I think what you've proposed should work. This is a rare chicken-and-egg problem 😄
Isn't it possible to write P[k, :] instead of P then? Like we have done it with the Distance Profile in order to find the nearest neighbors? But this won't be an option since the multidimensional matrix profile is not composed of stacked 1D matrix profiles?
I don't know but this simple example doesn't seem to work:
import numpy as np
from stumpy.core import apply_exclusion_zone
from stumpy import config

P = np.random.rand(50).reshape(5, 10)
m = 5
excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))
subspace = [1, 3]
apply_exclusion_zone(P[subspace], 4, excl_zone, np.inf)  # P[subspace] is a copy, so P is left unchanged
P
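The reason this fails is a numpy indexing detail: indexing with a list of dimensions (fancy indexing) returns a copy, not a view, so the exclusion zone is applied to a throwaway array. A minimal demonstration:

```python
import numpy as np

P = np.arange(12.0).reshape(3, 4)
rows = P[[0, 2]]                   # fancy indexing returns a COPY, not a view
rows[:] = np.inf                   # modifying the copy...
unchanged = not np.isinf(P).any()  # ...leaves P untouched

P[[0, 2]] = rows                   # an explicit write-back is required
changed = np.isinf(P[[0, 2]]).all()
```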
However, this is a hack, but should work:
import numpy as np
from stumpy.core import apply_exclusion_zone
from stumpy import config

P = np.random.rand(50).reshape(5, 10)
m = 5
excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))
subspace = [1, 3]
P_subspace = P[subspace]
apply_exclusion_zone(P_subspace, 4, excl_zone, np.inf)
P[subspace] = P_subspace  # write the modified copy back into P
P
I don't love it though and we should probably add an option to `apply_exclusion_zone` to accept a subspace parameter. I think it would be a simple adjustment.
T = T.copy()
T = core.transpose_dataframe(T)
T = np.asarray(T)
core.check_dtype(T)
I added a new issue for this #536
Are you sure that we don't want to apply the exclusion zone to all dimensions of `P` like it is so far?
The picture above is from the multidimensional matrix profile paper. They mention that the lower dimensional motif may or may not be a subset of dimensions in the higher dimensional motif pair. But I think that it doesn't make sense to choose the lower dimensional motif as the second found multidimensional motif if we have found the "same" motif in more dimensions before. Or what do you think? Therefore I am wondering if we shouldn't leave it the way it is?
And while thinking about it, another question came to my mind: "Why is it ok to set the exclusion zone to `D[k]`, so that the `k`-th dimension of the matrix profile that was computed for the motif position is set to infinity?"
I have tried to visualize how the mSTAMP algorithm calculates the first k-dimensional matrix profile value:
If I understood everything correctly, one can obtain one single dimension of the multidimensional distance profile by simply computing the z-normalized Euclidean distances for the query subsequence (in our case the motif) on the corresponding time series. For example, we gain `D_1[2]` (in essence the second dimension of the first distance profile) by computing the distances in the second time series (`T_2`) for the first query subsequence, according to my visualization scheme. Wouldn't this mean that we only eliminate the values around the motif position for the `k`-th dimension of the multidimensional time series, when setting `core.apply_exclusion_zone(D[k, :], motif_idx, excl_zone, np.inf)`? Meaning that we only apply the exclusion zone to one (the `k`-th) dimension instead of all `k` dimensions where the motif is present? Moreover, we don't have the information yet in WHICH dimensions of the whole time series the motif is represented, since we don't include the `subspace` at that point.
Could it be possible that I am now completely confused and mixing everything up?
import numpy as np
from stumpy.core import apply_exclusion_zone
from stumpy import config

P = np.random.rand(50).reshape(5, 10)
m = 5
excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))
subspace = [1, 3]
P_subspace = P[subspace]
apply_exclusion_zone(P_subspace, 4, excl_zone, np.inf)
P[subspace] = P_subspace
P
I don't think that we can do it like this, since the single dimensions of the multidimensional matrix profile don't correspond to the separate time series of the subspace. According to your example, where `subspace = [1, 3]`, you would set the first and third dimension of the matrix profile to `np.inf` near the motif position, I think. But the first and third dimension of the matrix profile don't correspond to the first and third time series, respectively. If I am not mistaken, the first dimension of the matrix profile would give us the best motif in one dimension and the third matrix profile dimension would give us the best motif in three dimensions (we don't know which dimensions until we have computed the subspace) by taking the average of those three dimensions. The third matrix profile dimension wouldn't correspond to the time series with the number 3, which is the information taken from the subspace. Please correct me if I am wrong (it's been a long day and my attention span might be drained since it is very late in Germany 😄)!
Are you sure, that we don't want to apply the exclusion zone to all dimensions of P like it is so far? The picture below is from the multidimensional matrix profile paper. They mention that the lower dimensional motif may or may not be a subset of dimensions in the higher dimensional motif pair. But I think that it doesn't make sense to choose the lower dimensional motif as the second found multidimensional motif, if we have found the "same" motif in more dimensions before. Or what do you think? Therefore I am wondering if we shouldn't leave it the way it is?
Thank you. You have convinced me that, when applying the exclusion zone, we should apply it to all dimensions and not only to the subspace. In fact, our lives would be made a lot easier if we simply applied the exclusion zone to all dimensions.
I don't think that we can do it like this, since the single dimensions of the multidimensional matrix profile don't correspond to the separate time series of the subspace. According to your example, where `subspace = [1, 3]`, you would set the first and third dimension of the matrix profile to `np.inf` near the motif position, I think. But the first and third dimension of the matrix profile don't correspond to the first and third time series, respectively. If I am not mistaken, the first dimension of the matrix profile would give us the best motif in one dimension and the third matrix profile dimension would give us the best motif in three dimensions (we don't know which dimensions until we have computed the subspace) by taking the average of those three dimensions. The third matrix profile dimension wouldn't correspond to the time series with the number 3, which is the information taken from the subspace. Please correct me if I am wrong (it's been a long day and my attention span might be drained since it is very late in Germany 😄)!
Okay, I think you have brought up an important distinction and I have been lazy/imprecise in my code. Many apologies! Technically, in the code above, I probably should have been clearer and wrote:
import numpy as np
from stumpy.core import apply_exclusion_zone
from stumpy import config
P = np.random.rand(50).reshape(5, 10)
m = 5
excl_zone = int(np.ceil(m / config.STUMPY_EXCL_ZONE_DENOM))
mdl = [10.0, 0.3, 15.0, 20.0, 30.0]
subspaces = [[0], [1,3], [0, 3, 4], [0, 2, 3, 4], [0, 1, 2, 3, 4]]
k = np.argmin(mdl)
P_subspace = P[subspaces[k]]
apply_exclusion_zone(P_subspace, 4, excl_zone, np.inf)
P[subspaces[k]] = P_subspace
To be clear, I am no longer advocating for applying the exclusion zone to the subspace so this code is no longer relevant. But, hopefully, this is more accurate in terms of conveying my previous point. Does this make more sense?
Perhaps, when we have the subspaces for multiple dimensions, we should call that "subspaces" (plural) and when we have the `k`th subspace then it is singular (i.e., "subspace")? I was hoping that in my original version `subspace = [1, 3]` would have conveyed that it is coming from the `k = 1` subspace. However, I can see that I've made an assumption in isolation and in haste. If we agree on the use of "plural subspaces" for a list that contains more than one subspace vs "singular subspace" then this means that I should update the Tutorial to use `subspaces` when the subspaces are computed via `stumpy.mdl`. And, for consistency, `stumpy.subspace` only computes the singular subspace for the `k`th dimension. What do you think?
If we agree on the use of "plural subspaces" for a list that contains more than one subspace vs "singular subspace" then this means that I should update the Tutorial to use `subspaces` when the subspaces are computed via `stumpy.mdl`. And, for consistency, `stumpy.subspace` only computes the singular subspace for the `k`th dimension. What do you think?
Yes, I think that would make sense to avoid any misunderstandings.
And while thinking about it, another question came to my mind: "Why is it ok to set the exclusion zone to D[k], so that the k-th dimension of the matrix profile that was computed for the motif position is set to infinity?"
What would you say to this? (I meant D[k, :] instead of D[k] and distance profile instead of matrix profile, I am very sorry!)
I don't understand why setting `core.apply_exclusion_zone(D[k, :], motif_idx, excl_zone, np.inf)` is enough. Here I would do exactly what you wanted to do with the multidimensional matrix profile. Hence I would compute the subspace and set every dimension of the distance profile to infinity (near the starting position of the motif) that corresponds to the found time series dimensions in the computed `k`-th subspace. I would use the code you've suggested for the distance profile instead of the matrix profile, if my understanding of the multidimensional distance profile calculation is correct?
Yes, I think that would make sense to avoid any misunderstandings.
Done!
What would you say to this? (I meant D[k, :] instead of D[k] and distance profile instead of matrix profile, I am very sorry!) I don't understand why setting core.apply_exclusion_zone(D[k, :], motif_idx, excl_zone, np.inf) is enough. Here I would do exactly what you wanted to do with the multidimensional matrix profile. Hence I would compute the subspace and set every dimension of the Distance Profile to infinity (near the starting position of the motif) that corresponds to the found time series dimensions in the computed k-th subspace. I would use the code you've suggested for the distance profile instead of the matrix profile, if my understanding of the multidimensional distance profile calculation is correct?
So, I think that, fundamentally, there are two separate stages where `core.apply_exclusion_zone` needs to be used:

- on `P` (the matrix profile)
- on `D` (the distance profile)

I think you need to do both. In my code above, I was only looking for motifs. However, after you discover your first motif from `P` (the matrix profile), then you'll want to find the top-W matches (I'll use an arbitrary `W` here since we're already using `k` to represent the `k`th subspace) within that subspace that resemble this motif subsequence. Finding matches is where `D` comes into play. So, the sequence of steps might be:

1. Compute `P` (the multi-dimensional matrix profile computed from `mstump`)
2. Find `motif_idx`
3. Find the `nn_idx` that corresponds to `motif_idx`
4. Compute `mdls` and `subspaces` at those `motif_idx` and `nn_idx` locations
5. Find `k` at `np.argmin(mdls)`
6. `motif_idx[k]` is the specific dimension of the motif index that we are interested in along with `subspaces[k]` (i.e., `T[subspaces[k], motif_idx[k] : motif_idx[k] + m - 1]`)
7. Compute `D` for this motif/subspace pair relative to the rest of the time series (but within the same subspace)
8. For each match found in `D`, you apply an exclusion zone so that the same match is not found again, and you repeat this until the top-W matches are found (or you break if `np.inf` is found)
9. Apply an exclusion zone to `P` so that the same motif is not found again in the next iteration. Note that this exclusion zone is applied to ALL dimensions and not only to the `subspaces[k]` dimensions
10. Go back and find the next `motif_idx`

Hopefully, I haven't trivialized it too much and missed some important steps but this is roughly my mental model (based on the 1-dimensional version). Does this match what you are thinking? Any modifications?
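The steps above, boiled down to a toy sketch (everything here — the tiny `P`, the local `apply_exclusion_zone` helper, and the combined argmin — is a simplified stand-in for the real STUMPY machinery):

```python
import numpy as np

def apply_exclusion_zone(a, idx, excl_zone, val):
    # Set a zone around idx to `val` along the last axis, i.e., in ALL dimensions
    a[..., max(0, idx - excl_zone) : idx + excl_zone + 1] = val

# Toy multi-dimensional matrix profile: 3 dimensions x 10 positions
P = np.full((3, 10), 5.0)
P[1, 4] = 1.0  # best motif (k = 1, i.e., 2 dimensions) at position 4
P[1, 8] = 2.0  # next best motif at position 8

excl_zone = 1
motif_indices = []
for _ in range(2):
    # Steps 2/5 collapsed for the toy: locate the best remaining motif
    k, motif_idx = np.unravel_index(np.argmin(P), P.shape)
    if np.isinf(P[k, motif_idx]):
        break
    motif_indices.append(int(motif_idx))
    # Step 9: exclude around the found motif in ALL dimensions, then repeat
    apply_exclusion_zone(P, motif_idx, excl_zone, np.inf)
```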
Yes, I would do it exactly as you described. But I am wondering if we did something wrong in the steps 7 to 9. I think we need to set the exclusion zone to all subspace dimensions as you mentioned in step 7. But currently I am doing the following that we already did in discussion #494:
# Get k-dimensional motif
motif_idx = candidate_idx[k]

# Get multidimensional distance profile
T, mean_T, sigma_T = core.preprocess(sub_dims, m)
D = _multi_distance_profile(
    motif_idx, T, T, m, excl_zone, mean_T, sigma_T, mean_T, sigma_T
)

# Find the max_matches nearest neighbors
motif_neighbors_distances = []
motif_neighbors_distances.append(0)
motif_neighbors_indices = []
motif_neighbors_indices.append(motif_idx)
nearest_neighbor_idx = np.argmin(D[k, :])
while len(motif_neighbors_distances) < max_matches:
    motif_neighbors_distances.append(D[k, :][nearest_neighbor_idx])
    motif_neighbors_indices.append(nearest_neighbor_idx)
    core.apply_exclusion_zone(D[k, :], nearest_neighbor_idx, excl_zone, np.inf)
    # Find the next nearest neighbor index after setting the exclusion zone
    nearest_neighbor_idx = np.argmin(D[k, :])
I am wondering if we made a mistake here since I wouldn't set the exclusion zone to `D[k, :]` anymore. I would set the exclusion zone to all subspace dimensions, and not only to `k`, in the distance profile after we have found each match. Therefore I was asking whether it is right to use your code that was previously thought to be applied to the matrix profile, but use it for setting the exclusion zones in the distance profile?
So I don't think setting `core.apply_exclusion_zone(D[k, :], nearest_neighbor_idx, excl_zone, np.inf)` would be enough here, if I understood everything correctly.
Yes, I would do it exactly as you described. But I am wondering if we did something wrong in the steps 7 to 9. I think we need to set the exclusion zone to all subspace dimensions as you mentioned in step 7.
Yes, I see what you mean now. `D[k, :]` would be totally wrong as `k` is the "index for the dimension of interest". If `k = 1` (i.e., 2 dimensions are needed), then `D[k, :]` would only target the second row of `D` (i.e., just a single row) and nothing else. In fact, if the subspace computed from `stumpy.mdl`, namely `subspaces[k]`, targeted dimensions `[1, 3]` then we should be applying the exclusion zone to `D[subspaces[k]]`, which is totally different from `D[k]` (which is incorrect).
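A quick shape check makes the difference concrete (toy array; only numpy indexing semantics are shown here):

```python
import numpy as np

D = np.zeros((5, 10))
k = 1
subspaces = [[0], [1, 3], [0, 1, 3], [0, 1, 2, 3], [0, 1, 2, 3, 4]]

row_via_k = D[k].shape                     # (10,): a single row, the wrong target
rows_via_subspace = D[subspaces[k]].shape  # (2, 10): both subspace rows
```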
Btw, it is really helpful for me to talk through this so thank you for engaging!
Super, perfect! Then I seem to have understood it correctly now (it helped a lot to visualize the procedure with the pseudo code of mSTAMP) 😄. Then I'll try to implement the whole thing as we discussed these last days! Hopefully, we will have taken everything into account 😅
Btw, it is really helpful for me to talk through this so thank you for engaging!
I have to thank you for taking so much time to do this! I'm really learning a lot right now and understanding the multidimensional motif search better and better!
I think we have another problem now :(
Now it isn't that easy to find the nearest neighbor index, since it is incorrect to write `motif_neighbors_idx = np.argmin(D[k, :])` for finding the next multidimensional match as we have done before. I am not sure how to proceed now since `D[subspace]` would give us `k+1` different indices for the next match (before applying the exclusion zone to all subspace dimensions).
I think we have to do it the way that the original authors compute the multidimensional matrix profile, but I am not sure. I would try it like this though (just as an experiment and without any fundamental understanding):

1. Compute `D = _multi_distance_profile(motif_idx, T, T, m, excl_zone, mean_T, sigma_T, mean_T, sigma_T)`.
2. Take `match_distances = D[subspace]`.
3. Sum up the distances and divide by `k+1` (number of relevant dimensions) to get the averaged values.

Do you have any ideas on that (seems to be more difficult than I anticipated)?
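The experiment proposed above might look like this as a toy sketch (the `dist_profiles` array stands in for the per-dimension z-normalized distance profiles; in real code those would be derived from the time series):

```python
import numpy as np

# Stand-in per-dimension distance profiles: 4 dimensions x 4 positions
dist_profiles = np.array(
    [
        [3.0, 2.0, 4.0, 5.0],
        [6.0, 1.0, 3.0, 4.0],
        [2.0, 2.0, 9.0, 1.0],
        [7.0, 3.0, 2.0, 8.0],
    ]
)
subspace = np.array([0, 1, 3])  # dimensions containing the motif (k + 1 = 3)

# Restrict to the subspace, average across its dimensions, take the argmin
match_distances = dist_profiles[subspace].sum(axis=0) / len(subspace)
next_match_idx = int(np.argmin(match_distances))
```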
Your proposal sounds correct! Wow, you're REALLY understanding this stuff and I'm really excited for you! I'd bet that you may be one of only like 10 people in this entire world who's ever dug in this deep 😄
Edited: Please see my next comment instead and ignore this part
If you look at the `_mstump` function, it is essentially following your steps, except that it handles other additional things like adjusting the multi-dimensional distance profile to account for discords or specific dimensions to include:
So, in between Step 1 and Step 2, you'd also do:
This means that you'll need to add `include` and `discords` parameters as input to your `mmotifs` and `mmatch` functions. And then continue with your Step 3 and Step 4 until you've run out of distances to process. Let me know if that makes sense.
Actually, on further inspection, there is already a (private) function in `mstump` that will do all of this work of computing the multi-dimensional distance profile for you (and this is what you already pointed out):
So, you give it all of the right inputs and it will spit back out the `D` that you need. In other words, it has already computed all of the averages for you, applied all of the `include` and `discords` parameters, and the trivial match exclusion zone has been applied, so there's nothing additional for you to do besides use `D`. So, the `D` that is produced isn't merely pairwise distances.
Essentially, I think one of these functions basically replaces your first 3 steps above. I haven't touched this in a long, long time and so I had forgotten about it but I had the foresight back then to anticipate this future need. This is definitely one of those moments when "past Sean had empathy for future Sean" 😄
Thank you!
Hmm, I am still wondering if this does exactly what we want. Let's imagine that we have a time series with d=4 dimensions, and we found out that the motif is represented in k=3 of these dimensions. The subspace tells us that the time series T_1, T_2 and T_4 contain the motif. So this function would (please correct me if I am wrong) compute the one-dimensional distance profiles for each of the four time series in the first step, sort these values in the second step, and afterwards obtain the multi-dimensional distance profile via lines 620 to 622. So after using _multi_distance_profile()
I would choose D[3] to get the distance profile I am searching for. Therefore all time series dimensions were used. But wouldn't this imply that it is possible to find the next match (after extracting the minimum value of D[3, :]) whose value was derived from the time series T_1, T_2 and T_3? But the motif isn't present in T_3 but instead in T_4. Therefore I am wondering whether it is accurate to compute the distance profile using all dimensions if we already know in which dimensions the motif appears.
My attempt would be to only compute the 1-D distance profiles for the dimensions where the motif is represented (here: T_1, T_2 and T_4), sum up their values and divide by k=3. I wouldn't calculate the whole distance profile for all possible dimensions, but only for the dimensions I am interested in.
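That proposal can be sketched directly with synthetic stand-in data (illustrative names only, not STUMPY internals):

```python
import numpy as np

# Hedged sketch of the proposal above: average the 1-D distance profiles
# over only the subspace dimensions (here T_1, T_2, and T_4), instead of
# averaging over all d dimensions.
rng = np.random.default_rng(0)
d, w = 4, 10
D = rng.random((d, w))           # stand-in per-dimension 1-D distance profiles
subspace = np.array([0, 1, 3])   # zero-based indices of T_1, T_2, T_4
k = len(subspace)                # k = 3 relevant dimensions
D_subspace = D[subspace].sum(axis=0) / k
```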
Ok, maybe that sounds too simple to be reasonable. The point I want to get across is the following: "If we compute the multi-dimensional matrix profile via _multi_distance_profile() and pick out the third dimension (since k=3 and I am not counting zero-based here), we should be able to find the corresponding subspace after we get the minimum value. But what if this subspace tells us that the match was found in different dimensions than the motif?"
I'm sorry if I sound a bit confusing! I'm preparing for my particle physics exam on Tuesday and I'm a bit stressed 😄 I guess I should focus on exam prep for the next few days and continue thinking about this after it's over. Hopefully a fresh and clear state of mind will allow me to successfully implement this stuff soon :D
@SaVoAMP I see your point and I think you are right! Is this what you were thinking?
# k is zero-based here
D = _multi_mass(
T[subspaces[k], motifs_idx[k] : motifs_idx[k] + m],
T[subspaces[k]],
m,
M_T[subspaces[k]],
Σ_T[subspaces[k]],
μ_Q[subspaces[k], motifs_idx[k]],
σ_Q[subspaces[k], motifs_idx[k]],
)
D = D.mean(axis=0)  # average the 1-D distance profiles across the subspace dimensions
If so, I wonder if all of this is already done for you in stumpy.match? I think that when @mihailescum wrote the stumpy.match function, he had anticipated multi-dimensional input as evidenced by:
This is essentially doing the same thing as _multi_mass above (which computes 1-D distance profiles) and then taking the average by dividing by the total number of dimensions. Would you mind taking a closer look at stumpy.match and seeing if it suits your needs?
Yes, it looks like something I could use!
My main problem wasn't that I couldn't program it, but that I wanted to make sure I understood it correctly before implementing it. I didn't want to just go off on my own when I might simply have made a thinking error (since we had done it differently during the discussions on multi-dimensional motif search), so I wanted to check it first. Thank you very much for your help! I think we have now considered the most important things, so I can implement it after the exam.
Agreed. Your thinking is sound and it appears consistent with what we had planned for before. It's been so long for me that I've forgotten a lot of the tricky nuance and you are making tremendous progress. Best of luck on your particle physics exam!
@SaVoAMP I hope that your exam went well! I just wanted to give you a heads up that a lot of changes were recently committed: motifs.py has been modified and core.py has a new core._preprocess function that you may be interested in using. At your earliest convenience, you'll want to "fetch upstream" and pull the latest changes into your local branch (see this section of the Contributors guide).
All right, thanks for the information!
Yes, I'm just starting to understand again what I meant last time. But right now I am somehow completely confused and don't know at all what is still right and what isn't here (I have managed to completely confuse myself) 😅
Do you think that your max_distance and cutoff parameters are still meaningful in the multi-dimensional case? Since the distances are getting bigger and bigger as more and more dimensions become relevant?
Do you think that your max_distance and cutoff parameters are still meaningful in the multi-dimensional case? Since the distances are getting bigger and bigger as more and more dimensions become relevant?
The short answer is "I don't know" but, currently, the default for both of these parameters is set relative to the distances/profile themselves. So even if there are bigger and bigger numbers, max_distance and cutoff would be chosen relative to those bigger numbers. It'll never be perfect but it's a reasonably safe starting point for the average user, and that should always be our target audience. Am I missing something?
Yeah, I saw that it is relative to the numbers, but I don't know how to generalize it to more dimensions, especially for finding the motifs with cutoff. I think max_distance may be possible if I choose all dimensions where the motif is present, take the average of those dimensions, and let the max_distance value refer to the maximal distance of this averaged array (treating it as if there were only one dimension). But I am currently absolutely clueless about how to approach the cutoff parameter for multiple dimensions and whether it makes sense at all here.
Side question: "What do the 'pragma: no cover' comments mean?"
Yeah, I saw that it is relative to the numbers, but I don't know how to generalize it to more dimensions, especially for finding the motifs with cutoff. I think max_distance may be possible if I choose all dimensions where the motif is present, take the average of those dimensions, and let the max_distance value refer to the maximal distance of this averaged array (treating it as if there were only one dimension). But I am currently absolutely clueless about how to approach the cutoff parameter for multiple dimensions and whether it makes sense at all here.
I agree with your explanation of how to use max_distance (i.e., essentially averaging across all of the relevant dimensions and then treating the averaged distances like a 1-D distance profile). For cutoff, how would you feel if we made it a bit more flexible like max_distance, where a user could either set cutoff = None and have a single blanket value chosen (computed from the current stddev approach that we have) OR, in the multi-dimensional case, have the stddev approach applied separately/independently to each dimension (so that each dimension would have its own cutoff)?

The major difference between max_distance and cutoff is that we don't really know what distances we'll encounter ahead of time with max_distance. In contrast, with cutoff, the user has already computed the full multi-dimensional matrix profile, so that information is known and available to the user a priori and the user can decide before even calling motif(). Because of this subtlety, we don't allow users to define a function to set the cutoff (i.e., they can just compute the cutoff however they want since they already have all of the information that they need). Again, as much of the "hard" decision making as possible should be left up to the user. I do like the idea of enforcing/allowing a separate cutoff for each dimension. Perhaps, in the multi-dimensional case, we force the user to supply a list of cutoffs (one for each dimension) and, when it is absent, we pick it ourselves using the stddev approach (one for each dimension). And as we iterate over the profile, we basically check all of the cutoffs for each dimension. How does that sound?
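As a minimal sketch of that idea (illustrative names only; this assumes one cutoff per matrix profile dimension and is not STUMPY's actual implementation):

```python
import numpy as np

# Hedged sketch: one cutoff per matrix profile dimension, checked while
# iterating over candidate motifs.
rng = np.random.default_rng(2)
mp = rng.random((3, 12))          # stand-in multi-dimensional matrix profile

# Fallback when the user supplies no cutoffs: stddev approach per dimension,
# floored at each row's minimum so the cutoff is never unreachable.
cutoffs = np.nanmean(mp, axis=1) - 2.0 * np.nanstd(mp, axis=1)
cutoffs = np.nanmax([cutoffs, np.nanmin(mp, axis=1)], axis=0)

k = 1                             # candidate motif lives in profile row k
candidate_idx = np.argmin(mp[k])
is_motif = mp[k, candidate_idx] <= cutoffs[k]  # check row k's own cutoff
```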
Side question: "What do the 'pragma: no cover' comments mean?"
Great question! So, in STUMPY, we write a ton of unit tests in order to help ensure that any code changes or public contributions do not break our code base (i.e., any breaking changes would cause our tests to fail, and then we can easily catch undesirable changes). In addition to writing unit tests, there is a concept called "code coverage", which essentially asks, "what percentage of the code base do we have unit tests for?" or "how much of our code base is 'covered' by our unit tests?". If every code path (i.e., if/else branching statement) is accounted for by some test, then it is possible to reach "100% code coverage". However, any lines of code that are not traversed during testing will reduce that percentage downward. As you can imagine, when you add a new mmotif() function, there are no unit tests yet, so this new function is "not covered" by any unit tests and STUMPY's code coverage percentage will drop (this is not a good thing since we always aim for 100% code coverage)!
However, there are some rare cases where we know what is going on in the code and have some level of confidence that a particular line does not need to be tested. So, that is what # pragma: no cover is for. It essentially tells the "code coverage" tool (which scans/reads the entire code base first) that that particular "branching logic" does not need to be covered. Typically, whatever is indented under that pragma is ignored. For example, it may not be important to require coverage for, say, setting a value:
if variable is None:
    variable = "I love unit testing!"
Unfortunately, the test coverage tool has no idea whether it is important or not to test this if branch. It only "knows" that it was missed by our unit tests (i.e., not a single unit test went "inside" the if statement) and reports it back to us. So, this is a reasonable candidate for a # pragma: no cover. Again, this is very rare and it should be used sparingly. Let me know if that makes sense.
Let me know if that makes sense.
Yes, that makes sense. Thank you!
With your approach for cutoff I am still not quite sure whether I have understood everything correctly. I'll have to take a closer look at that over the next few days. I'll first make sure that the function works with max_matches and then try to extend it with the other parameters.
OR in the multi-dimensional case, what if we applied the stddev approach separately/independently to each dimension (so that each dimension would have its own cutoff)?
By each dimension you mean the individual matrix profile dimensions and not the individual time series that make up the multi-dimensional time series T, right? Yes, I think that should work. Then I think I have understood it now and know roughly how I have to implement it later. I was just confused because I had only thought of the individual time series and wondered how we would get a single cutoff parameter for each motif out of the relevant time series dimensions. I therefore thought that we would have several cutoff values for a single motif, and I completely forgot that this information is already stored in the matrix profile.
You mean by each dimension the individual matrix profile dimensions and not the individual time series that make up the multidimensional time series T, right?
That is precisely what I mean. So, let's say that our multi-dimensional matrix profile was:
mp = [[ 1.0, 2.0, 3.0, 4.0],
[ 100.0, 200.0, 300.0, 400.0],
[1000.0, 2000.0, 3000.0, 4000.0],
]
Then you'd compute three cutoff values (one for each row but you also prevent the cutoff from being smaller than the smallest value in each row!):
cutoff = np.nanmax(
[np.nanmean(mp, axis=1) - 2.0 * np.nanstd(mp, axis=1), np.nanmin(mp, axis=1)],
axis=0
)
And this would mean that the cutoffs for each dimension would be:
[ 1., 100., 1000.]
This is mostly because 2 stddev below the mean isn't "valid" here (i.e., [2.63932023e-01, 2.63932023e+01, 2.63932023e+02] would be invalid cutoffs, as each falls below its row's minimum). So, in our working example, only one motif is found since all other matrix profile values are above the cutoff (in each respective dimension).
Now, instead of using the default, what if I make a poor choice and supply the cutoffs [1500, 150, 15] for the three dimensions? This means that the bottom row (i.e., the all-dimensions profile) will NEVER be used for finding a motif, since all of its matrix profile values are well above 15 and none of them can be a candidate.
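Putting the worked example above into a runnable form (using only the numbers already given in this discussion):

```python
import numpy as np

# The worked example, end to end: the mp values, the default per-dimension
# cutoffs, and the "poor choice" cutoffs all come from the discussion above.
mp = np.array([[   1.0,    2.0,    3.0,    4.0],
               [ 100.0,  200.0,  300.0,  400.0],
               [1000.0, 2000.0, 3000.0, 4000.0]])

# Default per-dimension cutoffs: mean - 2*std, floored at each row's minimum
cutoff = np.nanmax(
    [np.nanmean(mp, axis=1) - 2.0 * np.nanstd(mp, axis=1), np.nanmin(mp, axis=1)],
    axis=0,
)
# cutoff -> [1., 100., 1000.]: only the smallest value in each row qualifies

# The poor user-supplied choice: with 15.0 as the bottom row's cutoff,
# none of that row's values can ever qualify as a motif candidate.
poor_cutoffs = np.array([1500.0, 150.0, 15.0])
bottom_row_excluded = np.all(mp[2] > poor_cutoffs[2])
```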
Mmotifs seems to work now with the parameters max_matches and max_motifs here.
I'll try to include the other parameters as soon as I can.
(The parameters max_distance and atol should also work now.)
@SaVoAMP It looks like things are really starting to come together! Great work!
Instead of posting iterations of your code here, can you please submit your branch as a pull request? This way, I can provide feedback and request changes in a documented fashion.
Here are examples of some active pull requests (PRs) that are currently being reviewed
Just added the cutoff parameter with the corresponding operations. In essence, I calculate the parameter only once for each motif and its dimension (therefore not for all dimensions).
I'm just looking at how MDL works. Stupid question, but why do we need to discretize the multi-dimensional subsequence and its nearest neighbor? Aren't time series discrete anyway, since I only have a specific number of data points?
Given that there is interest for 1-D discord/motif search, it may be useful to have M-dimensional motif/discord search capabilities.
According to the last paragraph in Section IV E of the mSTAMP paper:
So, this means that it is possible to find different motifs at the same index location but with a different subset of dimensions!