Open seanlaw opened 2 years ago
@seanlaw
I think I found the source of error. According to my investigation, the problem is in the precision of $\rho$. The discrepancy is in index 2 and 12. In fact S1 = T[2:2+m]
and S2=T[12:12+m]
are exactly the same and the $\rho$ between them must be 1.
However, according to the calculation of stumpy.stump
, we can see that the highest value in $\rho[:, 2, 0]$ (print it in this line) is 0.9999999999483034
.
idx = 2 # the start index of subsequence of interest
arg = np.argmax(ρ[:, idx, 0])
>>> I[arg, 2, 0]
12 # this is NN to the subsequence at index idx=2
>>> ρ[arg, idx, 0]
0.9999999999483034
A potential solution:
~We may prevent this by creating config.STUMPY_RHO_PRECISION
and use it in the line of code I mentioned above before comparing the $\rho$ values, calculated in each thread, with each other (and then set it to 1 if the calculated value is too close to 1 (?!))~
@seanlaw
I do not know if this discrepancy (due to the numerical error) can happen in other cases as well or not... like when the true (exact) $\rho$ is not 1 (not identical) but we have difference between stumpy.stump
and naive.py/stump
due to numerical error.
How to handle identical case without using something like config.STUMPY_RHO_PRECISION
? I mean... if we try to add STUMPY_RHO_PRECISION
to config.py
and use it to replace close-to-1 $\rho$ values to 1, then it may wrongly impact other cases.
So, how about just handling identical cases? The following solution works for me BUT it is not efficient for massive data, say 10M data points. I think large-size data has higher chance of hash collision and thus increase the computing time (?)
# in core.py
def preprocess_diagonal(T, m):
...
# add a new part to find duplicates
identical = {}
for i in range(T.shape[0] - m + 1):
identical[tuple(T[i : i + m])] = []
for i in range(T.shape[0] - m + 1):
identical[tuple(T[i : i + m])].append(i)
# O(N) time complexity
T_subseq_IdenticalCode = np.full(T.shape[0] - m + 1, -1, np.int64)
for i, (key, item) in enumerate(identical.items()):
for idx in item:
T_subseq_IdenticalCode[idx] = i
# return T_subseq_IdenticalCode in addition to the other outputs
And, then we check if two subsequences are identical in stumpy._compute_diagonal
# in stumpy._compute_diagonal
if T_A_subseq_IdenticalCode[i + k] == T_A_subseq_IdenticalCode[i]:
pearson = 1.0
# or we can do it at the end to correct the values of matrix profile
An alternative way is to let collisions happen to make the process faster, and check it later.
# I think it O(N) time complexity
l = T.shape[0] - m + 1
subsequence_hash = np.empty(l , dtype=np.int64)
for i in range(l ):
subsequence_hash[i] = hash(tuple(T[i:i+m]))
# there is a chance that we might have collision but that is okay as we will check it out later.
Then, we can correct the matrix profile (i.e. matrix profile/ matrix profile left / matrix profile right) at the end:
# I think it O(N) time complexity having assumed that the z-norm distance between two subsequences is O(1).
for row in range(l):
idx = row
for col in range(3):
idx_nn = I[row, col] # use matrix profile index to find NN of idx
if subsequence_hash[idx] == subsequence_hash[idx_nn]:
# two subsequences might be identical since their hash is the same
if np.all(Tznorm[idx : idx + m] - Tznorm[idx_nn : idx_nn + m]) == 0): # check if two subsequences are actually identical
P[row, col] == 0
Is it worth it? we are increasing time in pre-processing and post-processing steps. (I believe both of these steps can be parallelized via prange
)
@NimaSarajpoor I don't think hash
is a good idea as it is slow and pushing subsequences into a Dict
may cause a lot of memory to be used as the length of the time series increases. Additionally, we'd need to account for z-normalization in these cases. Just because two subsequences aren't identical doesn't mean that their z-normalized subsequences aren't identical.
I don't know if it's worth addressing or what the best option is. Setting a config.STUMPY_RHO_PRECISION
(or some other name like config.STUMPY_PERFECT_CORRELATION = 0.9999999
) might not be the worst idea. At least, people can set it it to 1.0
if they dislike the precision that we've chosen and then we'd only need to compare if
$\rho$ > config.STUMPY_PERFECT_CORRELATION: and then shift it to
1.0` or something.
Alternatively, I wonder if we can check $\rho$ and if it is above the config.STUMPY_PERFECT_CORRELATION
then we should compute it directly using the dot product instead of using the distance from the previous diagonal.
I was wondering if what you have in mind is to check each single $\rho$ that is being calculated in stumpy.stump._compute_diagonal
. If yes, then in that case, the alternative approach
Alternatively, I wonder if we can check and if it is above the config.STUMPY_PERFECT_CORRELATION then we should compute it directly using the dot product instead of using the distance from the previous diagonal.
might (hugely ?) increase the computing time if the data mostly show periodical behavior. (because, we will have many identical subsequences)
On the other hand, I think we also want to make sure the values in final matrix profile are precise. I mean...if we say idx_nn = I[idx]
, then dist(T[idx:idx+m], T[idx_nn, idx_nn+m])
should be exact.
So, I think we can just re-calculate distance at the end after checking $\rho > config.STUMPY\_PERFECT\_CORRELATION$.
(I know we should better re-calculate it throughout the stumpy.stump._compute_diagonal
but I think this is too much for a rare case like this and, as I mentioned earlier, we may increase computing time when there are many identical subsequences. So, just correcting matrix profile at the end should be okay)
I think the only thing we'd need to change is add a line after L188 in _compute_diagonal
:
if T_B_subseq_isfinite[i + k] and T_A_subseq_isfinite[i]:
# Neither subsequence contains NaNs
if T_B_subseq_isconstant[i + k] or T_A_subseq_isconstant[i]:
pearson = 0.5
else:
pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i]
if pearson > config.STUMPY_PERFECT_CORRELATION:
pearson = 1.0
I think that this is the simplest and should be a negligible cost as there is an if
comparison (which should be cheap??) and then a value is set (not computed).
This should be simple, easy to maintain, easy to interpret, and if users don't like it then config.STUMPY_PERFECT_CORRELATION
is easy to understand
I think we also want to make sure the values in final matrix profile are precise
I don't know. Unfortunately, I don't think we'll get much more "precise". It's a trade off between speed and a slight loss in precision (but not much, actually). In the worst case, if you have a long sine wave (time series) OR when each subsequence has a perfect match, you'll end up computing all of the pairwise distances twice (once during _stump
and then once during post-processing if you do it your way! I believe that in my example above, it should be reasonably efficient but one should verify.
As I think about it more, if we were to do this as a post-processing step, we would write a separate function to stumpy.stump
. I would not add the post-processing step to the existing function. I think my proposal above is the way to go
I think that this is the simplest and should be a negligible cost as there is an if comparison (which should be cheap??) and then a value is set (not computed).
Correct. As you said, I think is better and enough for such rare case (you do not have to recalculate distances and you modify the values throughout the process).
As a cross reference to #655 this related test is also failing:
def test_stumpi_identical_subsequence_self_join_egress():
m = 3
seed = np.random.randint(100000)
seed = 84451
np.random.seed(seed)
@seanlaw I think you may want to see this! Apparently, it is not just for the identical subsequences. So, I created a test where I randomly scale up (* 1000) or scale down (0.001) chunks of time series.
Here is the test function:
def test_stump_imprecision():
seed = np.random.randint(100000)
np.random.seed(seed)
T = np.random.rand(64)
# randomly choose some chunks of time series and scale them up or down
T_mask = np.full(len(T), 0, dtype=bool)
k = np.random.randint(len(T) // 4, len(T))
IDX = np.random.choice(np.arange(len(T)), k, replace=False)
T_mask[IDX] = True
slices = core._get_mask_slices(T_mask)
scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)
for i, (start, stop) in enumerate(slices):
T[start:stop] = scale[i] * T[start:stop]
m = 3
zone = int(np.ceil(m / 4))
ref_mp = naive.stump(T, m, exclusion_zone=zone)
comp_mp = stump(T, m, ignore_trivial=True)
naive.replace_inf(ref_mp)
naive.replace_inf(comp_mp)
ref_P = ref_mp[:, 0].astype(np.float64)
comp_P = comp_mp[:, 0].astype(np.float64)
npt.assert_almost_equal(ref_P, comp_P)
And, here is the error:
Mismatched elements: 37 / 62 (59.7%)
E Max absolute difference: 0.22458425
E Max relative difference: 0.78388078
E x: array([2.6611068e-02, 7.4557520e-05, 5.2291006e-02, 3.3167208e-02,
E 1.0766382e-04, 4.5937140e-02, 4.4588591e-02, 3.0992438e-04,
E 1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810141e-01,...
E y: array([2.6611068e-02, 8.7222889e-05, 7.4846320e-02, 3.3167208e-02,
E 1.3051016e-04, 4.5937140e-02, 7.4861289e-02, 3.0992438e-04,
E 1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810142e-01,...
Hmm, so what do you think is the issue? I'm assuming that naive.stump
is "correct"?
I'm assuming that naive.stump is "correct"?
Yeah...I think so. I took a look at the naive.stump
and it just do z-norm on each single subsequence and then calculate the pair-wise distances. So, I assume naive.stump
should be correct.
According to my investigation, I think one problem is the imprecision in the calculation of sliding standard deviation Σ_T
, particularly when we have large numbers. (Just for the record: Even if we take std of each subseq (instead of rolling std), we can reduce the error significantly, but we still have error!)
One way to make it a little bit better is standardization on the whole time series T
(so, we scale std of each subseq by std(T)
), and then rescale back.
Another way that I think should work (I tested it to some extent) is that we refine the welford nanvar/nanstd. We can do it by calculating the rolling nanvar/nanstd for the subsequences each scaled by their minmax, and then scale back at the end.
# in stumpy/core.py
@njit
def rolling_minmax_scale(a, m):
"""
Docstring
"""
l = len(a) - m + 1
scale = np.empty(l, dtype=np.float64)
for i in range(l):
scale[i] = np.max(a[i : i + m]) - np.min(a[i : i + m])
if scale[i] == 0:
scale[i] = 1
return scale
Then, we use this to modify the welford as follows:
def _welford_nanvar(a, w, a_subseq_isfinite):
"""
Compute the rolling variance
"""
all_variances = np.empty(a.shape[0] - w + 1, dtype=np.float64)
prev_mean = 0.0
prev_var = 0.0
scale = rolling_minmax_scale(a, w) # NEW
for start_idx in range(a.shape[0] - w + 1):
prev_start_idx = start_idx - 1
stop_idx = start_idx + w # Exclusive index value
last_idx = start_idx + w - 1 # Last inclusive index value
if (
start_idx == 0
or not a_subseq_isfinite[prev_start_idx]
or not a_subseq_isfinite[start_idx]
):
curr_mean = np.nanmean(a[start_idx:stop_idx])
curr_var = np.nanvar(a[start_idx:stop_idx] / scale[start_idx]) # MODIFIED: minmax-scaled
else:
curr_mean = prev_mean + (a[last_idx] - a[prev_start_idx]) / w
curr_var = (
prev_var * np.square(scale[prev_start_idx] / scale[start_idx])
+ ((a[last_idx] - a[prev_start_idx]) / scale[start_idx])
* (
(a[last_idx] - curr_mean + a[prev_start_idx] - prev_mean)
/ scale[start_idx]
)
/ w
) # MODIFIED: reversing scale and scale again! # So, our curr_var is `var` of minmax-scaled subseq
all_variances[start_idx] = curr_var
prev_mean = curr_mean
prev_var = curr_var
return all_variances # this is now var of subsequences, where each subseq is scaled by minmax
And finally, reverse everything back to normal at the end:
def welford_nanstd(a, w=None):
"""
This a convenience wrapper around `welford_nanvar`.
"""
if w is None:
w = a.shape[0]
a_subseq_isfinite = rolling_isfinite(a, w)
scale = rolling_minmax_scale(a, w) # NEW
out = np.sqrt(np.clip(welford_nanvar(a, w), a_min=0, a_max=None))
out[a_subseq_isfinite] = scale[a_subseq_isfinite] * out[a_subseq_isfinite] # scale back to normal
return out
For instance, I am comparing the proposed one with the existing version:
test_stump_imprecision
(seed = 0) --> existing version: max abs diff: 0.36033988, proposed welford approach: 0.09287705@seanlaw In another test that I designed, the proposed welford performed worse. So, you may ignore the idea (I just did not delete it for our future ref)
Yeah, I'm pretty confident that Welford is about as precise as we're going to get and it has the best speed to precision tradeoff. 😢
@seanlaw As you mentioned, it seems that welford is okay. I checked its test function and tried it on the time series of the test functions.
I think I found the issue. Note that in stumpy/stump.py
, we do:
And, it seems that the high precision in std
does not guarantee the high precision of its inverse when std
is small.
Example:
>>> T
array([2.42145823e-01, 2.88656701e-04, 9.34900364e-04, 9.34545813e-04,
9.36884255e-04, 3.72888654e-04, 3.99196666e-04, 9.97234658e-04,
2.04350128e-04, 8.63723288e-01, 1.86409747e-01, 2.88656701e+02,
9.34900364e+02, 9.34545813e+02, 9.36884255e+02, 3.72888654e+02,
3.99196666e+02, 9.97234658e+02, 2.04350128e+02, 6.26529775e-01])
Let's take a look at std of this input T
, with m=3
, at index 2 (i.e. for subseq T[2 : 5])
# ref regards to exact std
# comp regards to welford std
np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12
np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177
And, if we compare the matrix profile ref_P (from naive.stump
) and comp_P(from stumpy.stump
), we can see that the npt.assert_almost_equal(...)
fails at index 2 and 12.
(The indices 2 and 12 are in fact the start index of two identical subsequences where one of them is scaled down by 0.001 (the subseq at index 2), and the other one is scaled up by 1000 (the subseq at index 12))
Ohh, wow! That's REALLY bad but also fun at the same time 😭
Yeah...I also tried many many different ways to resolve this. I have not been successful so far.
FYI:
(The indices 2 and 12 are in fact the start index of two identical subsequences where one of them is scaled down by 0.001 (the subseq at index 2), and the other one is scaled up by 1000 (the subseq at index 12))
If we scale up (down) by 10000 (0.0001), then:
np.abs(ref_std_inv[2] - comp_std_inv[2])
will be around 1000! (btw, we should also consider the impact of cov
as well in calculating pearson as shown in the line of code provided in my previous comment)
The exact distance between subseq_2
and subseq_12
is 1e-16, and the stump (which is based on pearson) gives about 0.02.
np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12 np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177
What happens when you print (1.0 / ref_std[2])
and print(1.0 / comp_std[2])
?
Can you also print ref_std[2]
and comp_std[2]
? Or give me the seed to generate T
precisely. I tried to copy/paste your T
above but am not seeing the issue.
What happens when you
print (1.0 / ref_std[2])
andprint(1.0 / comp_std[2])
?
>>> 1.0 / ref_std[2]
971803.3300204428
>>> 1.0 / comp_std[2]
971802.1137688879
In case that matters:
Since we scale the whole identical seq by 0.001
, we are scaling down its standard deviation as well, so std(scaled_seq)
is equal to 0.001 * std(seq)
. So, we are dealing with smaller std, and hence, its inverse becomes large. So let's say an actual value is $a$, and the value with imprecision is $a + \epsilon$, then, the imprecision of the inverse is:
$\frac{1}{a} - \frac{1}{a + \epsilon} = \frac{\epsilon}{a(a + \epsilon)}$, so, if $a$ is small itself, the imprecision of the inverse becomes large.
Can you tell me what seed and code you are using to generate T
?
I think this should give you the T
you are looking for.
def identical_cases_with_scale():
m = 3
zone = int(np.ceil(m / 4))
seed = 27343
np.random.seed(seed)
identical = np.random.rand(8)
T_A = np.random.rand(20)
T_A[1 : 1 + identical.shape[0]] = identical * 0.001
T_A[11 : 11 + identical.shape[0]] = identical * 1000
ref_mp = stump(T_A, m, exclusion_zone=zone, row_wise=True)
comp_mp = stumpy.stump(T_A, m, ignore_trivial=True)
replace_inf(ref_mp)
replace_inf(comp_mp)
return T_A, ref_mp, comp_mp
Hmm, I wonder in compute_mean_std
, if AFTER we compute Σ_T
(stddev), would be worth:
The reason why we use Welford is because of its speed and reasonable precision. As you identified, for the most part, what Welford produces is fine. However, for really small numbers, it's insufficient. And so maybe we add a correction where we use np.std
for those specific windows?
Any thoughts?
So, after line #869 in core.compute_mean_stddev
:
we can add something like this:
idx = np.where(Σ_T < config.STUMPY_STDDEV_CORRECTION_THRESHOLD) # config.STUMPY_STDDEV_CORRECTION_THRESHOLD = 1e-5
if len(idx) > 0:
Σ_T[idx] = np.std(core.rolling_window(T, m)[idx])
This might not work for the multi-dimensional case but I hope you can get my point
This might not work for the multi-dimensional case but I hope you can get my point
After I saw your suggestion, I tried it by calculating an upper bound for std, std_ub
, in welford
and then I checked if the std_ub
is below the threshold 1e-5. In such case, I simply did np.nanvar(..)
. This fixes the issue and now the test function with scaled identical sequences pass.
Now, there is one more thing we might be interested in taking care of. See test function below:
def test_stump_volatile_case1():
seed = 0
np.random.seed(seed)
T = np.random.rand(64)
T_mask = np.full(len(T), 0, dtype=bool)
k = np.random.randint(len(T) // 4, len(T))
IDX = np.random.choice(np.arange(len(T)), k, replace=False)
T_mask[IDX] = True
slices = core._get_mask_slices(T_mask)
scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)
for i, (start, stop) in enumerate(slices):
T[start:stop] = scale[i] * T[start:stop]
m = 3
zone = int(np.ceil(m / 4))
ref_mp = naive.stump(T, m, exclusion_zone=zone)
comp_mp = stump(T, m, ignore_trivial=True)
naive.replace_inf(ref_mp)
naive.replace_inf(comp_mp)
ref_P = ref_mp[:, 0].astype(np.float64)
comp_P = comp_mp[:, 0].astype(np.float64)
npt.assert_almost_equal(ref_P, comp_P)
Here, I randomly scaled up or down the values.
btw, I just realized a cleaner way was to do:
scale = np.random.choice(np.array([0.001, 1, 1000]), len(T), replace=True)
And then:
T = T * scale
Okay, let's get back to our main topic. This test function still fails. In fact, I used the exact mean and std, i.e. np.nanmean(rolling_window(...), axis=1)
and np.sqrt(np.nanvar(rolling_window(...), axis=1))
! I wanted to make sure I am not inserting any inaccuracy from these two parameters.
Here is the error:
Mismatched elements: 11 / 62 (17.7%)
E Max absolute difference: 0.03931141
E Max relative difference: 0.65784229
This shows that the pearson approach, i.e. using rolling cov
, is probably another source of imprecision. I haven't investigated this latter test case. so I don't know what is happening behind the scene!
Side note:
It is worthwhile to note that the matrix profile indices are the same in this case. (So, we can keep in mind that our last resort can be recalculating distances based on the nearest-neighbor indices provided in I
. Similar to what we did in stumpi
to retrieve the left matrix profile values)
@seanlaw
Btw, I would like to mention that the current solution, i.e. checking std and re-calculating its exact value
may not be an efficient solution if ALL VALUES of time series are small, e.g. T=[0.001, 0.003, -0.002, 0 0.001, ....]
. In this cases, we are going to end up with calculating the exact value of std for all subseqs (so, not fast)! I know this is a very rare case, but I just wanted to inform you of this case. I have some ideas but they are not well developed :)
Yeah, I don't think we should move away from using welford because it is not only reasonably precise but it doesn't use up all of our memory. With nanstd
, it can use a ton of memory when we have a sufficiently large T
. At some point, we just can't fix all precision issues. We can try our best and leave it at that. We also have to ask ourselves how likely are we to come across strange time series data like the contrived examples that we have?
Alternatively, can we somehow avoid the inverse and still use cov
?
We also have to ask ourselves how likely are we to come across strange time series data like the contrived examples that we have?
Correct! I think the current version is already good and because of that we should just try to improve it without drastically sacrificing the speed.
Can we somehow avoid the inverse
Yeah...I think that might help if it is possible. I thought about it and even tried to re-arrange the equations in the code but that did not help. I will think about it again.
Update
This might not work for the multi-dimensional case but I hope you can get my point
After I saw your suggestion, I tried it by calculating an upper bound for std,
std_ub
, inwelford
and then I checked if thestd_ub
is below the threshold 1e-5. In such case, I simply didnp.nanvar(..)
. This fixes the issue and now the test function with scaled identical sequences pass.Now, there is one more thing we might be interested in taking care of. See test function below:
def test_stump_volatile_case1():
the test function def test_stump_volatile_case1():
fails even when mean and std are calculated naively AND fastmath
is turned off.
the test function def test_stump_volatile_case1(): fails even when mean and std are calculated naively AND fastmath is turned off.
Can you help me understand what that means and if there is anything we could do?
Can you help me understand what that means and if there is anything we could do?
So I did fastmath=False
to remove its impact on imprecision if there is any. I also avoided using the rolling approach for calculating "mean" and "std" of subsequences. And.... I still got error.
I am still investigating this. I just wanted to update you and seek your opinion.
So, naive
computation of mean and stddev just means that we are using np.mean
and np.std
. This suggests that the issue for test_stump_volatile_case1()
is coming from somewhere else and may be unrelated to the precision of np.mean
and np.std
(as it relates to computing the inverse stddev as we had discussed earlier).
Is that the correct interpretation?
Yes...exactly.
According to my investigation, I realized the error (in that case) is coming from the error of cov
. The error is about 1e-11; however, the std
is small (and hence its inverse is large) and that results in considerable imprecision in pearson.
Also:
$\rho{ij} = \frac{cov{ij}}{\sigma{i} \sigma{j}}$ and we know that $-1 \le \rho{ij} \le 1$. That means: $- \sigma{i} \sigma{j} \le cov{ij} \le \sigma{i} \sigma{j}$
So, whenever we get small value for $\sigma{i} \sigma{j}$, the $|cov|$ becomes small as well. Let's consider two cases for now:
ref
/ comp
of the test function, we can see this is the cause of error in this case!)So, one idea can be:
# in stumpy/stump.py
std_combined = Σ_T[i + k] * σ_Q[i]
if std_combined < 1e-8:
# calculate `cov` using direct approach and not rolling appraoch
pearson = cov / std_combined
(I did a similar thing. I did if cov < 1e-8: #calculate cov directly
and it worked; however, I think std_combined < 1e-8
is better IF it works)
However, that means if the input time series T
contains many many almost-flat subsequences, this can result in higher computing time.
Interesting!
I did a similar thing. I did if cov < 1e-8: #calculate cov directly and it worked; however, I think std_combined < 1e-8 is better IF it works
Wouldn't std_combined = Σ_T[i + k] * σ_Q[i]
require passing Σ_T
into _compute_diagonal
? I think that undesirable. Unless you were thinking about putting it into a different function? Do you think this will significantly impact the readability of the code?
It is worthwhile to note that the matrix profile indices are the same in this case. (So, we can keep in mind that our last resort can be recalculating distances based on the nearest-neighbor indices provided in I. Similar to what we did in stumpi to retrieve the left matrix profile values)
Is this always true? Is it possible to have the wrong index if the distance is wrong? If not, then maybe as a post-processing step we simply compute the $\frac{1}{\sigma_Q[i]\Sigma_T[j]}$ for the final pairs of nearest neighbors and then only recompute the distances where they are below some threshold. Again, one would need to figure out if the indices are always correct regardless of the initial distances.
However, that means if the input time series T contains many many almost-flat subsequences, this can result in higher computing time.
Frankly, if the user's data contains mostly of almost-flat subsequences then it's probably a really, really poor dataset to begin with for matrix profiles in general 😢
however, I think
std_combined < 1e-8
is better IF it works)
I tried my proposed solution (i.e. using std_combined < 1e-8
instead of if cov < 1e-8)
and it failed! I will let you know if I fix it.
Wouldn't std_combined = Σ_T[i + k] * σ_Q[i] require passing Σ_T into _compute_diagonal? Do you think this will significantly impact the readability of the code?
I don't think so. It seems we are using Σ_T_inverse
and σ_Q_inverse
just for calculating pearson and nothing else. So, instead of passing Σ_T_inverse
and σ_Q_inverse
, we should be able to just pass Σ_T
and σ_Q
into _stump
and then into _compute_PI
.
Is this always true?
Well, it was true at the time I was writing that comment :) If I remember correctly, in one of the cases through my investigation, I faced a case where the indices of ref
and comp
were different. I will see if I can find that case again. If the indices are different but their corresponding distances are close, then we might be able to only do post-processing still.
Frankly, if the user's data contains mostly of almost-flat subsequences then it's probably a really, really poor dataset to begin with for matrix profiles in general 😢
Yeah...I meant if someone has T
with 10M data points and it has two almost-flat subsequences with length 500k
. Then, at some point throughout the calculation of matrix profile, calculating the cov directly might increase the time. But, as you said, it is not possible(?) to take care of everything.
I don't think so. It seems we are using Σ_T_inverse and σ_Q_inverse just for calculating pearson and nothing else. So, instead of passing Σ_T_inverse and σ_Q_inverse, we should be able to just pass Σ_T and σ_Q into _stump and then into _compute_PI.
Well, the reason we are passing Σ_T_inverse
and σ_Q_inverse
into _compute_diagonal
is so that we don't need to repeatedly re-compute the inverse which should save us a bit of time. In the case where all subsequences are finite and none of the subsequences are constant then this means that pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i]
is always being used. If we choose not to use inverse
then we should assess the computational cost of doing inverse
on-the-fly vs pre-computing it.
Well, the reason we are passing
Σ_T_inverse
andσ_Q_inverse
into_compute_diagonal
is so that we don't need to repeatedly re-compute the inverse which should save us a bit of time
Ah...right!
If we choose not to use
inverse
then we should assess the computational cost of doinginverse
on-the-fly vs pre-computing i
Correct. Well, maybe I should do this first and then if I realize that the cost is negligible, I can try to work on imprecision issue.
Sounds good!
So, I have been working on this from time to time. There are a couple of things that I would like to share.
(1) Maybe I am missing something but it seems there is a small inconsistency between the outputs of naive.compute_mean_std
and core.compute_mean_std
, and core.preprocess
.
# input:
T = np.array([0, np.nan, 2, 3, 4, 5, 6, 7, np.inf, 9])
m = 3
And, the sliding standard deviation:
>>> Σ_T_naive # from `naive.compute_mean_std`
array([1. , 0.5 , 0.81649658, 0.81649658, 0.81649658,
0.81649658, 0.5 , 1. ])
>>> Σ_T_compute # from `core.compute_mean_std`
array([1. , 0.5 , 0.81649658, 0.81649658, 0.81649658,
0.81649658, 0. , 0. ])
>>> Σ_T_preprocess # from `core.preprocess`
array([1. , 0.5 , 0.81649658, 0.81649658, 0.81649658,
0.81649658, 0.5 , 1. ])
Also, note that the second element and one-before-last element of T
are both not finite
. So, it might be better to set the std of their corresponding subsequences to 0
(and then those 0
s will be replaced with 1
since we need to get their inverse).
So, I did:
is_finite = core.rolling_isfinite(T, m) # at the beginning of process
Σ_T[~is_finite] = 0 # at the end of process
And, it works. I tested test_core
and test_stump
.
(2) I also tried a more-expensive approach to calculate Σ_T. In short, I did something like this:
T_sqr = np.square(T)
M_T_sqr = # calculate sliding mean for np.square(T)
var_T = M_T_sqr - np.square(M_T)
Σ_T = np.sqrt(var_T)
I didn't provide the check for inf
/ nan
values in the code snippet above; however, I considered them when I was implementing it. I tested and it works, and it can pass the test case where there are two identical subseqs, and one of them is multiplied by 1000, and the other by 0.001 (see this test function in PR#657)
Regarding the computing time, for n=100_000; m=50
, it increases the time from 0.005
to 0.01
. Note that the stump
process itself is about 14sec (if I remember correctly). So, this is negligible. For n=1M, m=100
, it increase the time from 0.067
to 0.1
. And, for n=10M, m=500
, it increases the time from 1.68
to 3.06
. It is almost 100% increase in time. However, compared to the main process stump
. this is negligible.
What do you think?
update
side note: I think my approach, i.e. using sliding mean of np.square(T)
to calculate Σ_T
, might(?) result in error if elements of T
are large values. I have not investigated it though. But first, I just want to know how you feel about this approach.
Maybe I am missing something but it seems there is a small inconsistency between the outputs of naive.compute_mean_std and core.compute_mean_std, and core.preprocess.
Maybe I'm not seeing it but it looks like naive.compute_mean_std
and core.preprocess
produce the same results, which is expected. In both of those cases, all np.inf
are converted to np.nan
first and then the mean and stddev are calculated. However, core.compute_mean_std
only computes the pure mean and stddev and should be the same as np.nanmean
and np.nanstd
.
The source of the difference is when you allow np.inf
to exist when computing, say, stddev. If I recall correctly, numpy
simply ignores/removes the np.inf
and computs the stddev using the remaining non-np.inf
values. So, 0.5
comes from np.std([6, 7]) == np.std([6, 7, np.inf])
and the 1.0
comes from np.std([7, 9]) == np.std([7, np.inf, 9])
So, there is no imprecision.
Also, note that the second element and one-before-last element of T are both not finite. So, it might be better to set the std of their corresponding subsequences to 0 (and then those 0s will be replaced with 1 since we need to get their inverse).
We currently handle this in core.preprocess_diagonal
I also tried a more-expensive approach to calculate Σ_T. In short, I did something like this:
It's been a while and I don't remember what happened but I believe that this approach may lead to "catastrophic cancellation" and even worse precision issues.
side note: I think my approach, i.e. using sliding mean of np.square(T) to calculate Σ_T, might(?) result in error if elements of T are large values. I have not investigated it though. But first, I just want to know how you feel about this approach.
Exactly! This stuff is really nasty :(
Maybe I am missing something but it seems there is a small inconsistency between the outputs of naive.compute_mean_std and core.compute_mean_std, and core.preprocess.
Maybe I'm not seeing it but it looks like
naive.compute_mean_std
andcore.preprocess
produce the same results, which is expected. In both of those cases, allnp.inf
are converted tonp.nan
first and then the mean and stddev are calculated. However,core.compute_mean_std
only computes the pure mean and stddev and should be the same asnp.nanmean
andnp.nanstd
.The source of the difference is when you allow
np.inf
to exist when computing, say, stddev. If I recall correctly,numpy
simply ignores/removes thenp.inf
and computs the stddev using the remaining non-np.inf
values. So,0.5
comes fromnp.std([6, 7]) == np.std([6, 7, np.inf])
and the1.0
comes fromnp.std([7, 9]) == np.std([7, np.inf, 9])
So, there is no imprecision.
Also, note that the second element and one-before-last element of T are both not finite. So, it might be better to set the std of their corresponding subsequences to 0 (and then those 0s will be replaced with 1 since we need to get their inverse).
We currently handle this in
core.preprocess_diagonal
Thanks for the explanation! It makes sense. I can see it now!
I also tried a more-expensive approach to calculate Σ_T. In short, I did something like this:
It's been a while and I don't remember what happened but I believe that this approach may lead to "catastrophic cancellation" and even worse precision issues.
side note: I think my approach, i.e. using sliding mean of np.square(T) to calculate Σ_T, might(?) result in error if elements of T are large values. I have not investigated it though. But first, I just want to know how you feel about this approach.
Exactly! This stuff is really nasty :(
Is that okay if I go ahead and try some changes and push them to the existing PR (#657) and see how it goes? (I usually test them on some important test functions. Then I can push the commits and then test it on all test functions. You can also see the code)
T_sqr = np.square(T) M_T_sqr = # calculate sliding mean for np.square(T) var_T = M_T_sqr - np.square(M_T) Σ_T = np.sqrt(var_T)
Can you use LaTeX and write out the exact formulas or provide the full code. This looks like some things are missing?
Is that okay if I go ahead and try some changes and push them to the existing PR (https://github.com/TDAmeritrade/stumpy/pull/657) and see how it goes? (I usually test them on some important test functions. Then I can push the commits and then test it on all test functions. You can also see the code)
Sure
Can you use LaTeX and write out the exact formulas or provide the full code. This looks like some things are missing?
In below, I am calculating the variance, $Var$, for a subsequence of length m
. $\mu$ is the mean of subsequence. $\mu'$ is the mean of square of subsequence. (so, $\mu'$ is just np.mean(np.square(Q))
, where Q is a subsequence with length m
)
$Var = E[(x-\mu)^{2}]$ $Var = E[x^{2} - 2\mu x + \mu^{2}]$ $Var = E[x^{2}] - 2\mu E[x] + \mu^{2}$ $Var = E[x^{2}] - 2\mu E[x] + \mu^{2}$ $Var = \mu' - 2\mu \cdot \mu + \mu^{2}$ $Var = \mu' - \mu^{2}$
So, the variance is mean of square-of-subseq minus square of mean of subseq
(this is the same as what provided in the wikipedia.)
So, I can use the sliding mean of square of T
and the sliding mean of T
to calculate the sliding Var
. I think the error might be less. I need to investigate it further.
Downside:
If the values of T
gets REALLY large, then square of T
becomes a series that contains even larger values, and hence maybe(?) worse precision. Again, I need to investigate it. So far, this approach passes test_core
, test_stump
and test_stumped
(which was failing in PR #657)
or provide the full code.
I will push the changes to the PR #657.
What happens when we try this test case with the Matlab code? Do they experience any issues with precision? If so, then maybe we can email Eamonn about it and see if he has any suggestions.
This unit test causes a failure: