TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/

Fix Precision in Unit Test #610

Open seanlaw opened 2 years ago

seanlaw commented 2 years ago

This unit test causes a failure:

import numpy as np
import numpy.testing as npt
import naive
from stumpy import stump, config
import pandas as pd

def test_stump_identical_subsequence_self_join():
    seed = 27343
    np.random.seed(seed)

    identical = np.random.rand(8)
    T_A = np.random.rand(20)
    T_A[1 : 1 + identical.shape[0]] = identical
    T_A[11 : 11 + identical.shape[0]] = identical
    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T_A, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stump(T_A, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices

    comp_mp = stump(pd.Series(T_A), m, ignore_trivial=True)
    naive.replace_inf(comp_mp)
    npt.assert_almost_equal(
        ref_mp[:, 0], comp_mp[:, 0], decimal=config.STUMPY_TEST_PRECISION
    )  # ignore indices
NimaSarajpoor commented 2 years ago

@seanlaw I think I found the source of the error. According to my investigation, the problem is in the precision of $\rho$. The discrepancy is at indices 2 and 12. In fact, S1 = T[2 : 2 + m] and S2 = T[12 : 12 + m] are exactly the same, so the $\rho$ between them must be 1.

However, according to the calculation in stumpy.stump, the highest value in $\rho[:, 2, 0]$ (print it in this line) is 0.9999999999483034.

>>> idx = 2  # the start index of the subsequence of interest
>>> arg = np.argmax(ρ[:, idx, 0])
>>> I[arg, idx, 0]
12  # this is the NN of the subsequence at index idx=2

>>> ρ[arg, idx, 0]
0.9999999999483034
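
For reference, here is a quick sanity check (my own snippet, using T_A and m from the failing test above) confirming that the exact correlation between these two subsequences is 1:

# The subsequences starting at indices 2 and 12 are copies of each other,
# so their Pearson correlation should be exactly 1.0
S1 = T_A[2 : 2 + m]
S2 = T_A[12 : 12 + m]
print(np.array_equal(S1, S2))     # True
print(np.corrcoef(S1, S2)[0, 1])  # 1.0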

A potential solution: ~We may prevent this by creating config.STUMPY_RHO_PRECISION and using it in the line of code mentioned above before comparing the $\rho$ values calculated in each thread with each other (and then setting the value to 1 if it is too close to 1 (?!))~

NimaSarajpoor commented 2 years ago

@seanlaw I do not know whether this discrepancy (due to numerical error) can happen in other cases as well... e.g., when the true (exact) $\rho$ is not 1 (the subsequences are not identical) but we still see a difference between stumpy.stump and naive.py/stump due to numerical error.


How can we handle the identical case without using something like config.STUMPY_RHO_PRECISION? I mean... if we add STUMPY_RHO_PRECISION to config.py and use it to replace close-to-1 $\rho$ values with 1, then it may wrongly impact other cases.

So, how about just handling identical cases? The following solution works for me BUT it is not efficient for massive data, say 10M data points. I think larger data has a higher chance of hash collisions and thus increases the computing time (?)

# in core.py

def preprocess_diagonal(T, m):
    ...
    # new part: find duplicate subsequences
    identical = {}
    for i in range(T.shape[0] - m + 1):
        identical[tuple(T[i : i + m])] = []
    for i in range(T.shape[0] - m + 1):
        identical[tuple(T[i : i + m])].append(i)

    # O(N) time complexity
    T_subseq_IdenticalCode = np.full(T.shape[0] - m + 1, -1, np.int64)
    for i, (key, item) in enumerate(identical.items()):
        for idx in item:
            T_subseq_IdenticalCode[idx] = i

    # return T_subseq_IdenticalCode in addition to the other outputs

And, then we check if two subsequences are identical in stumpy._compute_diagonal

# in stumpy._compute_diagonal
if T_A_subseq_IdenticalCode[i + k] == T_A_subseq_IdenticalCode[i]:
    pearson = 1.0

# or we can do it at the end to correct the values of the matrix profile

An alternative way is to let collisions happen to make the process faster, and check them later.

# I think this is O(N) time complexity

l = T.shape[0] - m + 1
subsequence_hash = np.empty(l, dtype=np.int64)
for i in range(l):
    subsequence_hash[i] = hash(tuple(T[i : i + m]))

# there is a chance of collisions but that is okay as we will check for them later

Then, we can correct the matrix profile (i.e. matrix profile / left matrix profile / right matrix profile) at the end:

# I think this is O(N) time complexity, having assumed that the z-norm distance
# between two subsequences is O(1).

for row in range(l):
    idx = row
    for col in range(3):
        idx_nn = I[row, col]  # use the matrix profile index to find the NN of idx
        if subsequence_hash[idx] == subsequence_hash[idx_nn]:
            # the two subsequences might be identical since their hashes are the same
            if np.all(Tznorm[idx : idx + m] == Tznorm[idx_nn : idx_nn + m]):
                # the two subsequences are actually identical
                P[row, col] = 0

Is it worth it? We are increasing the time of the pre-processing and post-processing steps. (I believe both of these steps can be parallelized via prange.)

seanlaw commented 2 years ago

@NimaSarajpoor I don't think hash is a good idea as it is slow and pushing subsequences into a Dict may cause a lot of memory to be used as the length of the time series increases. Additionally, we'd need to account for z-normalization in these cases. Just because two subsequences aren't identical doesn't mean that their z-normalized subsequences aren't identical.

I don't know if it's worth addressing or what the best option is. Setting a config.STUMPY_RHO_PRECISION (or some other name like config.STUMPY_PERFECT_CORRELATION = 0.9999999) might not be the worst idea. At least, people can set it to 1.0 if they dislike the precision that we've chosen, and then we'd only need to compare if $\rho$ > config.STUMPY_PERFECT_CORRELATION: and shift it to 1.0 or something.

Alternatively, I wonder if we can check $\rho$ and if it is above the config.STUMPY_PERFECT_CORRELATION then we should compute it directly using the dot product instead of using the distance from the previous diagonal.
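
A rough sketch of what that could look like inside _compute_diagonal (hypothetical, not STUMPY's actual code; it assumes the sliding means M_T and μ_Q are also available in that scope):

if pearson > config.STUMPY_PERFECT_CORRELATION:
    # Recompute pearson directly from the dot product to discard the error
    # accumulated along the diagonal
    QT = np.dot(T_B[i + k : i + k + m], T_A[i : i + m])
    pearson = (QT / m - M_T[i + k] * μ_Q[i]) * Σ_T_inverse[i + k] * σ_Q_inverse[i]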

NimaSarajpoor commented 2 years ago

I was wondering whether what you have in mind is to check every single $\rho$ that is calculated in stumpy.stump._compute_diagonal. If yes, then the alternative approach

Alternatively, I wonder if we can check $\rho$ and if it is above the config.STUMPY_PERFECT_CORRELATION then we should compute it directly using the dot product instead of using the distance from the previous diagonal.

might (hugely?) increase the computing time if the data mostly shows periodic behavior (because we would have many identical subsequences).


On the other hand, I think we also want to make sure the values in the final matrix profile are precise. I mean... if we say idx_nn = I[idx], then dist(T[idx : idx + m], T[idx_nn : idx_nn + m]) should be exact.

So, I think we can just re-calculate the distance at the end after checking $\rho > config.STUMPY\_PERFECT\_CORRELATION$. (I know it would be better to re-calculate it throughout stumpy.stump._compute_diagonal, but I think that is too much for a rare case like this and, as I mentioned earlier, we may increase the computing time when there are many identical subsequences. So, just correcting the matrix profile at the end should be okay.)

seanlaw commented 2 years ago

I think the only thing we'd need to change is to add a line after L188 in _compute_diagonal:

https://github.com/TDAmeritrade/stumpy/blob/576de379994896063be71d264e48a08ae17476cf/stumpy/stump.py#L183-L188

if T_B_subseq_isfinite[i + k] and T_A_subseq_isfinite[i]: 
    # Neither subsequence contains NaNs 
    if T_B_subseq_isconstant[i + k] or T_A_subseq_isconstant[i]: 
        pearson = 0.5 
    else: 
        pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i] 
        if pearson > config.STUMPY_PERFECT_CORRELATION:
            pearson = 1.0

I think that this is the simplest and should be a negligible cost as there is an if comparison (which should be cheap??) and then a value is set (not computed).

This should be simple, easy to maintain, easy to interpret, and if users don't like it then config.STUMPY_PERFECT_CORRELATION is easy to understand.

I think we also want to make sure the values in final matrix profile are precise

I don't know. Unfortunately, I don't think we'll get much more "precise". It's a trade-off between speed and a slight loss in precision (but not much, actually). In the worst case, if you have a long sine wave (time series) OR when each subsequence has a perfect match, you'll end up computing all of the pairwise distances twice (once during _stump and then once during post-processing if you do it your way!). I believe that my example above should be reasonably efficient but one should verify.

seanlaw commented 2 years ago

As I think about it more, if we were to do this as a post-processing step, we would write a separate function rather than modify stumpy.stump. I would not add the post-processing step to the existing function. I think my proposal above is the way to go.

NimaSarajpoor commented 2 years ago

I think that this is the simplest and should be a negligible cost as there is an if comparison (which should be cheap??) and then a value is set (not computed).

Correct. As you said, I think it is better and sufficient for such a rare case (you do not have to recalculate distances and you modify the values throughout the process).

seanlaw commented 1 year ago

As a cross reference to #655, this related test is also failing:

def test_stumpi_identical_subsequence_self_join_egress():
    m = 3

    seed = np.random.randint(100000)
    seed = 84451
    np.random.seed(seed)
NimaSarajpoor commented 1 year ago

@seanlaw I think you may want to see this! Apparently, it is not just the identical subsequences. So, I created a test where I randomly scale up (× 1000) or scale down (× 0.001) chunks of the time series.

Here is the test function:

def test_stump_imprecision():
    seed = np.random.randint(100000)
    np.random.seed(seed)

    T = np.random.rand(64)

    # randomly choose some chunks of time series and scale them up or down 
    T_mask = np.full(len(T), 0, dtype=bool)
    k = np.random.randint(len(T) // 4, len(T))
    IDX = np.random.choice(np.arange(len(T)), k, replace=False)
    T_mask[IDX] = True
    slices = core._get_mask_slices(T_mask)
    scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)
    for i, (start, stop) in enumerate(slices):
        T[start:stop] = scale[i] * T[start:stop]

    m = 3
    zone = int(np.ceil(m / 4))
    ref_mp = naive.stump(T, m, exclusion_zone=zone)
    comp_mp = stump(T, m, ignore_trivial=True)
    naive.replace_inf(ref_mp)
    naive.replace_inf(comp_mp)

    ref_P = ref_mp[:, 0].astype(np.float64)
    comp_P = comp_mp[:, 0].astype(np.float64)

    npt.assert_almost_equal(ref_P, comp_P)

And, here is the error:

 Mismatched elements: 37 / 62 (59.7%)
E       Max absolute difference: 0.22458425
E       Max relative difference: 0.78388078
E        x: array([2.6611068e-02, 7.4557520e-05, 5.2291006e-02, 3.3167208e-02,
E              1.0766382e-04, 4.5937140e-02, 4.4588591e-02, 3.0992438e-04,
E              1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810141e-01,...
E        y: array([2.6611068e-02, 8.7222889e-05, 7.4846320e-02, 3.3167208e-02,
E              1.3051016e-04, 4.5937140e-02, 7.4861289e-02, 3.0992438e-04,
E              1.8665326e-01, 4.5937140e-02, 7.9446847e-02, 1.1810142e-01,...
seanlaw commented 1 year ago

Hmm, so what do you think is the issue? I'm assuming that naive.stump is "correct"?

NimaSarajpoor commented 1 year ago

I'm assuming that naive.stump is "correct"?

Yeah... I think so. I took a look at naive.stump and it just z-normalizes each subsequence and then calculates the pairwise distances. So, I assume naive.stump should be correct.


According to my investigation, I think one problem is imprecision in the calculation of the sliding standard deviation Σ_T, particularly when we have large numbers. (Just for the record: even if we take the std of each subsequence directly (instead of the rolling std), we reduce the error significantly, but we still have error!)


One way to make it a little better is to standardize the whole time series T (so we scale the std of each subsequence by std(T)) and then rescale back at the end.
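
As a minimal sketch of that idea (my own snippet; it relies on std(c * x) == c * std(x) for c > 0 and on the fact that shifting by a constant does not change the std, and assumes T is finite):

from stumpy import core

T_std = np.std(T)                               # global std of the whole series
T_scaled = (T - np.mean(T)) / T_std             # standardize the whole series
Σ_T = T_std * core.welford_nanstd(T_scaled, m)  # rescale the rolling std back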

Another way that I think should work (I tested it to some extent) is to refine the Welford nanvar/nanstd. We can do this by calculating the rolling nanvar/nanstd of the subsequences, each scaled by its min-max range, and then scaling back at the end.

# in stumpy/core.py
@njit
def rolling_minmax_scale(a, m):
    """
    Compute the min-max range of each subsequence (used as a scaling factor)
    """
    l = len(a) - m + 1
    scale = np.empty(l, dtype=np.float64)
    for i in range(l):
        scale[i] = np.max(a[i : i + m]) - np.min(a[i : i + m])
        if scale[i] == 0:
            scale[i] = 1

    return scale

Then, we use this to modify the welford as follows:

def _welford_nanvar(a, w, a_subseq_isfinite):
    """
    Compute the rolling variance 
    """
    all_variances = np.empty(a.shape[0] - w + 1, dtype=np.float64)
    prev_mean = 0.0
    prev_var = 0.0

    scale = rolling_minmax_scale(a, w) # NEW

    for start_idx in range(a.shape[0] - w + 1):
        prev_start_idx = start_idx - 1
        stop_idx = start_idx + w  # Exclusive index value
        last_idx = start_idx + w - 1  # Last inclusive index value

        if (
            start_idx == 0
            or not a_subseq_isfinite[prev_start_idx]
            or not a_subseq_isfinite[start_idx]
        ):
            curr_mean = np.nanmean(a[start_idx:stop_idx])
            curr_var = np.nanvar(a[start_idx:stop_idx] / scale[start_idx]) # MODIFIED: minmax-scaled
        else:
            curr_mean = prev_mean + (a[last_idx] - a[prev_start_idx]) / w

            curr_var = (
                prev_var * np.square(scale[prev_start_idx] / scale[start_idx])
                + ((a[last_idx] - a[prev_start_idx]) / scale[start_idx])
                * (
                    (a[last_idx] - curr_mean + a[prev_start_idx] - prev_mean)
                    / scale[start_idx]
                )
                / w
            )  # MODIFIED: undo the previous scale and apply the current one, so `curr_var` is the variance of the min-max-scaled subsequence

        all_variances[start_idx] = curr_var

        prev_mean = curr_mean
        prev_var = curr_var

    return all_variances # this is now var of subsequences, where each subseq is scaled by minmax

And finally, reverse everything back to normal at the end:

def welford_nanstd(a, w=None):
    """
    This is a convenience wrapper around `welford_nanvar`.
    """
    if w is None:
        w = a.shape[0]

    a_subseq_isfinite = rolling_isfinite(a, w)
    scale = rolling_minmax_scale(a, w) # NEW

    out = np.sqrt(np.clip(welford_nanvar(a, w), a_min=0, a_max=None))
    out[a_subseq_isfinite] = scale[a_subseq_isfinite] * out[a_subseq_isfinite] # scale back to normal

    return out

For instance, I am comparing the proposed one with the existing version:

NimaSarajpoor commented 1 year ago

@seanlaw In another test that I designed, the proposed Welford performed worse. So, you may ignore the idea (I just did not delete it, for our future reference).

seanlaw commented 1 year ago

Yeah, I'm pretty confident that Welford is about as precise as we're going to get and it has the best speed to precision tradeoff. 😢

NimaSarajpoor commented 1 year ago

@seanlaw As you mentioned, it seems that welford is okay. I checked its test function and tried it on the time series of the test functions.


I think I found the issue. Note that in stumpy/stump.py, we do:

https://github.com/TDAmeritrade/stumpy/blob/d57169f3a8c3a1571c12d84f490e8c7e511b224e/stumpy/stump.py#L188

And, it seems that high precision in the std does not guarantee high precision in its inverse when the std is small.

Example:

>>> T
array([2.42145823e-01, 2.88656701e-04, 9.34900364e-04, 9.34545813e-04,
       9.36884255e-04, 3.72888654e-04, 3.99196666e-04, 9.97234658e-04,
       2.04350128e-04, 8.63723288e-01, 1.86409747e-01, 2.88656701e+02,
       9.34900364e+02, 9.34545813e+02, 9.36884255e+02, 3.72888654e+02,
       3.99196666e+02, 9.97234658e+02, 2.04350128e+02, 6.26529775e-01])

Let's take a look at std of this input T, with m=3, at index 2 (i.e. for subseq T[2 : 5])

# ref refers to the exact std
# comp refers to the Welford std

np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12
np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177

And, if we compare the matrix profile ref_P (from naive.stump) and comp_P (from stumpy.stump), we can see that npt.assert_almost_equal(...) fails at indices 2 and 12.

(The indices 2 and 12 are in fact the start index of two identical subsequences where one of them is scaled down by 0.001 (the subseq at index 2), and the other one is scaled up by 1000 (the subseq at index 12))

seanlaw commented 1 year ago

Ohh, wow! That's REALLY bad but also fun at the same time 😭

NimaSarajpoor commented 1 year ago

Yeah...I also tried many many different ways to resolve this. I have not been successful so far.


FYI:

(The indices 2 and 12 are in fact the start index of two identical subsequences where one of them is scaled down by 0.001 (the subseq at index 2), and the other one is scaled up by 1000 (the subseq at index 12))

If we scale up (down) by 10000 (0.0001), then:

np.abs(ref_std_inv[2] - comp_std_inv[2])

will be around 1000! (Btw, we should also consider the impact of cov in calculating pearson, as shown in the line of code provided in my previous comment.)

The exact distance between subseq_2 and subseq_12 is 1e-16, and stump (which is based on pearson) gives about 0.02.

seanlaw commented 1 year ago

np.abs(ref_std[2] - comp_std[2]) = 1.2878556459058026e-12
np.abs(ref_std_inv[2] - comp_std_inv[2]) = 1.2162515548989177

What happens when you print (1.0 / ref_std[2]) and print(1.0 / comp_std[2])?

Can you also print ref_std[2] and comp_std[2]? Or give me the seed to generate T precisely. I tried to copy/paste your T above but am not seeing the issue.

NimaSarajpoor commented 1 year ago

What happens when you print (1.0 / ref_std[2]) and print(1.0 / comp_std[2])?

>>> 1.0 / ref_std[2] 
971803.3300204428

>>> 1.0 / comp_std[2]
971802.1137688879

In case it matters: since we scale the whole identical sequence by 0.001, we scale down its standard deviation as well, so std(scaled_seq) equals 0.001 * std(seq). So, we are dealing with a smaller std, and hence its inverse becomes large. Let's say the actual value is $a$ and the value with imprecision is $a + \epsilon$; then the imprecision of the inverse is:

$\frac{1}{a} - \frac{1}{a + \epsilon} = \frac{\epsilon}{a(a + \epsilon)}$, so, if $a$ is small itself, the imprecision of the inverse becomes large.
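
As a concrete (made-up) numerical example of this amplification:

a = 1e-6     # true (small) standard deviation
eps = 1e-12  # tiny absolute error in the computed standard deviation
print(1.0 / a - 1.0 / (a + eps))  # ~1.0, i.e. the error is amplified by roughly 1/a**2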

seanlaw commented 1 year ago

Can you tell me what seed and code you are using to generate T?

NimaSarajpoor commented 1 year ago

I think this should give you the T you are looking for.

def identical_cases_with_scale():
    m = 3
    zone = int(np.ceil(m / 4))

    seed = 27343
    np.random.seed(seed)

    identical = np.random.rand(8)
    T_A = np.random.rand(20)
    T_A[1 : 1 + identical.shape[0]] = identical * 0.001  
    T_A[11 : 11 + identical.shape[0]] = identical * 1000  

    ref_mp = stump(T_A, m, exclusion_zone=zone, row_wise=True)
    comp_mp = stumpy.stump(T_A, m, ignore_trivial=True)
    replace_inf(ref_mp)
    replace_inf(comp_mp)

    return T_A, ref_mp, comp_mp
seanlaw commented 1 year ago

Hmm, I wonder, in compute_mean_std, if AFTER we compute Σ_T (stddev), it would be worth:

  1. Check which stddev values are below some threshold
  2. Correct those stddevs

The reason why we use Welford is because of its speed and reasonable precision. As you identified, for the most part, what Welford produces is fine. However, for really small numbers, it's insufficient. And so maybe we add a correction where we use np.std for those specific windows?

Any thoughts?

seanlaw commented 1 year ago

So, after line #869 in core.compute_mean_std:

https://github.com/TDAmeritrade/stumpy/blob/67b822c9a6ec4a5128ad10c5bff287f0524109b5/stumpy/core.py#L860-L876

we can add something like this:

idx = np.flatnonzero(Σ_T < config.STUMPY_STDDEV_CORRECTION_THRESHOLD)  # config.STUMPY_STDDEV_CORRECTION_THRESHOLD = 1e-5
if len(idx) > 0:
    Σ_T[idx] = np.std(core.rolling_window(T, m)[idx], axis=1)

This might not work for the multi-dimensional case but I hope you can get my point

NimaSarajpoor commented 1 year ago

This might not work for the multi-dimensional case but I hope you can get my point

After I saw your suggestion, I tried it by calculating an upper bound for the std, std_ub, in Welford and then checking whether std_ub is below the threshold 1e-5. In that case, I simply did np.nanvar(..). This fixes the issue and the test function with scaled identical sequences now passes.


Now, there is one more thing we might be interested in taking care of. See test function below:

def test_stump_volatile_case1():
        seed = 0
        np.random.seed(seed)

        T = np.random.rand(64)

        T_mask = np.full(len(T), 0, dtype=bool)
        k = np.random.randint(len(T) // 4, len(T))
        IDX = np.random.choice(np.arange(len(T)), k, replace=False)

        T_mask[IDX] = True
        slices = core._get_mask_slices(T_mask)

        scale = np.random.choice(np.array([0.001, 1000]), len(slices), replace=True)

        for i, (start, stop) in enumerate(slices):
            T[start:stop] = scale[i] * T[start:stop]

        m = 3
        zone = int(np.ceil(m / 4))
        ref_mp = naive.stump(T, m, exclusion_zone=zone)
        comp_mp = stump(T, m, ignore_trivial=True)
        naive.replace_inf(ref_mp)
        naive.replace_inf(comp_mp)

        ref_P = ref_mp[:, 0].astype(np.float64)
        comp_P = comp_mp[:, 0].astype(np.float64)

        npt.assert_almost_equal(ref_P, comp_P)

Here, I randomly scaled up or down the values.

btw, I just realized a cleaner way was to do:

scale = np.random.choice(np.array([0.001, 1, 1000]), len(T), replace=True)
T = T * scale

Okay, let's get back to our main topic. This test function still fails. In fact, I used the exact mean and std, i.e. np.nanmean(rolling_window(...), axis=1) and np.sqrt(np.nanvar(rolling_window(...), axis=1))! I wanted to make sure I was not introducing any inaccuracy via these two parameters.

Here is the error:

Mismatched elements: 11 / 62 (17.7%)
E           Max absolute difference: 0.03931141
E           Max relative difference: 0.65784229

This shows that the pearson approach, i.e. using the rolling cov, is probably another source of imprecision. I haven't investigated this latter test case, so I don't know what is happening behind the scenes!


Side note: It is worthwhile to note that the matrix profile indices are the same in this case. (So, we can keep in mind that our last resort can be recalculating distances based on the nearest-neighbor indices provided in I. Similar to what we did in stumpi to retrieve the left matrix profile values)
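
For future reference, a hypothetical post-processing sketch of that last resort (not part of STUMPY; it assumes the nearest-neighbor indices in comp_mp are correct and only recomputes the distances):

from stumpy import core

l = T.shape[0] - m + 1
P = comp_mp[:, 0].astype(np.float64)
I = comp_mp[:, 1].astype(np.int64)
for i in range(l):
    j = I[i]
    if j >= 0:  # skip missing neighbors (index -1)
        P[i] = np.linalg.norm(
            core.z_norm(T[i : i + m]) - core.z_norm(T[j : j + m])
        )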

NimaSarajpoor commented 1 year ago

@seanlaw Btw, I would like to mention that the current solution, i.e. checking the std and re-calculating its exact value, may not be efficient if ALL VALUES of the time series are small, e.g. T = [0.001, 0.003, -0.002, 0, 0.001, ...]. In that case, we would end up calculating the exact value of the std for all subsequences (so, not fast)! I know this is a very rare case, but I just wanted to inform you of it. I have some ideas but they are not well developed :)

seanlaw commented 1 year ago

Yeah, I don't think we should move away from using Welford because it is not only reasonably precise but it also doesn't use up all of our memory. With nanstd, it can use a ton of memory when we have a sufficiently large T. At some point, we just can't fix all precision issues. We can try our best and leave it at that. We also have to ask ourselves how likely we are to come across strange time series data like the contrived examples that we have.

seanlaw commented 1 year ago

Alternatively, can we somehow avoid the inverse and still use cov?

NimaSarajpoor commented 1 year ago

We also have to ask ourselves how likely are we to come across strange time series data like the contrived examples that we have?

Correct! I think the current version is already good and, because of that, we should just try to improve it without drastically sacrificing speed.

Can we somehow avoid the inverse

Yeah...I think that might help if it is possible. I thought about it and even tried to re-arrange the equations in the code but that did not help. I will think about it again.

NimaSarajpoor commented 1 year ago

Update

This might not work for the multi-dimensional case but I hope you can get my point

After I saw your suggestion, I tried it by calculating an upper bound for the std, std_ub, in Welford and then checking whether std_ub is below the threshold 1e-5. In that case, I simply did np.nanvar(..). This fixes the issue and the test function with scaled identical sequences now passes.

Now, there is one more thing we might be interested in taking care of. See test function below:

def test_stump_volatile_case1():

the test function def test_stump_volatile_case1(): fails even when mean and std are calculated naively AND fastmath is turned off.

seanlaw commented 1 year ago

the test function def test_stump_volatile_case1(): fails even when mean and std are calculated naively AND fastmath is turned off.

Can you help me understand what that means and if there is anything we could do?

NimaSarajpoor commented 1 year ago

Can you help me understand what that means and if there is anything we could do?

So, I set fastmath=False to remove its impact on imprecision, if any. I also avoided using the rolling approach for calculating the "mean" and "std" of the subsequences. And... I still got the error.

I am still investigating this. I just wanted to update you and seek your opinion.

seanlaw commented 1 year ago

So, naive computation of mean and stddev just means that we are using np.mean and np.std. This suggests that the issue for test_stump_volatile_case1() is coming from somewhere else and may be unrelated to the precision of np.mean and np.std (as it relates to computing the inverse stddev as we had discussed earlier).

Is that the correct interpretation?

NimaSarajpoor commented 1 year ago

Yes...exactly.

According to my investigation, I realized the error (in that case) is coming from the error in cov. The error is about 1e-11; however, the std is small (and hence its inverse is large), and that results in considerable imprecision in pearson.

Also:

$\rho_{ij} = \frac{cov_{ij}}{\sigma_{i} \sigma_{j}}$ and we know that $-1 \le \rho_{ij} \le 1$. That means: $-\sigma_{i} \sigma_{j} \le cov_{ij} \le \sigma_{i} \sigma_{j}$

So, whenever $\sigma_{i} \sigma_{j}$ is small, $|cov_{ij}|$ becomes small as well. Let's consider two cases for now:


So, one idea can be:

# in stumpy/stump.py
std_combined = Σ_T[i + k] * σ_Q[i]
if std_combined < 1e-8:
    ...  # calculate `cov` using the direct approach rather than the rolling approach

pearson = cov / std_combined

(I did a similar thing. I did if cov < 1e-8: #calculate cov directly and it worked; however, I think std_combined < 1e-8 is better IF it works)
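
For reference, a hedged sketch of what the "calculate cov using the direct approach" branch above might look like (it assumes the sliding means M_T and μ_Q are available in scope):

if std_combined < 1e-8:
    # Recompute the covariance of the two subsequences from scratch instead of
    # relying on the value carried along the diagonal
    cov = np.mean((T[i + k : i + k + m] - M_T[i + k]) * (T[i : i + m] - μ_Q[i]))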

However, that means if the input time series T contains many many almost-flat subsequences, this can result in higher computing time.

seanlaw commented 1 year ago

Interesting!

I did a similar thing. I did if cov < 1e-8: #calculate cov directly and it worked; however, I think std_combined < 1e-8 is better IF it works

Wouldn't std_combined = Σ_T[i + k] * σ_Q[i] require passing Σ_T into _compute_diagonal? I think that's undesirable. Unless you were thinking about putting it into a different function? Do you think this will significantly impact the readability of the code?

seanlaw commented 1 year ago

It is worthwhile to note that the matrix profile indices are the same in this case. (So, we can keep in mind that our last resort can be recalculating distances based on the nearest-neighbor indices provided in I. Similar to what we did in stumpi to retrieve the left matrix profile values)

Is this always true? Is it possible to have the wrong index if the distance is wrong? If not, then maybe as a post-processing step we simply compute the $\frac{1}{\sigma_Q[i]\Sigma_T[j]}$ for the final pairs of nearest neighbors and then only recompute the distances where they are below some threshold. Again, one would need to figure out if the indices are always correct regardless of the initial distances.

However, that means if the input time series T contains many many almost-flat subsequences, this can result in higher computing time.

Frankly, if the user's data contains mostly of almost-flat subsequences then it's probably a really, really poor dataset to begin with for matrix profiles in general 😢

NimaSarajpoor commented 1 year ago

however, I think std_combined < 1e-8 is better IF it works)

I tried my proposed solution (i.e. using std_combined < 1e-8 instead of if cov < 1e-8) and it failed! I will let you know if I fix it.

Wouldn't std_combined = Σ_T[i + k] * σ_Q[i] require passing Σ_T into _compute_diagonal? Do you think this will significantly impact the readability of the code?

I don't think so. It seems we are using Σ_T_inverse and σ_Q_inverse just for calculating pearson and nothing else. So, instead of passing Σ_T_inverse and σ_Q_inverse, we should be able to just pass Σ_T and σ_Q into _stump and then into _compute_PI.

Is this always true?

Well, it was true at the time I wrote that comment :) If I remember correctly, in one of the cases in my investigation, I came across a situation where the indices of ref and comp were different. I will see if I can find that case again. If the indices are different but their corresponding distances are close, then we might still be able to rely on post-processing only.

Frankly, if the user's data contains mostly of almost-flat subsequences then it's probably a really, really poor dataset to begin with for matrix profiles in general 😢

Yeah... I meant if someone has a T with 10M data points and it contains two almost-flat subsequences of length 500k each, then, at some point during the calculation of the matrix profile, calculating cov directly might increase the time. But, as you said, it is not possible(?) to take care of everything.

seanlaw commented 1 year ago

I don't think so. It seems we are using Σ_T_inverse and σ_Q_inverse just for calculating pearson and nothing else. So, instead of passing Σ_T_inverse and σ_Q_inverse, we should be able to just pass Σ_T and σ_Q into _stump and then into _compute_PI.

Well, the reason we are passing Σ_T_inverse and σ_Q_inverse into _compute_diagonal is so that we don't need to repeatedly re-compute the inverse, which should save us a bit of time. In the case where all subsequences are finite and none of them are constant, pearson = cov * Σ_T_inverse[i + k] * σ_Q_inverse[i] is always being used. If we choose not to use the inverse then we should assess the computational cost of computing the inverse on-the-fly vs pre-computing it.

NimaSarajpoor commented 1 year ago

Well, the reason we are passing Σ_T_inverse and σ_Q_inverse into _compute_diagonal is so that we don't need to repeatedly re-compute the inverse which should save us a bit of time

Ah...right!

If we choose not to use the inverse then we should assess the computational cost of computing the inverse on-the-fly vs pre-computing it

Correct. Well, maybe I should do this first, and then, if I see that the cost is negligible, I can try to work on the imprecision issue.

seanlaw commented 1 year ago

Sounds good!

NimaSarajpoor commented 1 year ago

So, I have been working on this from time to time. There are a couple of things that I would like to share.


(1) Maybe I am missing something, but it seems there is a small inconsistency between the outputs of naive.compute_mean_std, core.compute_mean_std, and core.preprocess.

# input: 
T = np.array([0, np.nan, 2, 3, 4, 5, 6, 7, np.inf, 9])
m = 3

And, the sliding standard deviation:

>>> Σ_T_naive # from `naive.compute_mean_std`
array([1.        , 0.5       , 0.81649658, 0.81649658, 0.81649658,
       0.81649658, 0.5       , 1.        ])

>>> Σ_T_compute # from `core.compute_mean_std`
array([1.        , 0.5       , 0.81649658, 0.81649658, 0.81649658,
       0.81649658, 0.        , 0.        ])

>>> Σ_T_preprocess  # from `core.preprocess`
array([1.        , 0.5       , 0.81649658, 0.81649658, 0.81649658,
       0.81649658, 0.5       , 1.        ])

Also, note that the second element and the one-before-last element of T are both non-finite. So, it might be better to set the std of their corresponding subsequences to 0 (and then those 0s will be replaced with 1 since we need to compute their inverse).

So, I did:

is_finite = core.rolling_isfinite(T, m) # at the beginning of process
Σ_T[~is_finite] = 0 # at the end of process

And, it works. I tested test_core and test_stump.


(2) I also tried a more-expensive approach to calculate Σ_T. In short, I did something like this:

T_sqr = np.square(T)
M_T_sqr = ...  # calculate the sliding mean of np.square(T)
var_T = M_T_sqr - np.square(M_T)
Σ_T = np.sqrt(var_T)

I didn't include the check for inf / nan values in the code snippet above; however, I did consider them when implementing it. I tested it and it works, and it passes the test case where there are two identical subsequences, one multiplied by 1000 and the other by 0.001 (see this test function in PR #657).

Regarding the computing time: for n=100_000, m=50, it increases the time from 0.005 to 0.01. Note that the stump process itself takes about 14 sec (if I remember correctly), so this is negligible. For n=1M, m=100, it increases the time from 0.067 to 0.1. And, for n=10M, m=500, it increases the time from 1.68 to 3.06. That is almost a 100% increase, but compared to the main stump process it is still negligible.

What do you think?


update side note: I think my approach, i.e. using sliding mean of np.square(T) to calculate Σ_T, might(?) result in error if elements of T are large values. I have not investigated it though. But first, I just want to know how you feel about this approach.

seanlaw commented 1 year ago

Maybe I am missing something but it seems there is a small inconsistency between the outputs of naive.compute_mean_std and core.compute_mean_std, and core.preprocess.

Maybe I'm not seeing it but it looks like naive.compute_mean_std and core.preprocess produce the same results, which is expected. In both of those cases, all np.inf are converted to np.nan first and then the mean and stddev are calculated. However, core.compute_mean_std only computes the pure mean and stddev and should be the same as np.nanmean and np.nanstd.

The source of the difference is how the np.inf values are treated when computing, say, the stddev. Once an np.inf has been converted to np.nan, it is simply ignored and the stddev is computed from the remaining finite values. So, the 0.5 comes from np.std([6, 7]) (i.e. np.nanstd([6, 7, np.nan])) and the 1.0 comes from np.std([7, 9]) (i.e. np.nanstd([7, np.nan, 9]))
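
For instance, the two values in question can be reproduced directly once the np.inf entries have been replaced with np.nan:

>>> np.nanstd([6.0, 7.0, np.nan])
0.5
>>> np.nanstd([7.0, np.nan, 9.0])
1.0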

So, there is no imprecision.

Also, note that the second element and one-before-last element of T are both not finite. So, it might be better to set the std of their corresponding subsequences to 0 (and then those 0s will be replaced with 1 since we need to get their inverse).

We currently handle this in core.preprocess_diagonal

seanlaw commented 1 year ago

I also tried a more-expensive approach to calculate Σ_T. In short, I did something like this:

It's been a while and I don't remember what happened but I believe that this approach may lead to "catastrophic cancellation" and even worse precision issues.

side note: I think my approach, i.e. using sliding mean of np.square(T) to calculate Σ_T, might(?) result in error if elements of T are large values. I have not investigated it though. But first, I just want to know how you feel about this approach.

Exactly! This stuff is really nasty :(

NimaSarajpoor commented 1 year ago

Maybe I am missing something but it seems there is a small inconsistency between the outputs of naive.compute_mean_std and core.compute_mean_std, and core.preprocess.

Maybe I'm not seeing it but it looks like naive.compute_mean_std and core.preprocess produce the same results, which is expected. In both of those cases, all np.inf are converted to np.nan first and then the mean and stddev are calculated. However, core.compute_mean_std only computes the pure mean and stddev and should be the same as np.nanmean and np.nanstd.

The source of the difference is how the np.inf values are treated when computing, say, the stddev. Once an np.inf has been converted to np.nan, it is simply ignored and the stddev is computed from the remaining finite values. So, the 0.5 comes from np.std([6, 7]) (i.e. np.nanstd([6, 7, np.nan])) and the 1.0 comes from np.std([7, 9]) (i.e. np.nanstd([7, np.nan, 9]))

So, there is no imprecision.

Also, note that the second element and one-before-last element of T are both not finite. So, it might be better to set the std of their corresponding subsequences to 0 (and then those 0s will be replaced with 1 since we need to get their inverse).

We currently handle this in core.preprocess_diagonal

Thanks for the explanation! It makes sense. I can see it now!

I also tried a more-expensive approach to calculate Σ_T. In short, I did something like this:

It's been a while and I don't remember what happened but I believe that this approach may lead to "catastrophic cancellation" and even worse precision issues.

side note: I think my approach, i.e. using sliding mean of np.square(T) to calculate Σ_T, might(?) result in error if elements of T are large values. I have not investigated it though. But first, I just want to know how you feel about this approach.

Exactly! This stuff is really nasty :(

Is it okay if I go ahead and try some changes, push them to the existing PR (#657), and see how it goes? (I usually test them on some important test functions. Then I can push the commits and then test against all test functions. You can also see the code.)

seanlaw commented 1 year ago

T_sqr = np.square(T)
M_T_sqr = ...  # calculate the sliding mean of np.square(T)
var_T = M_T_sqr - np.square(M_T)
Σ_T = np.sqrt(var_T)

Can you use LaTeX and write out the exact formulas, or provide the full code? It looks like some things are missing.

seanlaw commented 1 year ago

Is it okay if I go ahead and try some changes, push them to the existing PR (https://github.com/TDAmeritrade/stumpy/pull/657), and see how it goes? (I usually test them on some important test functions. Then I can push the commits and then test against all test functions. You can also see the code.)

Sure

NimaSarajpoor commented 1 year ago

Can you use LaTeX and write out the exact formulas, or provide the full code? It looks like some things are missing.

Below, I am calculating the variance, $Var$, of a subsequence of length m. $\mu$ is the mean of the subsequence and $\mu'$ is the mean of the square of the subsequence (so $\mu'$ is just np.mean(np.square(Q)), where Q is a subsequence of length m).

$Var = E[(x-\mu)^{2}] = E[x^{2} - 2\mu x + \mu^{2}] = E[x^{2}] - 2\mu E[x] + \mu^{2} = \mu' - 2\mu \cdot \mu + \mu^{2} = \mu' - \mu^{2}$

So, the variance is the mean of the square of the subsequence minus the square of its mean (this is the same as what is provided on Wikipedia).

So, I can use the sliding mean of square of T and the sliding mean of T to calculate the sliding Var. I think the error might be less. I need to investigate it further.


Downside: if the values of T get REALLY large, then the square of T becomes a series containing even larger values, and hence maybe(?) worse precision. Again, I need to investigate it. So far, this approach passes test_core, test_stump, and test_stumped (which was failing in PR #657).
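
As a quick, made-up illustration of that downside, the $\mu' - \mu^{2}$ formulation suffers from catastrophic cancellation once T carries a large offset, whereas np.var (computed from deviations about the mean) does not:

x = 1e8 + np.random.rand(1000)                           # large offset, tiny spread
var_sq = np.mean(np.square(x)) - np.square(np.mean(x))   # mean of squares minus square of mean
print(var_sq)     # can be wildly wrong (even negative)
print(np.var(x))  # ~1/12, the variance of a uniform(0, 1) sample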

or provide the full code.

I will push the changes to the PR #657.

seanlaw commented 1 month ago

What happens when we try this test case with the Matlab code? Do they experience any issues with precision? If so, then maybe we can email Eamonn about it and see if he has any suggestions.