TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.62k stars 318 forks source link

A mistake in the docstring of function `preprocess_diagonal` in `stumpy/core.py` #675

Closed NimaSarajpoor closed 1 year ago

NimaSarajpoor commented 1 year ago

I have been exploring different functions in stumpy/core.py that are related to "mean" and "std" of T, and I noticed that the docstring of function preprocess_diagonal needs to be revised.

According to the docstring: "Every subsequence that contains at least one NaN or inf value will have a False value in its T_subseq_isfinite bool arra and it will also have a mean of np.inf. "

However, the latter part is not correct. In fact, the calculated sliding mean has finite values.

Example:

import numpy as np
import stumpy

T = np.array([np.nan, 1, 2.5, 3, 4, 1, 2, np.inf, 5])
m = 3

out = stumpy.core.preprocess_diagonal(T, m)
>>> out[1]
array([1.16666667, 2.16666667, 3.16666667, 2.66666667, 2.33333333,
       1.        , 2.33333333])

An opportunity to simplify the (pre-)process functions:

I THINK we do not need to be worried about the values of M_T and Σ_T for non-finite values IF we just use T_subseq_is_finite to handle subseqs with non-finite values. In other words, we probably just need the following line for the preprocessing step:

# line 1754
T, T_subseq_isfinite = preprocess_non_normalized(T, m)

In fact, I think we can just use the name new_preprocess as it should work in both normalized and non-normalized case. So:

# whether it is normalized or non-normalized case
T, T_subseq_isfinite = new_preprocess(T_origin, m) # which is basically doing what function `preprocess_non_normalized` does

And, that's it. We now compute sliding mean and std on new T (which has just finite values) and we handle non-finite values with T_subseq_isfinite (and maybe using np.isfinite(T_origin) as well).

Maybe I missed something in my explanation above, but I hope it is clear enough to deliver my point :) It should probably be safe to get rid of preprocess(T, m) function now as all we need is the function new_preprocess.

seanlaw commented 1 year ago

it will also have a mean of np.inf.

I think this may have been copied over from core.preprocess and not fixed. Would you mind updating this? Also, can you please fix the typo from arra to array

I hope it is clear enough to deliver my point

Sorry, I'm not able to get your point. 😢

NimaSarajpoor commented 1 year ago

think this may have been copied over from core.preprocess and not fixed. Would you mind updating this? Also, can you please fix the typo from arra to array

Will do so :)

Also, you may want to change the name of the function preprocess_non_normalized. The docstring says: "Preprocess a time series that is to be used when computing a non-normalized ...". However, in stump, we use function preprocess_diagonal which calls the function preprocess_non_normalized! So, I am not sure if it is okay to use the term non_normalized in the name of function (This is somehow related to what I am about to say below)


Sorry, I'm not able to get your point.

If we take a look at stump module, we can see that it uses preprocess_diagonal(T_A, m) function, which returns some outputs including the preprocessed T_A (let's call it T_A'), sliding mean of T_A', μ_Q, and T_A_subseq_isfinite.

We can see that in function _compute_diagonal (in stumpy/stump.py), we use T_subseq_isfinite to avoid the subseqs that have nan or inf values. The calculated μ_Q has just finite values even for those subseqs with nan or inf values. but, that is okay. I think this is a very good approach. I think we can do the same thing in other places.

For instance, currently we have this:

# `stumpy/motifs.py`
# in function `motifs`
T, M_T, Σ_T = core.preprocess(T[np.newaxis, :], m)

# `stumpy/aamp_motifs.py`
# in function `aamp_motifs`
T, T_subseq_isfinite = core.preprocess_non_normalized(T[np.newaxis, :], m)

Alternatively, we may do:

T, T_subseq_isfinite = preprocess_non_normalized(T, m) # in BOTH motifs (normalized) and aamp_motifs (non-normalized) version 

# And if it is a z-normalize case, i.e. motifs.py, we add the following line as well:
M_T, Σ_T = compute_mean_std(T, m) # because we need mean and std for normalized case. (easy to read compared to preprocess)

Advantages: (1) The only thing we need to remember is that T_subseq_is_finite is our main reference for handling subseqs with inf/nan.

(2) I think this can help with the challenge discussed in #636. Because, now we can have _my_private_func(T, m, T_subseq_is_finite, M_T, Σ_T) for normalized and _my_private_func(T, m, T_subseq_is_finite) for non-normalized case.

(3) Also, right now, we have this function:

# in `stumpy/core.py`
def preprocess(T, m):
    T = _preprocess(T)
    check_window_size(m, max_size=T.shape[-1])
    T[np.isinf(T)] = np.nan
    M_T, Σ_T = compute_mean_std(T, m)
    T[np.isnan(T)] = 0

    return T, M_T, Σ_T

However, I think we do not need to replace inf with nan and compute its M_T and Σ_T. All we need is to first obtain T_subseq_is_finite, then replace nan/inf with 0, and move forward to do the rest of calculation.

NimaSarajpoor commented 1 year ago

@seanlaw Btw, it is possible that I am overthinking again‌ 😀

seanlaw commented 1 year ago

Also, you may want to change the name of the function preprocess_non_normalized. The docstring says: "Preprocess a time series that is to be used when computing a non-normalized ...". However, in stump, we use function preprocess_diagonal which calls the function preprocess_non_normalized! So, I am not sure if it is okay to use the term non_normalized in the name of function (This is somehow related to what I am about to say below)

I don't see a problem here

seanlaw commented 1 year ago

Btw, it is possible that I am overthinking again‌

I think I get your point. For now, I think we are over-analyzing it but I am open to leaving this issue open in case we find that this can become a bigger benefit.

NimaSarajpoor commented 1 year ago

don't see a problem here

I was referring to the fact that it is somehow used in stump, which is for normalized version. Never mind :) I think I have the "overthinking" disease 😃