TDAmeritrade / stumpy

STUMPY is a powerful and scalable Python library for modern time series analysis
https://stumpy.readthedocs.io/en/latest/
Other
3.65k stars 318 forks source link

RuntimeWarning: divide by zero encountered in divide #1006

Open NimaSarajpoor opened 3 months ago

NimaSarajpoor commented 3 months ago

Running the following script

import numpy as np
import stumpy
import time

def test_extract_several_consensus():
    Ts = [np.random.rand(n) for n in [64, 128]]
    Ts_ref = [T.copy() for T in Ts]
    Ts_comp = [T.copy() for T in Ts]

    m = 20

    k = 2  # Get the first `k` consensus motifs
    for _ in range(k):
        print(f'===== {_} =====')
        radius, Ts_idx, subseq_idx = stumpy.ostinato(Ts_comp, m)

        consensus_motif = Ts_comp[Ts_idx][subseq_idx : subseq_idx + m].copy()
        for i in range(len(Ts_comp)):
            if i == Ts_idx:
                query_idx = subseq_idx
            else:
                query_idx = None

            idx = np.argmin(stumpy.core.mass(consensus_motif, Ts_comp[i], query_idx=query_idx))

            Ts_comp[i][idx : idx + m] = np.nan
            Ts_ref[i][idx : idx + m] = np.nan

            np.testing.assert_almost_equal(
                Ts_ref[i][np.isfinite(Ts_ref[i])], Ts_comp[i][np.isfinite(Ts_comp[i])]
            )

if __name__ == "__main__":
    test_extract_several_consensus()

Gives:

===== 0 =====
===== 1 =====
/Users/nimasarajpoor/miniconda3/envs/py310/lib/python3.10/site-packages/stumpy/core.py:2257: 
RuntimeWarning: divide by zero encountered in divide

I took a look at the code:

https://github.com/TDAmeritrade/stumpy/blob/e7bb2b85ee8761683480ac093437e71989252d15/stumpy/core.py#L2248-L2260

and tried to swap lines 2252 and 2253, but that did not work.


Also: If I disable JIT, I get an extra warning, saying:

python3.10/site-packages/numpy/lib/nanfunctions.py:1879: RuntimeWarning: Degrees of freedom <= 0 for slice.
seanlaw commented 3 months ago

Hmmm, this seems strange. The error is coming from line 2257:

Σ_T_inverse = 1.0 / Σ_T 

It would be good to dig into what the inputs and outputs are. It suggests that perhaps:

T_subseq_isconstant = process_isconstant(T, m, T_subseq_isconstant) 

isn't capturing/processing the right constant subsequences?

NimaSarajpoor commented 3 months ago

Right! I will take a closer look (Also, I want to see what happens to T_subseq_isconstant when a subsequence is all np.nan)

NimaSarajpoor commented 3 months ago

@seanlaw

and tried to swap lines 2252 and 2253, but that did not work.

Okay. I think this works! (Probably I made a mistake before).

isn't capturing/processing the right constant subsequences?

I can see that this is happening when we have a subsequence with all non-finite values. For example, running the following script throws the divide by zero encountered in divide warning.

import numpy as np

from stumpy import core

# based on the function `stumpy.core.preprocess_diagonal`
def compute_std_inverse(T):  
    T[~np.isfinite(T)] = np.nan
    T_subseq_isconstant = core.process_isconstant(T, m, T_subseq_isconstant=None)

    T[np.isnan(T)] = 0
    _, Σ_T = core.compute_mean_std(T, m)
    Σ_T[T_subseq_isconstant] = 1.0
    Σ_T_inverse = 1.0 / Σ_T

    return Σ_T_inverse

if __name__ == "__main__":
    n, m = 10, 3
    T = np.random.rand(n)

    nan_start = n // 2
    T[nan_start : nan_start + m] = np.nan

    compute_std_inverse(T)

we are mixing two things in the line Σ_T[T_subseq_isconstant] = 1.0 . Σ_T is computed for a pre-processed time series T (where all non-finite values are replaced with 0). However, T_subseq_isconstant is NOT computed for that pre-processed time series T.

The reason that we did not get any failure in the unit testing of stumpy.stump is because 1.0 / Σ_T will be used just for computing the pearson correlation down the road (and not the intermediate variables like cov) and, also, we do the Pearson correlation only for subsequence idx when T_subseq_isfinite[idx] is True.


To avoid similar issue in the future, we may at least throw a warning when T in process_isconstant(T, m, ...) has one/more non-finite value(s). We may say something like this:

The input time series `T` contains non-finite values. At those indices, the value of returned array is not meaningful.

If we decide to go with this, then we may need to re-think the usage of stumpy.core.fix_isconstant_isfinite_conflicts

seanlaw commented 3 months ago

Okay. I think this works! (Probably I made a mistake before).

What is the logic behind swapping the two lines? There's a reason why it is in its current order and I think the current order matters! So, I am against swapping as it will change the overall/existing logic

we are mixing two things in the line Σ_T[T_subseq_isconstant] = 1.0 . Σ_T is computed for a pre-processed time series T (where all non-finite values are replaced with 0). However, T_subseq_isconstant is NOT computed for that pre-processed time series T.

The reason that we did not get any failure in the unit testing of stumpy.stump is because 1.0 / Σ_T will be used just for computing the pearson correlation down the road (and not the intermediate variables like cov) and, also, we do the Pearson correlation only for subsequence idx when T_subseq_isfinite[idx] is True.

Sorry, I'm still not understanding the problem and how this is caused by ostinato. Please allow me some time to review the code that you shared in your first comment. Perhaps, in the meantime, you can take me step-by-step what is happening in the ostinato code above and what eventually triggers the warning when we modify the original time series with np.nan ?

NimaSarajpoor commented 3 months ago

Part 1

What is the logic behind swapping the two lines?

(I noticed that in ostinato, the story is a bit different. I explained it in the Part 2 below.) The logic is that T_subseq_isconstant should capture the constant subsequences AFTER setting non-finite values to 0.0.

This logic comes from the answer to this question: "what should one expect from Σ_T[T_subseq_isconstant] = 1.0"? The expectation is: if Σ_T[i] is 0.0, then T_subseq_isconstant[i] must be True, so that Σ_T[i] can be set to 1.0 . However, currently, this expectation is not being met for a non-finite subsequence whose finite values, if there is any, are 0.0.

Example:

T = np.array([np.nan, 0.0, np.nan])
m = 3
T_subseq_isconstant = core.process_isconstant(T, m, T_subseq_isconstant=None)
print(T_subseq_isconstant)

T[np.isnan(T)] = 0
_, Σ_T = core.compute_mean_std(T, m)
print(Σ_T)

Σ_T[T_subseq_isconstant] = 1.0
Σ_T_inverse = 1.0 / Σ_T
print('Σ_T_inverse: ', Σ_T_inverse)

Running code above gives:

T_subseq_isconstant:  [False]
Σ_T:  [0.]

RuntimeWarning: divide by zero encountered in divide Σ_T_inverse = 1.0 / Σ_T
Σ_T_inverse:  [inf]

Another example:

# running the following should give us that warning
n, m = 10, 3
T = np.random.rand(n)
T[:m] = np.nan

stumpy.stump(T, m)

Part 2

Perhaps, in the meantime, you can take me step-by-step what is happening in the ostinato code above and what eventually triggers the warning when we modify the original time series with np.nan ?

In ostinato, we can see the following flow (This is a bit different than Part 1 above!!): (1) process each time series in Ts by applying the function core.preprocess (and we also compute Ts_subseq_isconstant) (2) pass the processed data and pre-computed Ts_subseq_isconstant to _ostinato (3) _ostinato calls stump (we pass the processed time series and the pre-computed Ts_subseq_isconstant to it) (4) In stump, we call core.preprocess_diagonal and pass THAT already-processed data and the pre-computed Ts_subseq_isconstant to it. (5) core.preprocess_diagonal(T, m, ...) is shown below:

https://github.com/TDAmeritrade/stumpy/blob/0107d39b632b748dffe5f419d08517ea3f6fc866/stumpy/core.py#L2248-L2260

(Side Note: In this case, T is already processed and has ONLY finite value. More on this on Part 3).

In line T_subseq_isconstant = process_isconstant(T, m, T_subseq_isconstant) we are passing THAT processed T and the pre-computed T_subseq_isconstant. process_isconstant returns the pre-computed T_subseq_isconstant. Although the processed data T has 0.0 instead of np.nan, the T_subseq_isconstant is False for any subsequence that was originally non-finite. Therefore, for subsequence idx, T_subseq_isconstant[idx] can be False while Σ_T[idx] is 0.0. So, Σ_T_inverse = 1.0 / Σ_T can throw a warning again here.

Example:

# running the following gives us the warning
n, m = 10, 3

TA = np.random.rand(n)
TB = np.random.rand(n)
TC = np.random.rand(n)
TA[:m] = np.nan

stumpy.ostinato([TA, TB, TC], m)

Here, I also see an extra warning:

tumpy/core.py:3467: UserWarning: A large number of values in `P` are smaller than 1e-06.
For a self-join, try setting `ignore_trivial=True`.

Part 3

As mentioned in Part 2, _ostinato passes preprocessed data to stump. Therefore, the computed matrix profile is not correct for a subsequence that was originally non-finite. However, I think, later in the code, the core._mass (see line 259 below) will take care of it.

https://github.com/TDAmeritrade/stumpy/blob/0107d39b632b748dffe5f419d08517ea3f6fc866/stumpy/ostinato.py#L239-L276

This is because the corresponding value of such subsequence in M_T above is np.inf which affects the distance here:

https://github.com/TDAmeritrade/stumpy/blob/0107d39b632b748dffe5f419d08517ea3f6fc866/stumpy/core.py#L1093-L1106

So, we will eventually end up with correct value for bsf_radius.

Example: Running the following script

def func():
    Ts = [
        np.random.rand(100), 
        np.random.rand(100),
        np.random.rand(100)
    ]

    m = 3
    Ts[0][:m] = np.nan

    Ts_subseq_isconstant = None

    Ts_copy = [None] * len(Ts)
    M_Ts = [None] * len(Ts)
    Σ_Ts = [None] * len(Ts)
    if Ts_subseq_isconstant is None:
        Ts_subseq_isconstant = [None] * len(Ts)

    for i, T in enumerate(Ts):
        Ts_copy[i], M_Ts[i], Σ_Ts[i], Ts_subseq_isconstant[i] = core.preprocess(
            T, m, copy=True, T_subseq_isconstant=Ts_subseq_isconstant[i]
        )

    mp = stumpy.stump(
            Ts_copy[0],
            m,
            Ts_copy[1],
            ignore_trivial=False,
            T_A_subseq_isconstant=Ts_subseq_isconstant[0],
            T_B_subseq_isconstant=Ts_subseq_isconstant[1],
        )

    print('matrix profile value for the first subsequence in Ts[0]: ', mp[0, 0])

if __name__ == "__main__":
    func()

gives:

core.py:2257: RuntimeWarning: divide by zero encountered in divide
  Σ_T_inverse = 1.0 / Σ_T

matrix profile value for the first subsequence in Ts[0]:  0.0

However, the matrix profile value should have been np.inf.

NimaSarajpoor commented 3 months ago

(I noticed that in ostinato, the story is a bit different. I explained it in the Part 2 below.) The logic is that T_subseq_isconstant should capture the constant subsequences AFTER setting non-finite values to 0.0

I think this logic is aligned with computing M_T, Σ_T here:

https://github.com/TDAmeritrade/stumpy/blob/0107d39b632b748dffe5f419d08517ea3f6fc866/stumpy/core.py#L2248-L2260

as the line M_T, Σ_T = compute_mean_std(T, m) occurs after T[np.isnan(T)] = 0.


In core.preprocess, we do all these computation (i.e. M_T, Σ_T, AND T_subseq_isconstant) before T[np.isnan(T)] = 0 (see below)

https://github.com/TDAmeritrade/stumpy/blob/0107d39b632b748dffe5f419d08517ea3f6fc866/stumpy/core.py#L2135-L2142

I feel it make sense as long as we do all these relevant computation next to each other (and provide clear description in the docstring / comment)

seanlaw commented 3 months ago

The logic is that T_subseq_isconstant should capture the constant subsequences AFTER setting non-finite values to 0.0.

The more I read this, the more I disagree. While it might seem "smart" to do it this way, it is polluting the meaning of T_subseq_isconstant. The vector T_subseq_isconstant should ONLY be capture whether or not each subsequence within the ORIGINAL time series, T, is constant. Therefore, it is 100% correct that [np.nan, 0.0, np.nan] is NOT a constant subsequence and we should insist upon this! Otherwise, it will lead to a lot of confusion in the future if this is not maintained and kept "pure"

In ostinato, we can see the following flow (This is a bit different than Part 1 above!!): (1) process each time series in Ts by applying the function core.preprocess (and we also compute Ts_subseq_isconstant) (2) pass the processed data and pre-computed Ts_subseq_isconstant to _ostinato (3) _ostinato calls stump (we pass the processed time series and the pre-computed Ts_subseq_isconstant to it) (4) In stump, we call core.preprocess_diagonal and pass THAT already-processed data and the pre-computed Ts_subseq_isconstant to it. (5) core.preprocess_diagonal(T, m, ...) is shown below:

If the issue is that Ts_copy[i] is an already preprocessed time series then, perhaps, the solution would be to change ostinato to something like:

    Ts_copy = [None] * len(Ts)
    M_Ts = [None] * len(Ts)
    Σ_Ts = [None] * len(Ts)
    if Ts_subseq_isconstant is None:
        Ts_subseq_isconstant = [None] * len(Ts)

    for i, T in enumerate(Ts):
        _, M_Ts[i], Σ_Ts[i], Ts_subseq_isconstant[i] = core.preprocess(
            T, m, copy=True, T_subseq_isconstant=Ts_subseq_isconstant[i]
        )
        Ts_copy[i] = T.copy()

    bsf_radius, bsf_Ts_idx, bsf_subseq_idx = _ostinato(
        Ts_copy, m, M_Ts, Σ_Ts, Ts_subseq_isconstant
    )

This way Ts_copy[i] would NOT be a processed copy of Ts[i] and, instead, is a RAW copy of Ts[i]. Would that work? The ONLY reason that we are pre-computing M_Ts, Σ_Ts, Ts_subseq_isconstant is to avoid re-computing them later and, instead, we pass in those arrays. However, we can accept the small cost of making a copy of Ts[i].

Then, all of the M_Ts, Σ_Ts, Ts_subseq_isconstant would correspond correctly to the UNPROCESSED Ts_copy and we can add a comment to ostinato to remind the reader that we DON'T use a processed copy of Ts[i] because stump will process it again itself?

I haven't tested this but it captures the spirit of what I am hoping for

I think this logic is aligned with computing M_T, Σ_T here:

Yeah, I don't think I agree that "this is the logic". With the exception of converting non-finite numbers into np.nan, I am adamant that T_subseq_isconstant should reflect the truth/reality of the original raw time series!

NimaSarajpoor commented 3 months ago

Therefore, it is 100% correct that [np.nan, 0.0, np.nan] is NOT a constant subsequence and we should insist upon this! Otherwise, it will lead to a lot of confusion in the future if this is not maintained and kept "pure"

So, in that case:

(i) In core.preprocess_diagonal: should we replace

https://github.com/TDAmeritrade/stumpy/blob/3077d0ddfb315464321dc86f8ec3bf2cab9ce3b1/stumpy/core.py#L2256

with

mask =  np.logical_or(T_subseq_isconstant, Σ_T==0.0)
Σ_T[mask] = 1.0

to avoid getting warning that comes from the line Σ_T_inverse = 1.0 / Σ_T?

Note: This is to avoid getting warning in core.preprocess_diagonal and the functions that call it, likestumpy.stump. I will talk about stumpy.ostinato later below but this is just for core.preprocess_diagonal and stumpy.stump.

(ii) should we improve the docstring of core.preprocess_diagonal

https://github.com/TDAmeritrade/stumpy/blob/3077d0ddfb315464321dc86f8ec3bf2cab9ce3b1/stumpy/core.py#L2232-L2246

by saying something like this:

? (or, is that too much?)


Regarding Part 2 and Part 3 of my previous comment

This way Ts_copy[i] would NOT be a processed copy of Ts[i] and, instead, is a RAW copy of Ts[i]. Would that work? The ONLY reason that we are pre-computing M_Ts, Σ_Ts, Ts_subseq_isconstant is to avoid re-computing them later and, instead, we pass in those arrays. However, we can accept the small cost of making a copy of Ts[i].

Then, all of the M_Ts, Σ_Ts, Ts_subseq_isconstant would correspond correctly to the UNPROCESSED Ts_copy and we can add a comment to ostinato to remind the reader that we DON'T use a processed copy of Ts[i] because stump will process it again itself?

Got the point.

Note1: If we only implement

mask =  np.logical_or(T_subseq_isconstant, Σ_T==0.0)
Σ_T[mask] = 1.0

(in core.preprocess_diagonal) then we do not get warning in stumpy.ostinato when we pass the PROCESSED T to stumpy.stump. However, the issue I pointed out in "Part 3" of my previous comment remains.

Note 2: I made the following change in stumpy.ostinato

for i, T in enumerate(Ts):
        _, M_Ts[i], Σ_Ts[i], Ts_subseq_isconstant[i] = core.preprocess(
            T, m, copy=True, T_subseq_isconstant=Ts_subseq_isconstant[i]
        )
        Ts_copy[i] = T.copy()

The tests in test_ostinato.py are passing. Also, the AB-join matrix profile

https://github.com/TDAmeritrade/stumpy/blob/3077d0ddfb315464321dc86f8ec3bf2cab9ce3b1/stumpy/ostinato.py#L239-L246

(that is computed in _ostinato) is now correct. So, the concern raised in the part 3 of my previous comment is addressed.

Note 2-WARNING

In the same function _ostinato, we now pass the UNPROCESSED T to core._mass.

https://github.com/TDAmeritrade/stumpy/blob/3077d0ddfb315464321dc86f8ec3bf2cab9ce3b1/stumpy/ostinato.py#L259

This is a njit-decorated function with fastmath=True. The docstring says:

This private function assumes only finite numbers in your inputs and it is the responsibility of the user to pre-process and post-process their results if the original time series contains np.nan/np.inf values.

we may try to revise the fastmath flag (see #708).

seanlaw commented 3 months ago

mask = np.logical_or(T_subseq_isconstant, Σ_T==0.0) Σ_T[mask] = 1.0

So, I like the spirit of this and I think it is headed in the right direction. However, it seems implicit/dangerous to set all Σ_T[Σ_T==0.0] == 1.0 since we don't necessarily know WHAT is causing Σ_T==0.0. Instead, I would rather set Σ_T to 1.0 for each specific condition that we know of. For example, maybe we should be adding something like:

Σ_T[T_subseq_isconstant] = 1.0
Σ_T[~T_subseq_isfinite] = 1.0
# Add other conditions

This way, it is explicitly clear WHEN Σ_T needs to be set to 1.0. Then, we should add unit tests to ALL of those scenarios

If we only implement ... (in core.preprocess_diagonal) then we do not get warning in stumpy.ostinato when we pass the PROCESSED T to stumpy.stump. However, the issue I pointed out in "Part 3" of my previous comment remains

Again, if we are explicit in our code about WHY we are setting something to 1.0 then I am okay. Logically, is there any reason beyond:

  1. A subsequence containing non-finite numbers
  2. A subsequence having constant values

for which Σ_T would be 0.0?

seanlaw commented 3 months ago

should we improve the docstring of core.preprocess_diagonal by saying something like this: M_T is the rolling mean of processed T, where non-finite values are replaced with 0.0.

I think it's better to say ". Each subsequence in T that contains one or more non-finite values will have its corresponding M_T value set to 0.0.

T_subseq_isconstant[i] False wherever T_subseq_isfinite[i] is True.

Umm, I think you mean "T_subseq_isconstant[i] is automatically set to False whenever T_subseq_isfinite[i] is False."??

seanlaw commented 3 months ago

So, the concern raised in the part 3 of my previous comment is addressed.

Cool! I'd like to see the final code for this.

In the same function _ostinato, we now pass the UNPROCESSED T to core._mass.

I don't like this. I think we should agree that the private functions should already be preprocessed. We may need to re-think things more in order to ensure consistency

seanlaw commented 3 months ago

The tests in test_ostinato.py are passing.

Have you added the test above to the unit tests?

NimaSarajpoor commented 3 months ago

@seanlaw

it is explicitly clear WHEN Σ_T needs to be set to 1.0.

Logically, is there any reason beyond:

  1. A subsequence containing non-finite numbers
  2. A subsequence having constant values for which Σ_T would be 0.0?

There is no more reason other than those two. I agree. Being explicit helps us to know for what subsequences exactly we are setting the standard deviation to 1.0.

Umm, I think you mean "T_subseq_isconstant[i] is automatically set to False whenever T_subseq_isfinite[i] is False."??

Right! My bad 😅

In the same function _ostinato, we now pass the UNPROCESSED T to core._mass.

I don't like this.

Right. If we decide to pass UNPROCESSED data to _ostinato, it will then be passed to core._mass as well. Need to think more on it.

The tests in test_ostinato.py are passing.

Have you added the test above to the unit tests?

So, what I meant was that changing the code does not result in failing current tests. Will add new tests.