MichalRIcar commented 1 year ago

Hello,

I have run into an issue where (E/S)MA produces an extremely high missing rate when just 1 observation is missing among many (e.g. 2K). I cross-checked against pd.rolling, which confirms to me that this is a bug.

Reproduction is done in 2 steps:

S1: when the missing obs are at the very beginning of the data, the results match (TALIB == PANDAS)
S2: when the missing obs appear inside the data, TALIB <> PANDAS, and TALIB may produce a 100% missing rate where PANDAS produces only 0.1%

Bug reproduction:

STEP ZERO → Function Definition

def Simulation(Missing):
    import talib
    import numpy as np
    from pandas import DataFrame as pDF

    # Missing observation position in Data
    m = Missing

    # MA period
    t = 10

    # Data
    T = np.random.rand(2000,1)
    T = pDF(T)

    # Injecting NaN (.loc slices include both endpoints, so rows m and m+1 are set)
    T.loc[m:m+1] = np.nan

    # Var
    X = T[0]

    # MA
    T.loc[:, 'SMA_TaLib']  = talib.SMA(X, timeperiod=t)
    T.loc[:, 'SMA_Pandas'] = X.rolling(t).mean()

    # MISSING
    print('TALIB Missing OBS:',T['SMA_TaLib'].isna().sum())
    print('PANDA Missing OBS:',T['SMA_Pandas'].isna().sum())

S1 → TALIB == PANDAS

Simulation(0)
Output:
TaLib Missing OBS: 11
Panda Missing OBS: 11

S2 → TALIB <> PANDAS

Simulation(1)
Output:
TaLib Missing OBS: 2000
Panda Missing OBS: 12

So a single gap at the second position of the data caused a 100% missing rate in the TA-Lib case.

mrjbq7 commented 1 year ago

The NaN will cause all further data points to be NaN:

In [7]: c = np.random.randn(100)

In [8]: ta.EMA(c)
Out[8]: 
array([        nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan, -0.20010195,
       -0.14629603, -0.19322205, -0.11284334, -0.08979806, -0.09922255,
       -0.00886839,  0.02769315,  0.01621726,  0.09114756,  0.0997015 ,
        0.06605619,  0.00544552,  0.00368205,  0.12548891,  0.25605504,
        0.29726329,  0.30825576,  0.23318661,  0.21047578,  0.0786684 ,
        0.13126033,  0.05632908,  0.09003275,  0.12241143,  0.1368664 ,
        0.18871725,  0.21911695,  0.25844934,  0.25344702,  0.27039838,
        0.22954116,  0.26607002,  0.24966657,  0.1446552 ,  0.16729002,
        0.18874245,  0.09147541,  0.08174418,  0.19589613,  0.31269955,
        0.2625159 ,  0.35677455,  0.30980691,  0.30356763,  0.27470739,
        0.24884736,  0.24725236,  0.23586413,  0.3471254 ,  0.28572394,
        0.36162036,  0.40297618,  0.30590914,  0.30621277,  0.25766893,
        0.13943467,  0.24240124,  0.27100324,  0.10478242,  0.13936646,
        0.15219173,  0.2476177 ,  0.24939736,  0.24692743,  0.26384937,
        0.21859485,  0.32376657,  0.35259057,  0.2076861 ,  0.22139453])

In [9]: c[50] = np.nan

In [10]: ta.EMA(c)
Out[10]: 
array([        nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan, -0.20010195,
       -0.14629603, -0.19322205, -0.11284334, -0.08979806, -0.09922255,
       -0.00886839,  0.02769315,  0.01621726,  0.09114756,  0.0997015 ,
        0.06605619,  0.00544552,  0.00368205,  0.12548891,  0.25605504,
        0.29726329,  0.30825576,  0.23318661,  0.21047578,  0.0786684 ,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan,
               nan,         nan,         nan,         nan,         nan])
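The propagation shown above follows from the recursive form of an EMA: each output is computed from the previous output, so once a NaN enters the recursion it never leaves. The sketch below is a minimal pure-Python illustration, assuming an EMA seeded with the mean of the first `period` values and a smoothing factor of 2 / (period + 1). `ema_sketch` is a hypothetical stand-in, not TA-Lib's actual code.

```python
import math

def ema_sketch(values, period):
    """Recursive EMA seeded with the mean of the first `period` values.

    Hypothetical helper for illustration only, not TA-Lib's implementation.
    """
    out = [math.nan] * len(values)
    if len(values) < period:
        return out
    k = 2.0 / (period + 1)                # assumed smoothing factor
    prev = sum(values[:period]) / period  # seed value
    out[period - 1] = prev
    for i in range(period, len(values)):
        # Each output depends on the previous output, so once `prev`
        # becomes NaN every later value is NaN as well.
        prev = (values[i] - prev) * k + prev
        out[i] = prev
    return out

data = [float(i) for i in range(20)]
data[10] = math.nan                       # one missing observation
result = ema_sketch(data, period=3)
nan_positions = [i for i, v in enumerate(result) if math.isnan(v)]
print(nan_positions)
```

The printed positions cover the initial lookback plus everything from the injected NaN onward, which mirrors the shape of the `ta.EMA` output in the session above.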

mrjbq7 commented 1 year ago

I do think TA-Lib does the "more correct" thing here; ignoring NaNs would seem to me to be wrong in most cases. If you need the pandas behavior, replace the NaNs before calling TA-Lib: filter them out, propagate the previous data point, interpolate, or do whatever data-specific handling you need.

mrjbq7 commented 1 year ago

You could also use pandas.DataFrame.fillna, if your data is in pandas before calling into TA-Lib...
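As a rough illustration of the pre-cleaning options mentioned in this thread (forward fill, interpolation, filtering), here is a small pandas sketch. The sample values and variable names are made up for the example, and the cleaned series is not passed to TA-Lib here so that the snippet runs without it. Which option is appropriate depends on the data.

```python
import numpy as np
import pandas as pd

# Made-up sample with one missing observation inside the data
prices = pd.Series([1.0, 2.0, np.nan, 4.0, 5.0])

filled = prices.ffill()        # carry the previous observation forward
interp = prices.interpolate()  # linear interpolation between neighbours
dropped = prices.dropna()      # remove the missing rows entirely

print(filled.tolist())
print(interp.tolist())
print(dropped.tolist())
```

Note that `dropna()` shortens the series, so the result of an indicator computed on it has to be aligned back to the original index if that matters downstream.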

MichalRIcar commented 1 year ago

I do use this workaround (pandas.DataFrame.fillna) to fill NaNs. However, I consider this behavior a bug that can lead to serious issues in ML, so it is definitely worth sharing. The pandas result is the expected behavior: 1 OBS can't produce a 100% missing rate across thousands of OBS. That goes against statistical basics, ML, and, from my point of view, common sense.

MichalRIcar commented 1 year ago

> I do think TA-Lib does the "more correct" thing here; ignoring NaNs would seem to me to be wrong in most cases. If you need the pandas behavior, replace the NaNs before calling TA-Lib: filter them out, propagate the previous data point, interpolate, or do whatever data-specific handling you need.

TA-Lib "ignores" NaNs the same way pandas does when the NaNs are at the beginning of the data, so it should behave the same way when a NaN appears anywhere inside the data.
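The distinction drawn here, leading NaNs versus NaNs inside the data, can be checked before calling an indicator. The following is a minimal sketch; `first_interior_nan` is a hypothetical helper name, not part of TA-Lib or pandas.

```python
import math

def first_interior_nan(values):
    """Return the index of the first NaN that follows a valid value, else None.

    Leading NaNs are skipped, since they only lengthen the initial lookback.
    """
    seen_valid = False
    for i, v in enumerate(values):
        if math.isnan(v):
            if seen_valid:
                return i
        else:
            seen_valid = True
    return None

leading_only = [math.nan, math.nan, 1.0, 2.0, 3.0]
interior = [1.0, math.nan, 2.0, 3.0]

print(first_interior_nan(leading_only))
print(first_interior_nan(interior))
```

A check like this could be used to raise an error or trigger cleaning when a gap is found after the first valid observation.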

mrjbq7 commented 1 year ago

What does Pandas do by default when replacing the NaN in the rolling mean calculation?

And would you expect that to work for all users of TA-Lib? There's a reason fillna() takes arguments.

In any event, this is a wrapper of the underlying C library, whose behavior I'm not going to be able to change. The only thing we could do is add some kind of automatic NaN handling, and I don't think that's a good idea.

mrjbq7 commented 1 year ago

In [12]: s = pd.Series([1,2,3,np.nan,4,5,6]); s
Out[12]: 
0    1.0
1    2.0
2    3.0
3    NaN
4    4.0
5    5.0
6    6.0
dtype: float64

In [15]: s.rolling(3).mean()
Out[15]: 
0    NaN
1    NaN
2    2.0
3    NaN
4    NaN
5    NaN
6    5.0
dtype: float64

So, it resets the rolling window and then waits for non-NaN values before generating new observations -- that's pretty nice. Too bad the underlying TA-Lib doesn't do that.
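One way to approximate this reset behavior on top of a function that does not handle NaN is to apply it separately to each contiguous run of valid values. The sketch below is pure Python; `apply_per_run` and the plain `sma` stand-in are hypothetical helpers written for this illustration. With TA-Lib, the assumption would be that `func` wraps a call such as `talib.SMA` on each NaN-free slice.

```python
import math

def sma(values, period):
    """Plain SMA over a NaN-free list; the first period-1 outputs are NaN."""
    out = [math.nan] * len(values)
    for i in range(period - 1, len(values)):
        out[i] = sum(values[i - period + 1 : i + 1]) / period
    return out

def apply_per_run(values, func):
    """Apply func separately to each contiguous run of non-NaN values."""
    out = [math.nan] * len(values)
    start = None
    for i in range(len(values) + 1):
        # Treat the end of the input like a gap so the last run is flushed.
        at_gap = i == len(values) or math.isnan(values[i])
        if not at_gap and start is None:
            start = i
        elif at_gap and start is not None:
            out[start:i] = func(values[start:i])
            start = None
    return out

data = [1, 2, 3, math.nan, 4, 5, 6]
result = apply_per_run(data, lambda run: sma(run, 3))
print(result)
```

For a simple moving average this gives the same NaN pattern as the `rolling(3).mean()` output above. For recursive indicators such as EMA it restarts the calculation after each gap, which may or may not be what a given use case needs.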

MichalRIcar commented 1 year ago

OK, thanks for the quick response. I believe this was worth sharing.

mrjbq7 commented 1 year ago

I agree. It’s something I’m used to by now, but it is surprising and clearly could be better.

I’ll make a note in the README about it.

On Wed, Sep 7, 2022 at 7:25 AM MichalRIcar @.***> wrote:

OK, thanks for the quick response, I believe worthy of sharing.


mrjbq7 commented 1 year ago

Added a note about NaN handling in b0439e6e0565647e21b1ceee245bad4fac55aad6.

TA-Lib / ta-lib-python

MA functions produces 100% NA with 1 missing OBS of 2k OBS #542
