bukosabino / ta

Technical Analysis Library using Pandas and Numpy
https://technical-analysis-library-in-python.readthedocs.io/en/latest/
MIT License
4.36k stars 924 forks source link

Some indicators seem to be looking into the future #181

Open heinen opened 4 years ago

heinen commented 4 years ago

Hey folks,

Either I don't understand how ta works or I've found a bug. I would like to calculate a history of all technical indicators for a lot of stocks. It seems, though, as if ta also looks into the future for some of the indicators calculated in add_all_ta_features. I've constructed a minimum example to illustrate my problem using ta.momentum.kama:

import pandas as pd
import ta

close = pd.Series([56.77 , 57.31 , 56.83 , 56.58 , 56.52 , 55.5  , 55.99 , 55.51 ,
       55.91 , 55.55 , 55.83 , 55.71 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 57.17 ,
       56.56 , 56.09 , 56.72 , 55.99 , 55.83 , 57.33 , 56.17 , 56.24 ,
       55.89 , 55.34 , 55.81 , 56.49 , 56.9  , 57.69 , 58.78 , 59.27 ,
       57.72 , 58.56 , 58.51 , 59.31 , 59.1  , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 , 60.21 ,
       60.21 , 46.71 , 58.175])

split_index = int(len(close)/2)
close_half = close.copy()
close_half= close_half.iloc[:split_index]

kama_tmp = ta.momentum.kama(close=close, fillna=True)
kama_tmp_half = ta.momentum.kama(close=close_half, fillna=True)

assert kama_tmp_half.equals(kama_tmp.iloc[:split_index])

This assertion fails. As far as I can see, the only way for it to fail is if kama looks at close values in the future. Am I missing something here? I.e.: Are technical indicators something else than different kinds of moving averages over the past? When using add_all_ta_features, I have similar issues with volume_vpt, trend_vortex_ind_pos, trend_vortex_ind_neg, trend_vortex_ind_diff, trend_trix, trend_dpo, trend_kst, trend_kst_sig, trend_kst_diff, trend_visual_ichimoku_a, trend_visual_ichimoku_b, others_dr.

Interestingly, when setting fillna=False, the above example does not fail the assertion. When using add_all_ta_features with fillna=False, though, I still have the same we-seem-to-be-looking-into-the-future-problem but with different features (tavolume_vpt, ta__trend_kst, tatrend_kst_sig, tatrend_kst_diff, tatrend_visual_ichimoku_a, tatrend_visual_ichimoku_b, ta__momentum_kama, taothers_dr). A minimal example for those is a bit more complex to construct but possible if necessary.

andybrewer commented 4 years ago

Not sure if this is related but Kama has this line: kama = self._check_fillna(kama, value=self._close) here: https://github.com/bukosabino/ta/blob/master/ta/momentum.py#L296

Most other functions pass 0 or 1 for the value param of _check_fillna, whereas Kama passes the entire close series.

_check_fillna uses ffill: https://github.com/bukosabino/ta/blob/master/ta/utils.py#L23, which in a timeseries, should take the earliest known value and propagate that forward. So, it's possible there's a bug, but it also seems possible that it's pushing past data into the future, rather than bringing future data into the present.

Also, the comment here (https://github.com/bukosabino/ta/blob/master/ta/utils.py#L14) about a -1 value triggering bfill doesn't seem accurate. According to https://github.com/bukosabino/ta/blob/master/ta/utils.py#L23 it looks like it simply replaces empty values with -1. Regardless, backfilling data if it was happening would seem to potentially be "looking ahead" at closing prices by bringing future data into the past.

Interestingly, ichimoku indicators use a -1 value for _check_fillna as seen here: https://github.com/bukosabino/ta/blob/master/ta/trend.py#L340. However, if fillna is set to False, these values shouldn't even matter.

Interesting bug though unless I'm missing something obvious.

ghost commented 4 years ago

Sounds very serious. Unintentional forward looking is the worst kind of bug in this type of library, IMHO. I haven't looked into the ones you mention, nor any other, but I know that the visual Ichimoku is supposed to look into the future. It's controlled by the visual argument, which shifts n2 if True (default is False).

EDIT: for anyone who might be interested, I implemented a simple check based on @heinen's code above in my pipeline:

def perform_TIs_forward_lookingness_check(df):
    """ Check for potential forward lookingness of TIs """
    test_len = 5000
    df_small = df.iloc[:test_len*2].copy()
    df_smaller = df.iloc[:test_len].copy()

    df_small = add_technical_features(df=df_small)
    df_smaller = add_technical_features(df=df_smaller)

    ti_cols = [col for col in df_small.columns if 'ti_' in col]
    for ti_col in ti_cols:
        assert df_smaller[ti_col].equals(df_small[ti_col].iloc[:test_len]), f'TI {ti_col} potentially forward looking'
    return

where function add_technicalfeatures appends new columns prefixed with "ti" to an OHLC df. As an example, the assert fails on ta.trend.IchimokuIndicator.ichimoku_a() if visual = True but not for visual=False.

myilmazhaubtech commented 3 years ago

I am personally very surprised to notice that the Ichimoku indicator has a serious data leak problem and it defeats the purpose of using a library like this. Given that feature importance may reveal an issue like this, what happens if there are other data leak issues in this library? This may kill an entire research effort.

In addition to Ichimoku the DPO indicator is also leaking data. But how do we know there is not more leakage in the rest of the indicators?

GerardBCN commented 3 years ago

There are few hidden gems in this library. Some I could find:

  1. trend_kst fills initial nan values by the mean of all close values
  2. in kama he mistakenly uses np.roll instead of shifting the series and dealing with the initial nans from shifting. np.roll specifically shifts the series forward but fills the first nans with the last values of the series. From their docs: Elements that roll beyond the last position are re-introduced at the first.
aldanor commented 3 years ago

It might be nice to have a MLStrategy as a default without any data leaks (e.g. kst/kama/ichimoku/dpo excluded or reconfigured).