borisbanushev / stockpredictionai

In this notebook I will create a complete process for predicting stock price movements. Follow along and we will achieve some pretty good results. For that purpose we will use a Generative Adversarial Network (GAN) with LSTM, a type of Recurrent Neural Network, as generator, and a Convolutional Neural Network, CNN, as a discriminator. We use LSTM for the obvious reason that we are trying to predict time series data. Why do we use a GAN, and specifically a CNN as a discriminator? That is a good question: there are special sections on that later.

Feature Extraction Bug: FFT Data Leakage causing Fake Result #363

Open nova-land opened 1 year ago

nova-land commented 1 year ago

The FFT should not be considered a proper feature.

The FFT is computed over the whole dataset, so the earlier values are affected by future data. For example, if you remove the last row of the price data, all of the FFT values change.
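The effect is easy to reproduce. The snippet below is a minimal sketch on a synthetic random walk (not the project's GS data; the series and seed are assumptions for illustration). Each DFT coefficient is a sum over all N rows, so dropping a single row changes the transform.

```python
import numpy as np

rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=500))  # synthetic random walk, not real prices

fft_full = np.fft.fft(prices)
fft_short = np.fft.fft(prices[:-1])  # same series with the last row removed

# Compare the coefficients at the positions both transforms share
changed = ~np.isclose(fft_full[:-1], fft_short)
print(f"{changed.sum()} of {changed.size} coefficients changed")
```

Any feature built from these coefficients (absolute value, angle, or an inverse transform of a truncated spectrum) inherits the same dependence on the last row.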

If the Author achieves good accuracy, it is mainly because of data leakage.

The following code contains the data leakage.

The original section: Link

close_fft = np.fft.fft(np.asarray(data_FT['GS'].tolist()))
fft_df = pd.DataFrame({'fft':close_fft})
fft_df['absolute'] = fft_df['fft'].apply(lambda x: np.abs(x))
fft_df['angle'] = fft_df['fft'].apply(lambda x: np.angle(x))

This leakage alone can invalidate the whole project: every later step is garbage in, garbage out (GIGO). Even a plain MLP model can show good results with a leaked feature like this.


A possible solution:

import numpy as np
import pandas as pd

df = ...  # The price data: a DataFrame with a 'close' column

periods = [3, 6, 9]
index_data = []
data = {}  # one list of values per feature column
for p in periods:
    data[f'abs_{p}'] = []
    data[f'angle_{p}'] = []

# Calculate the FFT using only the rows before the current one.
# Caution: range(1, len(df)) should start later in practice, because the earliest windows are too short to give a useful FFT value.
for i in range(1, len(df)):
    window = df['close'].iloc[:i]  # rows 0..i-1 only, nothing from the future
    index_data.append(df.index[i])
    fft_close = np.fft.fft(window.values)

    for p in periods:
        # Keep the first p and last p coefficients (the lowest frequencies)
        fft_list = np.copy(fft_close)
        fft_list[p:-p] = 0

        final_fft = np.fft.ifft(fft_list)
        absolute = np.abs(final_fft)[-1]
        angle = np.angle(final_fft)[-1]

        data[f'abs_{p}'].append(absolute)
        data[f'angle_{p}'].append(angle)

# Collect the features, indexed by the row each value belongs to
features = pd.DataFrame(data, index=index_data)

With this approach you will notice a huge difference: the features WILL NOT capture the same movement shown in the Author's figure. This shows that the project's reported performance is based on data leakage.
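The caution in the code above about short early windows can also be handled with a fixed-length trailing window. The following is a rough sketch of that variant, not the code from the issue: the function name `trailing_fft_features`, the window length of 64, and the synthetic prices are all assumptions made for illustration.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

def trailing_fft_features(close, window, keep):
    """For each row t >= window, low-pass close[t-window:t] and return the
    magnitude and phase of the last reconstructed point.
    No value from row t or later is used."""
    close = np.asarray(close, dtype=float)
    # Row j holds close[j : j + window]; drop the final row so that every
    # window ends strictly before the row it is assigned to.
    windows = sliding_window_view(close, window)[:-1]
    spectrum = np.fft.fft(windows, axis=-1)
    spectrum[:, keep:-keep] = 0  # keep the first and last `keep` coefficients
    smooth = np.fft.ifft(spectrum, axis=-1)[:, -1]
    return np.abs(smooth), np.angle(smooth)

rng = np.random.default_rng(1)
close = 100 + np.cumsum(rng.normal(size=300))  # synthetic prices, illustration only
abs_9, angle_9 = trailing_fft_features(close, window=64, keep=9)
print(abs_9.shape)  # one value for each of rows 64..299
```

Every value is computed from the same number of past rows, so the first `window` rows simply get no feature instead of a poorly estimated one.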

Caution: Splitting the data into training and testing sets before applying the Author's original FFT feature will still cause data leakage. The FFT value for a row must be calculated only from data already 'seen' at that row. Otherwise, it uses every row of whichever set it is given, including the rows that come later.
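One way to test a feature for this kind of leakage is to check that its values for past rows stay the same when the later rows are removed. The sketch below does that on synthetic data; the helper names (`whole_series_feature`, `expanding_feature`, `is_causal`) and the parameters `keep` and `min_window` are hypothetical and chosen only for this example.

```python
import numpy as np

def whole_series_feature(close, keep=9):
    """Low-pass reconstruction from the FFT of the entire input (leaky)."""
    spectrum = np.fft.fft(np.asarray(close, dtype=float))
    spectrum[keep:-keep] = 0
    return np.abs(np.fft.ifft(spectrum))

def expanding_feature(close, keep=9, min_window=32):
    """Value at row i uses only close[:i]; rows before min_window are NaN."""
    close = np.asarray(close, dtype=float)
    out = np.full(len(close), np.nan)
    for i in range(min_window, len(close)):
        spectrum = np.fft.fft(close[:i])
        spectrum[keep:-keep] = 0
        out[i] = np.abs(np.fft.ifft(spectrum)[-1])
    return out

def is_causal(feature_fn, close, cut):
    """True if values for rows < cut are unchanged when rows >= cut are removed."""
    full = feature_fn(close)[:cut]
    past_only = feature_fn(close[:cut])
    return bool(np.allclose(full, past_only, equal_nan=True))

rng = np.random.default_rng(2)
close = 100 + np.cumsum(rng.normal(size=400))  # synthetic prices, illustration only

train_part = close[:300]  # the check runs inside a "training split" only
print("whole-series FFT causal:", is_causal(whole_series_feature, train_part, cut=250))
print("expanding-window FFT causal:", is_causal(expanding_feature, train_part, cut=250))
```

On this data the whole-series feature should fail the check even though it never touches the held-out rows, while the expanding-window feature should pass, because each of its values is computed from earlier rows only.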