Feature Extraction Bug: FFT Data Leakage causing Fake Result

The FFT, as computed in this project, should not be used as a feature.

The FFT is computed over the whole dataset, so the values assigned to earlier rows depend on future data. For example, if you remove the last row of the price data, all of the FFT values change.
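The claim above can be checked with a small sketch. It runs on synthetic data, not the project's data: the helper name `lowpass_reconstruction`, the random-walk prices, and `keep=3` are all assumptions made for illustration.

```python
import numpy as np

def lowpass_reconstruction(prices, keep=3):
    """Zero all but the `keep` lowest frequencies on each side, then invert."""
    spectrum = np.fft.fft(np.asarray(prices, dtype=float))
    spectrum[keep:-keep] = 0
    return np.abs(np.fft.ifft(spectrum))

# Synthetic random-walk "prices" (an assumption, not the project's data)
rng = np.random.default_rng(0)
prices = 100 + np.cumsum(rng.normal(size=200))

full = lowpass_reconstruction(prices)
truncated = lowpass_reconstruction(prices[:-1])  # last row removed

# Compare only the rows that both series share
first_row_change = abs(full[0] - truncated[0])
max_change = np.max(np.abs(full[:-1] - truncated))
print(f"change at the first row: {first_row_change:.6f}")
print(f"largest change over shared rows: {max_change:.6f}")
```

Any non-zero change at the first row means that row's feature value depends on a price that arrives 199 steps later.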

Any good accuracy the Author achieves is therefore mainly the result of data leakage.

The following code contains data leakage.

The original section: Link

close_fft = np.fft.fft(np.asarray(data_FT['GS'].tolist()))
fft_df = pd.DataFrame({'fft':close_fft})
fft_df['absolute'] = fft_df['fft'].apply(lambda x: np.abs(x))
fft_df['angle'] = fft_df['fft'].apply(lambda x: np.angle(x))

The whole project becomes invalid because of this data leakage: every step after it is garbage in, garbage out (GIGO). Even a plain MLP model can produce good results with this kind of leakage.

A possible solution:

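import numpy as np
import pandas as pd

df = ...  # The price data, with a 'close' column

periods = [3, 6, 9]
index_data = []
data = {}  # one list of values per feature column
for p in periods:
    data[f'abs_{p}'] = []
    data[f'angle_{p}'] = []

# Calculate the FFT only on the rows seen so far.
# Caution: range(1, len(df)) should start later in practice. The early windows
# are too short for a meaningful FFT, and for windows of at most 2 * p rows
# the slice fft_list[p:-p] is empty, so nothing is filtered.
for i in range(1, len(df)):
    # Rows 0..i-1 only: the feature stored at df.index[i] never sees row i
    # or anything after it.
    window = df['close'].iloc[:i]
    index_data.append(df.index[i])
    fft_close = np.fft.fft(window.values)

    for p in periods:
        # Keep the p lowest-frequency components on each side of the spectrum
        fft_list = np.copy(fft_close)
        fft_list[p:-p] = 0

        # Only the last point of the reconstruction is used as the feature
        final_fft = np.fft.ifft(fft_list)
        absolute = np.abs(final_fft)[-1]
        angle = np.angle(final_fft)[-1]

        data[f'abs_{p}'].append(absolute)
        data[f'angle_{p}'].append(angle)

# Collect the features, aligned to the rows they belong to
fft_features = pd.DataFrame(data, index=index_data)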

With the features computed this way, you will notice a huge difference: they DO NOT capture the same movement shown in the Author's figure. This shows that the project's reported performance is based on data leakage.
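One way to confirm that an expanding-window feature is leak-free is to recompute it on a shortened series and compare the overlapping rows. The following is a minimal sketch on synthetic data, not the Author's or the project's code. The function name `expanding_fft_features`, the `min_window` parameter, and the random-walk prices are hypothetical and chosen for illustration.

```python
import numpy as np
import pandas as pd

def expanding_fft_features(close, periods=(3, 6, 9), min_window=30):
    """Feature at position i uses only close[:i] (rows strictly before i)."""
    values = np.asarray(close, dtype=float)
    rows = {}
    for i in range(min_window, len(values)):
        spectrum = np.fft.fft(values[:i])
        row = {}
        for p in periods:
            filtered = spectrum.copy()
            filtered[p:-p] = 0  # keep only the lowest frequencies
            last = np.fft.ifft(filtered)[-1]
            row[f"abs_{p}"] = np.abs(last)
            row[f"angle_{p}"] = np.angle(last)
        rows[i] = row
    return pd.DataFrame.from_dict(rows, orient="index")

# Synthetic random-walk prices (an assumption, not the project's data)
rng = np.random.default_rng(1)
close = 100 + np.cumsum(rng.normal(size=120))

features_full = expanding_fft_features(close)
features_short = expanding_fft_features(close[:100])  # drop the last 20 rows

# Positions 30..99 exist in both results; they must match if nothing leaks
shared = features_full.loc[features_short.index]
leak_free = np.allclose(shared.to_numpy(), features_short.to_numpy())
print("past features unchanged by future rows:", leak_free)
```

Here `min_window=30` is larger than twice the largest period, so the slice `filtered[p:-p]` is never empty. The same comparison applied to an FFT over the whole series would fail, because every row's value changes when rows are dropped.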

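Caution: Separating the training and testing data before computing the Author's original FFT feature still causes data leakage. Within each split, the FFT is calculated over the whole split, so the value at every row still depends on the rows that come after it. To avoid leakage, the FFT at each time step must be calculated only from data that has already been 'seen' at that step.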
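A short sketch of why splitting first does not help, again on synthetic data. The 200/100 split, the helper name `whole_split_feature`, and the size of the shock are assumptions for illustration. Only the final test row is changed, yet the feature at the first test row moves.

```python
import numpy as np

def whole_split_feature(split, keep=3):
    """Low-pass FFT reconstruction computed over an entire split at once."""
    spectrum = np.fft.fft(np.asarray(split, dtype=float))
    spectrum[keep:-keep] = 0
    return np.abs(np.fft.ifft(spectrum))

# Synthetic random-walk prices, split before any feature is computed
rng = np.random.default_rng(2)
prices = 100 + np.cumsum(rng.normal(size=300))
train, test = prices[:200], prices[200:]

original = whole_split_feature(test)

# Change only the final test row, which lies in the future of all other test rows
shocked_test = test.copy()
shocked_test[-1] += 10.0
shocked = whole_split_feature(shocked_test)

change_at_first_test_row = abs(shocked[0] - original[0])
print(f"change at the first test row: {change_at_first_test_row:.6f}")
```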

borisbanushev / stockpredictionai, issue #363