Open cryptocoinserver opened 2 years ago
Just a note: there is a danger of lookahead/data leakage when normalization is computed over the whole dataset. It therefore needs to be done carefully inside the environment at each step (with a limited lookback). I saw that some environments already use normalization: https://github.com/AI4Finance-Foundation/FinRL-Meta/blob/203bb7d3f890220bb3e82bc5e34b65051a0b61dc/finrl_meta/env_crypto_trading/env_multiple_crypto.py#L94
According to the paper on ResearchGate (cited below), the tanh estimator is the most promising.
A good explanation of the lookahead problem, suggesting an expanding or rolling window: https://stats.stackexchange.com/questions/442739/look-ahead-bias-induced-by-standardization-of-a-time-series/462976#462976
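To make the rolling-window idea concrete, here is a minimal sketch of a lookahead-safe z-score normalization in pandas. The function name and window size are illustrative, not from the FinRL-Meta code; the key point is that the statistics at step t are computed only from data up to t.

```python
import numpy as np
import pandas as pd

def rolling_zscore(prices: pd.Series, window: int = 50) -> pd.Series:
    """Normalize each value using only the trailing `window` observations.

    Because the rolling mean/std at time t never include data after t,
    this avoids the lookahead bias that whole-dataset normalization causes.
    The first `window - 1` values are NaN (not enough history yet).
    """
    mean = prices.rolling(window, min_periods=window).mean()
    std = prices.rolling(window, min_periods=window).std()
    return (prices - mean) / std

# Example on a synthetic random-walk price series:
prices = pd.Series(np.cumsum(np.random.randn(500)) + 100.0)
z = rolling_zscore(prices, window=50)
```

An expanding window (`prices.expanding(min_periods=window)`) works the same way and uses all history up to t instead of a fixed lookback.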
Yes. It will greatly influence the normalization. What about adding a column 'if_leaking' to flag the affected data? In the normalization process, we would then ignore rows where 'if_leaking' == true. Do you have any ideas on how to solve it?
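A minimal sketch of that 'if_leaking' idea: fit the normalization statistics only on rows that are not flagged, then apply them to the whole frame. The column name comes from the comment above; the function name and min-max choice are illustrative.

```python
import pandas as pd

def fit_minmax_ignoring_leaks(df: pd.DataFrame, col: str):
    """Compute min/max for `col` using only rows not flagged as leaking."""
    clean = df.loc[~df["if_leaking"], col]
    return clean.min(), clean.max()

df = pd.DataFrame({
    "close": [10.0, 12.0, 11.0, 100.0],
    "if_leaking": [False, False, False, True],  # last row is contaminated
})
lo, hi = fit_minmax_ignoring_leaks(df, "close")
df["close_norm"] = (df["close"] - lo) / (hi - lo)
```

Note that this only keeps flagged rows out of the fitted statistics; it does not by itself prevent lookahead if the statistics are still fitted over future rows, so it would be combined with the rolling/expanding-window approach above.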
Adding normalization to the data preprocessor might be a great feature:
Min-Max normalization
Decimal scaling normalization
Z-score normalization
Median normalization
Sigmoid normalization
Tanh estimators
Bhanja, Samit & Das, Abhishek. (2018). Impact of Data Normalization on Deep Neural Network for Time Series Forecasting. ResearchGate
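Of the techniques listed, the tanh estimator is the one the paper finds most promising. A minimal sketch using the standard Hampel-style formula, 0.5 * (tanh(0.01 * (x - mu) / sigma) + 1), which maps values into (0, 1); note that in an environment, mu and sigma would have to come from past data only (see the lookahead discussion above), while here they are taken over the whole array just for illustration.

```python
import numpy as np

def tanh_estimator(x: np.ndarray) -> np.ndarray:
    """Tanh estimator normalization, squashing values into (0, 1).

    Values near the mean map to ~0.5; outliers are compressed smoothly
    toward 0 or 1, which makes this more robust than plain min-max scaling.
    """
    mu, sigma = x.mean(), x.std()
    return 0.5 * (np.tanh(0.01 * (x - mu) / sigma) + 1.0)

data = np.array([10.0, 12.0, 11.0, 100.0])
normed = tanh_estimator(data)
```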
These are more advanced / adaptive approaches: