Question About Data Preprocessing

Thank you for your attention to our work.

First of all, you may want to refer to issue #3, where we list the specific operators used for the features and the labels, respectively. You can find the source code for every operator in /qlib/data/dataset/processor.py

A1: No. For features, we use RobustZScoreNorm. It computes a robust mean/std for each feature over all stocks in the training timespan, then clips outliers to the range [-3, 3]. For the test data, we also apply normalization, but the mean/std for each feature is estimated from (that is, borrowed from) the training data, so there is no data leakage. For labels, we use CSZscoreNorm. Here CS stands for Cross-Sectional, which means we group the labels by date and compute the mean/std across stocks on each date.
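To make the two normalization schemes concrete, here is a minimal NumPy sketch. It is not the Qlib implementation: the function names (fit_robust_stats, robust_zscore, cs_zscore) are hypothetical, and it assumes the "robust" mean/std are the median and a MAD-based scale (MAD * 1.4826). See processor.py for the actual operators.

```python
import numpy as np

def fit_robust_stats(train):
    """Per-feature robust center/scale, estimated from training rows only.

    Assumption: center = median, scale = MAD * 1.4826 (hypothetical helper).
    """
    median = np.nanmedian(train, axis=0)
    mad = np.nanmedian(np.abs(train - median), axis=0)
    scale = mad * 1.4826 + 1e-12  # small constant avoids division by zero
    return median, scale

def robust_zscore(x, median, scale, clip=3.0):
    """Normalize with previously fitted statistics and clip to [-clip, clip]."""
    return np.clip((x - median) / scale, -clip, clip)

def cs_zscore(labels):
    """Cross-sectional z-score: labels has shape (n_dates, n_stocks),
    and each date (row) is normalized across stocks."""
    mean = np.nanmean(labels, axis=1, keepdims=True)
    std = np.nanstd(labels, axis=1, keepdims=True)
    return (labels - mean) / (std + 1e-12)

# Toy data: rows are samples, columns are features.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [100.0, 40.0]])
test = np.array([[2.0, 25.0], [1000.0, -500.0]])

median, scale = fit_robust_stats(train)           # statistics from training data only
train_norm = robust_zscore(train, median, scale)
test_norm = robust_zscore(test, median, scale)    # reuse training statistics: no leakage

# Labels: 2 dates x 3 stocks, normalized per date.
labels = np.array([[0.01, 0.02, 0.03], [-0.1, 0.0, 0.1]])
labels_norm = cs_zscore(labels)
```

The point to notice is that the test features never contribute to median or scale, whereas the label normalization needs no fitted state at all because each date is normalized on its own.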

A2: No. We use 'Fillna', which fills NaN values with the default value 0. 'Fillna' is applied after 'RobustZScoreNorm'.
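The order of the two steps matters, as the following sketch illustrates. The helper normalize_then_fill and the toy center/scale values are hypothetical and only stand in for the real operators. Because filling happens after normalization, a filled 0 sits at the center of the normalized training distribution rather than at a raw value of 0.

```python
import numpy as np

def normalize_then_fill(x, center, scale, fill_value=0.0, clip=3.0):
    """Normalize first (NaN entries stay NaN), then fill the remaining NaNs."""
    z = np.clip((x - center) / scale, -clip, clip)
    return np.where(np.isnan(z), fill_value, z)

# Toy features with missing entries; center/scale are assumed to come
# from the training data.
x = np.array([[1.0, np.nan], [np.nan, 30.0], [3.0, 10.0]])
center = np.array([2.0, 20.0])
scale = np.array([1.0, 10.0])

out = normalize_then_fill(x, center, scale)
```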

A3: The problem of missing stocks exists because the data source is unstable. In other words, when we used automatic tools (such as crawler scripts) to collect the stock data from public sources, we did not successfully collect it for every stock on every date. Such missing data exists in both our published data and the official Qlib data.

I will also fix the confusing descriptions in the README later. For any further questions, please rely on the description in the original paper and the source code.
