Closed nutquant closed 1 month ago
Thank you for your attention to our work.
First of all, you may want to refer to issue#3, where we list the specific operators used for the features and for the labels. The source code for every operator is in /qlib/data/dataset/processor.py
A1: No. For features, we use RobustZScoreNorm. It computes a robust mean/std for each feature over all stocks in the training timespan, then clips outliers to the range [-3, 3]. We also normalize the test data, but the mean/std for each feature is estimated from (borrowed from) the training data, so there is no data leakage. For labels, we use CSZscoreNorm. Here CS stands for Cross-Sectional, which means we group the labels by date and compute the mean/std across stocks on each date.
A2: No, samples are not deleted. It uses 'Fillna', which fills NaN values with the default value 0. 'Fillna' is applied after 'RobustZScoreNorm'.
A3: The missing stocks exist because the data source is unstable. In other words, when we used automatic tools (such as crawler scripts) to collect the stock data from public sources, we did not successfully collect it for every stock on every date. Such missing data exists in both our published data and the official Qlib data.
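To make A1 to A3 concrete, here is a minimal pandas sketch of the steps described above. It is not the code from processor.py: the function names are hypothetical, the index level name `datetime` is an assumption about how the data is indexed, and using the median and a scaled median absolute deviation as the "robust" mean/std is also an assumption of this sketch.

```python
import numpy as np
import pandas as pd

EPS = 1e-12  # guards against division by zero for constant features


def fit_robust_stats(train: pd.DataFrame):
    """Estimate per-feature statistics over all stocks and dates in the training span.

    Assumption: "robust mean/std" is taken here as the median and 1.4826 * MAD.
    """
    median = train.median(axis=0)                # NaNs are skipped by default
    mad = (train - median).abs().median(axis=0)
    scale = mad * 1.4826 + EPS
    return median, scale


def apply_robust_zscore(df: pd.DataFrame, median, scale, clip=3.0, fill_value=0.0):
    """Normalize with training statistics, clip to [-clip, clip], then fill NaN."""
    z = ((df - median) / scale).clip(-clip, clip)
    return z.fillna(fill_value)                  # filling happens after normalization


def cs_zscore(labels: pd.Series, date_level: str = "datetime") -> pd.Series:
    """Cross-sectional z-score: mean/std are computed across stocks on each date."""
    grouped = labels.groupby(level=date_level)
    return (labels - grouped.transform("mean")) / grouped.transform("std")


def stocks_per_date(df: pd.DataFrame, date_level: str = "datetime") -> pd.Series:
    """Count the rows (stocks) available on each date, e.g. to spot missing stocks."""
    return df.groupby(level=date_level).size()
```

In this sketch, `fit_robust_stats` is called once on the training features, and the returned statistics are reused in `apply_robust_zscore` for both the training and test features, which is what keeps test information out of the normalization. `stocks_per_date` can be run on the raw data to see on which dates fewer than 300 stocks are present.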
I will also fix the confusing descriptions in the README later. For any further questions, please refer to the description in the original paper and the source code.
Thank you very much for your work; it has been a great help to me. However, I have a few questions regarding data preprocessing:
Q1: Daily Z-Score Normalization: Does this involve calculating the mean and standard deviation of the features on a daily basis and then normalizing the features based on these daily statistics? Could you please confirm if my understanding is correct?
Q2: Removal of NA Features: How is this process carried out? Does it include deleting the corresponding samples?
Q3: I noticed that on certain dates, the number of stocks in the CSI 300 data is less than 300, often missing data for 20 to 30 stocks. What is the reason for these missing data?