Dataset code - Githubissues

w1nn1ethepooh commented 2 months ago

Thank you for your fancy job!

I would like to ask whether there is any source code for generating the datasets dl_train, dl_valid, and dl_test.

Have a nice day!

LITONG99 commented 1 month ago

Thank you for your attention to our work. We are not planning to publish the data infrastructure, for the reasons already explained. If you need to process raw data, we highly recommend reusing the Qlib implementations. Here is the configuration:

    infer_processors:
      - class: RobustZScoreNorm
        kwargs:
          fields_group: feature
          clip_outlier: true
      - class: Fillna
        kwargs:
          fields_group: feature
    learn_processors:
      - class: DropnaLabel
      - class: DropExtremeLabel
        kwargs:
          percentile: 0.975
      - class: CSZscoreNorm
        kwargs:
          fields_group: label

Please note that, except for DropExtremeLabel, the above configuration is used for many models in qlib/examples/benchmarks, and we did use the Qlib implementations to produce the published dl_train, dl_valid, and dl_test. DropExtremeLabel is implemented in our commercial codebase, but it should be easy to implement in Qlib as well, since it follows a simple rule: drop the highest/lowest 2.5% of labels.
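For readers who want to reproduce the DropExtremeLabel step, here is a minimal pandas sketch of the rule described above. It is not the authors' implementation (that one lives in their commercial codebase). It assumes that the trimming is done per trading day, that it is symmetric (labels ranked below 1 - percentile or above percentile within the day are dropped), and that the frame has a flat `label` column with a `datetime` index level. The function name `drop_extreme_label` and its parameters are hypothetical.

```python
import pandas as pd


def drop_extreme_label(df, percentile=0.975, label_col="label", date_level="datetime"):
    """Drop rows whose label falls in either extreme tail of its own trading day.

    Assumption (not confirmed by the authors): the rule is applied per day and
    symmetrically, so percentile=0.975 removes roughly the top and bottom 2.5%.
    """
    # Percentile rank of each label within its own day, in the range (0, 1].
    rank_pct = df[label_col].groupby(level=date_level).rank(pct=True)
    # Keep only rows whose rank lies inside [1 - percentile, percentile].
    keep = rank_pct.between(1.0 - percentile, percentile)
    return df[keep]


if __name__ == "__main__":
    # Toy data: two days, 100 hypothetical instruments each.
    dates = pd.to_datetime(["2020-01-02", "2020-01-03"])
    instruments = [f"S{i:03d}" for i in range(100)]
    index = pd.MultiIndex.from_product(
        [dates, instruments], names=["datetime", "instrument"]
    )
    labels = [float(i) for i in range(100)] + [1000.0 + i for i in range(100)]
    toy = pd.DataFrame({"label": labels}, index=index)
    print(len(toy), "->", len(drop_extreme_label(toy)))
```

To use this inside a Qlib pipeline, the same logic would presumably go into a custom processor class referenced from learn_processors. Note that Qlib handlers typically use two-level column names (a feature group and a label group), so the column selection would need to be adapted accordingly.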

ElonJustin7 commented 1 month ago

Hi, thank you for your outstanding work!

I'd like to ask about the "Mask" in the information regarding market indices (such as 000300) in your dataset. What does it refer to? Thank you!

caozhiy commented 1 month ago

I think it is a Qlib data operator 'qlib.data.ops.Mask'. You can refer to https://qlib.readthedocs.io/en/latest/reference/api.html#module-qlib.data.ops for more details.

SJTU-Quant / MASTER

Dataset code #3