BUAA-WJR / PriceGraph

Open code for PriceGraph
MIT License

CAAN module #2

gillmac13 opened this issue 2 years ago

gillmac13 commented 2 years ago

I have studied your paper and the accompanying code, and I have been able to reproduce your results on a reduced list of 160 stock files. I do have one comment regarding the CAAN module. According to both the paper and the code, the CAAN module performs its self-attention computation within each batch of decoder outputs. If we suppose that the batch size (256) is smaller than the number of stocks (300), the first batch covers the first 256 stocks of day D, while the second batch covers the remaining 44 stocks of day D followed by the first 212 stocks of day D+1, so the self-attention mixes samples from two different trading days.

This process goes on, with a shift of 44 samples at each subsequent batch, and its consequence is a look-ahead bias: placed in a real-time trading context, we would need to know some of tomorrow's values to compute today's outputs.

I think the only way to avoid this situation would be to perfectly align (date-wise) the list of stocks with valid data to the batch size. Going back to 2010, that would eliminate perhaps a third of the stocks. The random validation set would also have to be disabled in order to preserve the exact alignment.

To further test my comment and the overall architecture, I have disconnected the CAAN module. But in that case, the training accuracy does not go past 65%, and the testing accuracy never rises above 52%, i.e. it does not generalize at all. I am well aware that I may be completely wrong (in fact, I hope so). I'd be very glad to know your thoughts on this subject...
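
A minimal sketch of the misalignment I mean, assuming the samples are flattened as one row per (day, stock) pair and sorted by day (all names and values here are illustrative, not taken from the repo):

```python
# Illustrative sketch of the batch/day misalignment, assuming the data
# is one row per (day, stock) pair, sorted by day. All values are made
# up for illustration; this is not the repo's loader code.

n_stocks = 300    # stocks with valid data on a given trading day
batch_size = 256  # training batch size
n_days = 4

# day index of every sample, in the order the loader would see them
days = [d for d in range(n_days) for _ in range(n_stocks)]

for start in range(0, len(days), batch_size):
    batch = days[start:start + batch_size]
    if len(set(batch)) > 1:
        # the CAAN self-attention inside this batch attends across
        # these days, i.e. "today's" samples see "tomorrow's"
        print(f"batch at {start}: mixes days {sorted(set(batch))}")
```

With 300 stocks and a batch size of 256, every batch after the first spans two trading days, and the day boundary shifts by 300 - 256 = 44 samples per batch, which is the shift mentioned above.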

q-learning-trader commented 1 year ago

Interesting observation @gillmac13. Did you go further with your analysis?

> To further test my comment and the overall architecture, I have disconnected the CAAN module.

What does that mean, exactly?

putskan commented 1 year ago

Any news on that? @gillmac13

kwan-0915 commented 5 months ago

@gillmac13 I think you are correct. At the very beginning everything goes fine (creating the price graph, the price CI, the price embedding), but the way the authors create the validation set makes things go wrong (see trainer.py, lines 122-125). The random masking ends up with current-day data mixed up with future-day data.

For example (assuming we have only 3 stocks), the first 10 elements in self.train["day"] look like [20200103, 20200103, 20200103, 20200104, 20200104, 20200104, 20200105, 20200105, 20200105, 20200106]; after the random masking for the validation set, the first 10 elements in train_data["day"] could be [20200103, 20200103, 20200103, 20200104, 20200104, 20200105, 20200105, 20200107, 20200108, 20200110]. The batch size no longer matters, since the data is already misaligned: if the batch size is 1, there is only a single sample per batch, so no cross-attention can be performed; if the batch size is >= 2, the second batch has [20200103, 20200104] as input, which mixes current and future data.
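
A hypothetical reconstruction of that effect, using the same mechanism with made-up values (this is not the exact code from trainer.py):

```python
import numpy as np

rng = np.random.default_rng(0)

# 3 stocks per day, as in the example above
days = np.repeat([20200103, 20200104, 20200105, 20200106], 3)

# a random boolean mask carves out the validation set; the mechanism
# mirrors the split described above, but the ratio is made up
val_mask = rng.random(len(days)) < 0.3
train_days = days[~val_mask]

print(train_days)
# The surviving training rows no longer fall on day boundaries, so
# with batch size >= 2 a single batch (and hence a single CAAN
# self-attention pass) can contain both 20200103 and 20200104.
```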

The idea proposed in this paper might still be helpful. If anyone is interested, please check which other papers cite this one and see how they construct their models; a solid code sample is better than any words in a paper.
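
For anyone who wants to experiment, here is a sketch of the date-aligned batching suggested earlier in this thread (my own sketch, not code from the repo): group samples by trading day, so that every CAAN self-attention pass sees exactly one day's cross-section and no future data.

```python
from collections import defaultdict

def day_batches(days):
    """Yield lists of sample indices, one batch per trading day,
    in chronological order."""
    groups = defaultdict(list)
    for i, d in enumerate(days):
        groups[d].append(i)
    for d in sorted(groups):
        yield groups[d]

days = [20200103, 20200103, 20200103, 20200104, 20200104, 20200104]
for batch in day_batches(days):
    print(batch)  # [0, 1, 2], then [3, 4, 5]: no cross-day mixing
```

The trade-off, as noted above, is that every stock in the universe must have valid data on every date, which shrinks the usable stock list.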