JLDeng / ST-Norm


How to use it in transformer? #1

Open kpmokpmo opened 3 years ago

kpmokpmo commented 3 years ago

Hi, thanks for your work.

Just several quick questions here:

  1. When embedding the S/T-norm blocks into the transformer baseline, should I discard or keep the original layer/group norm?
  2. It seems that your paper and 'Data Normalization for Bilinear Structures in High-Frequency Financial Time-series' are somewhat similar. Just curious whether there is any main difference I didn't notice.

Thank you very much!

JLDeng commented 3 years ago

Hi, thanks for your interest.

  1. In my experience, you can keep the original layer, but it may depend on your task.
  2. Thanks for the pointer. I have just checked that paper, and I think the basic idea is similar. One of the major differences is that the normalized features should be combined with the original features and then fed to the following operations; otherwise the forecasting results would not be good (see the sketch below).
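
A minimal sketch of that combination pattern (the `SNorm`/`TNorm` modules and the 1x1 fusion conv below are illustrative stand-ins, not the exact implementation in this repository):

```python
import torch
import torch.nn as nn

class SNorm(nn.Module):
    """Illustrative spatial normalization: normalize each time step across nodes."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):  # x: (batch, channels, nodes, time)
        mean = x.mean(dim=2, keepdim=True)
        var = x.var(dim=2, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

class TNorm(nn.Module):
    """Illustrative temporal normalization: normalize each node across time."""
    def __init__(self, channels, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma = nn.Parameter(torch.ones(1, channels, 1, 1))
        self.beta = nn.Parameter(torch.zeros(1, channels, 1, 1))

    def forward(self, x):  # x: (batch, channels, nodes, time)
        mean = x.mean(dim=3, keepdim=True)
        var = x.var(dim=3, keepdim=True, unbiased=False)
        return self.gamma * (x - mean) / torch.sqrt(var + self.eps) + self.beta

# Combine the normalized views WITH the original features, then project back
# before the following operations.
channels = 32
snorm, tnorm = SNorm(channels), TNorm(channels)
fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)

x = torch.randn(8, channels, 207, 12)            # (batch, channels, nodes, time)
h = fuse(torch.cat([x, snorm(x), tnorm(x)], dim=1))
```
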
JLDeng commented 3 years ago

In addition, I notice that they only applied normalization to the input data. Our work demonstrates that this operation can be generalized to the latent space.
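
As an illustration of that point, the same normalize-and-fuse block can sit inside every layer and operate on hidden features rather than only on the raw input (again a hedged sketch, reusing the hypothetical `SNorm`/`TNorm` modules from the previous snippet):

```python
class STNormLayer(nn.Module):
    """One latent-space layer: S/T-normalize the hidden features, fuse, then transform."""
    def __init__(self, channels):
        super().__init__()
        self.snorm = SNorm(channels)
        self.tnorm = TNorm(channels)
        self.fuse = nn.Conv2d(3 * channels, channels, kernel_size=1)
        self.tconv = nn.Conv2d(channels, channels, kernel_size=(1, 3), padding=(0, 1))

    def forward(self, h):  # h: (batch, channels, nodes, time) hidden features
        h = self.fuse(torch.cat([h, self.snorm(h), self.tnorm(h)], dim=1))
        return torch.relu(self.tconv(h))

# Stacking several such layers applies the normalization to latent features,
# not just to the input data.
backbone = nn.Sequential(*[STNormLayer(32) for _ in range(4)])
```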

kpmokpmo commented 3 years ago

Thank you for the quick reply! I still want to double-check the design: if the attention block has the following structure:

```python
# S + T norm & concat + conv
x = x + self.drop_path(self.attn(self.norm1(x)))
x = x + self.drop_path(self.mlp(self.norm2(x)))
```

then I think self.norm1 at least partly duplicates the role of the S/T-norm layer. Please correct me if I shouldn't insert the ST-norm here at all. Many thanks.
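
To make the question concrete, the placement in mind would be something like the sketch below, with `self.norm1`/`self.norm2` kept as the original layer norms; the `snorm`, `tnorm`, and `fuse` modules are hypothetical (e.g. `fuse = nn.Linear(3 * dim, dim)`), and whether `norm1` then duplicates their role is exactly the open question:

```python
def forward(self, x):  # x: (batch, tokens, dim)
    # Hypothetical ST-norm step: concatenate original and S/T-normalized
    # features, then project back to the model dimension.
    x = self.fuse(torch.cat([x, self.snorm(x), self.tnorm(x)], dim=-1))

    # Original pre-norm transformer block, layer norms kept.
    x = x + self.drop_path(self.attn(self.norm1(x)))
    x = x + self.drop_path(self.mlp(self.norm2(x)))
    return x
```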