WenjieDu / SAITS

The official PyTorch implementation of the paper "SAITS: Self-Attention-based Imputation for Time Series". A fast and state-of-the-art (SOTA) deep-learning neural network model for efficient time-series imputation (impute multivariate incomplete time series containing NaN missing data/values with machine learning). https://arxiv.org/abs/2202.08516
https://doi.org/10.1016/j.eswa.2023.119619
MIT License

Question about temporal dependencies and feature correlations captured by DMSA #11

Closed Stinger-Wiz closed 1 year ago

Stinger-Wiz commented 1 year ago

Hello, I have a question about the self-attention in the paper. The N×N self-attention matrix Q·Kᵀ represents attention relationships along a single dimension of length N. Yet the paper states that "Such a mechanism makes DMSA able to capture the temporal dependencies and feature correlations between time steps in the high dimensional space with only one attention operation", i.e. one DMSA attention matrix captures attention over two kinds of dimensions at once. How does a single attention operation manage to capture both types of attention?

WenjieDu commented 1 year ago

Hi there,

Thank you so much for your attention to SAITS! If you find SAITS helpful to your work, please star⭐️ this repository. Your star is your recognition, which can let others notice SAITS. It matters and is definitely a kind of contribution.

I have received your message and will respond ASAP. Thank you again for your patience! 😃

Best,
Wenjie

How-Will commented 1 year ago

I have a time series dataset where each column represents a different time series. My understanding is that feature correlation describes the relationship between different time series, while temporal dependence describes the correlation between different time points within the same time series.

I am also puzzled about how a single attention operation can capture both temporal dependence and feature correlation.

Looking forward to your reply.

WenjieDu commented 1 year ago

Hi, first of all, thank you both @Stinger-Wiz @Will-Hor for raising this discussion.

In [^1], BRITS utilizes an LSTM to produce a history-based estimation and builds another component to produce a feature-based estimation (please refer to Section 4.3 in [^1]), and then combines both of them to form the final imputation.

We claim DMSA can capture the temporal dependencies and feature correlations between time steps with only one attention operation because, unlike BRITS, we only need a single DMSA block. The attention map already embeds the temporal dependencies between time steps. With the diagonal masks applied, as introduced in Section 3.2.1 in [^2], input values at the t-th step cannot see themselves and are prohibited from contributing to their own estimations; consequently, the estimations at the t-th step depend only on input values from the other steps. It is worth mentioning that the component in BRITS producing the feature-based estimation is specially built to consider correlations between the features of each time step, and its input is the imputed data of the current step from the LSTM cell, i.e. that component works on the feature dimension. DMSA, in contrast, works on the time dimension (this is why the captured temporal dependencies and feature correlations are both between time steps). Because DMSA's input has already been projected into a high-dimensional space (the features are fused) and SAITS does not produce the imputation at this stage, DMSA does not need a component like BRITS'.
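
To make the diagonal masking concrete, here is a minimal PyTorch sketch (not the repository's exact code; the function name, shapes, and the `-inf` fill value are illustrative) of one diagonally masked attention operation over the time dimension:

```python
import torch
import torch.nn.functional as F

def diagonally_masked_attention(q, k, v):
    """Minimal sketch of a diagonally masked self-attention step.

    q, k, v: [batch, n_steps, d_k] tensors that have already been projected
    from the feature space into the model's hidden space. The diagonal mask
    forbids each time step from attending to itself, so its estimation must
    come from the other T-1 steps.
    """
    n_steps = q.size(1)
    scores = torch.matmul(q, k.transpose(-2, -1)) / (q.size(-1) ** 0.5)  # [batch, T, T]
    diag_mask = torch.eye(n_steps, dtype=torch.bool, device=q.device)    # True on the diagonal
    scores = scores.masked_fill(diag_mask, float("-inf"))                # step t cannot see itself
    attn = F.softmax(scores, dim=-1)   # temporal dependencies live in this attention map
    return torch.matmul(attn, v), attn
```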

If you guys have new findings, you're welcome to share them with me 😊 Many thanks!

[^1]: Cao, W., Wang, D., Li, J., Zhou, H., Li, L., & Li, Y. (2018). BRITS: Bidirectional Recurrent Imputation for Time Series. NeurIPS 2018.
[^2]: Du, W., Cote, D., & Liu, Y. (2023). SAITS: Self-Attention-based Imputation for Time Series. Expert Systems with Applications.

Stinger-Wiz commented 1 year ago

How should we understand the phrase "DMSA works in the time dimension"? If the attention map represents the attention between time steps, where is the correlation between features reflected?

Looking forward to your reply🌹

WenjieDu commented 1 year ago

Hi, thank you for your patience. The input of DMSA is fused information from the features in a d_model-dimensional space (here we ignore the multi-head splitting for simplicity). "Fused" means that information from the different features is mixed together, i.e. it leaks into each other. Please note that this is the key point, and it differs from BRITS' feature-regression module, which operates in the original space with n_features dimensions. Therefore, when DMSA runs, it extracts correlations between features from the other T-1 steps to estimate the missing part as well as possible.

Note that the feature correlation here is different from the temporal dependency: the latter represents the temporal correlations between time steps and is embedded in the attention map, while the former means that DMSA manipulates the fused information and utilizes the information leakage to extract correlations between features. The truth is that such feature-correlation extraction is implicit, unlike the explicit attention map or BRITS' feature regression, so it may be confusing to some of our readers. I'd like to thank you again for raising this issue. You can validate the explanation above by explicitly appending a feature-regression module to the DMSA block; in my previous experiments, it brought no accuracy improvement, only extra parameters.
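
As a toy illustration of the "fused information" point (the shapes and variable names below are hypothetical, not the repository's exact code), the input embedding maps each time step's n_features values into one d_model-dimensional vector before any attention runs, so feature information is already mixed when DMSA attends across time steps:

```python
import torch
import torch.nn as nn

# Hypothetical sizes for illustration only.
batch, n_steps, n_features, d_model = 32, 48, 10, 256

x = torch.randn(batch, n_steps, n_features)   # observed values (missing-value masks omitted here)

# This linear embedding fuses the n_features values of each step into a single
# d_model-dimensional vector; attention then mixes these fused vectors across
# time steps, which is where the implicit feature-correlation extraction happens.
embed = nn.Linear(n_features, d_model)
x_fused = embed(x)                            # [batch, n_steps, d_model]
```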

WenjieDu commented 1 year ago

Hi, guys @Stinger-Wiz @Will-Hor, does my previous reply sound reasonable to you? If you have any other questions about this issue, feel free to tell me :-)

Stinger-Wiz commented 1 year ago

Thank you for your detailed explanation; my question about this point is now resolved. Thank you for your assistance :-)

WenjieDu commented 1 year ago

@Stinger-Wiz My pleasure. Also thank you very much for your attention to SAITS! If you think it is inspiring or helpful to your work, please star🌟 the repo to help more people notice this work. Also please take a look at our new work PyPOTS which may be useful. 😃 Many thanks for your contribution again!

How-Will commented 1 year ago

Thank you very much. Your reply really helped me a lot.