Some differences in the code and the paper.

ForestsKing commented 10 months ago

While reading the code and attempting to reimplement it independently, I found several differences between the code and the paper:

1. The definitions of the loss function and the anomaly score do not match Equation (9) and Equation (10) in the paper. Correspondingly, the threshold is determined differently in the code than in the approach outlined in Section 4.3 of the paper.
2. The code for computing $\text{Attn}_{\mathcal{N}_i}$ and $\text{Attn}_{\mathcal{P}_i}$ does not match Equations (2) and (5) in the paper. The denominator in the paper is $\sqrt{d_{\text{model}}}$, while in the code it is $\sqrt{\frac{d_{\text{model}}}{H}}$.
3. There seems to be an error in the code when splitting the patches. For the univariate time series in Figure (a), its patch-wise and in-patch embeddings should be as shown in Figure (b) and Figure (c), respectively. The code does not do this, as shown in Figure (a); the correct version is shown in Figure (b):
4. The code does not seem to sum the representations of multiple patch sizes according to Equations (7) and (8) in the paper; instead, it sums their KL divergences when calculating the loss. As far as we know, these two operations are not equivalent.
5. Equation (3) and Equation (6) in the paper seem to be wrong.
   - The code does not concatenate the multiple heads as stated in the paper, but averages them after evaluating their respective KL divergences.
   - There is no $W_{\mathcal{N}}^O$ or $W_{\mathcal{P}}^O$ in the code. In fact, this multiplication cannot work at all. $\text{Attn}_{\mathcal{N}}$ ($\text{Attn}_{\mathcal{P}}$) has a shape of $B \times H \times N \times N$ ($B \times H \times P \times P$), and it cannot be multiplied by a $W_{\mathcal{N}}^O$ ($W_{\mathcal{P}}^O$) with a shape of $d_{\text{model}} \times d_{\text{model}}$, concatenated or not.
6. Each attention layer in the encoder has an input shape of $(BC) \times H \times P \times P$ ($(BC) \times H \times N \times N$) and an output shape of $B \times H \times (NP) \times (NP)$. Because of the inconsistent shapes, the individual attention layers cannot be connected in series, so the code uses a parallel approach and sums the KL divergences of the different attention layers. This is not mentioned at all in the paper.
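
To make the scaling point and the "sum of representations vs. sum of KL divergences" point concrete, here is a minimal pure-Python sketch. It is not taken from the repository: the helper names, the toy values of `d_model` and `H`, and the toy distributions are all illustrative assumptions, and renormalizing the summed representations is an assumption made only so that they remain valid distributions.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def kl(p, q):
    """KL(p || q) for two discrete distributions with strictly positive entries."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def normalize(xs):
    """Rescale non-negative values so they sum to 1."""
    total = sum(xs)
    return [x / total for x in xs]

# (1) Scaling: sqrt(d_model) (paper) vs. sqrt(d_model / H) (code).
d_model, H = 256, 8            # toy values, assumed for illustration
raw_scores = [4.0, 2.0, 0.0]   # hypothetical Q.K dot products for one query
paper_row = softmax([s / math.sqrt(d_model) for s in raw_scores])
code_row = softmax([s / math.sqrt(d_model / H) for s in raw_scores])
# The smaller denominator yields larger logits, hence a sharper attention row.
print("paper scaling:", paper_row)
print("code scaling: ", code_row)

# (2) Summing representations first vs. summing KL divergences.
# Two hypothetical patch sizes, each giving a pair of distributions (p, q).
p1, q1 = [0.9, 0.1], [0.1, 0.9]
p2, q2 = [0.1, 0.9], [0.9, 0.1]

sum_of_kl = kl(p1, q1) + kl(p2, q2)

# Sum the representations across patch sizes, renormalize, then take one KL.
p_sum = normalize([a + b for a, b in zip(p1, p2)])
q_sum = normalize([a + b for a, b in zip(q1, q2)])
kl_of_sum = kl(p_sum, q_sum)

print("sum of KLs:", sum_of_kl)
print("KL of sums:", kl_of_sum)
```

In this toy case the two patch sizes disagree in opposite directions, so their sum cancels out and the KL of the summed representations vanishes, while the sum of the per-patch-size KLs stays large. That is one way to see that the two orders of operation can give different losses.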

tianzhou2011 commented 10 months ago

Thank you immensely for your comprehensive implementations. It is imperative that we engage in a discussion amongst the authors to evaluate the veracity of these disparities. However, it is highly probable that we may have made some inadvertent misinterpretations in the paper, considering that the code truly reflects our final actions and decisions.

ShinoMashiru commented 1 day ago

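Hello Dr. ForestsKing! While reading the paper, I initially had doubts similar to yours. My first understanding was the following (please allow me to call it method a): a) For attention-layer computations such as Equations (2) and (5), one would normally multiply the QK dot product by a matrix $V$ of shape $N \times d/H$ ($P \times d/H$), so that the resulting $\text{Attn}_{\mathcal{N}_i}$ ($\text{Attn}_{\mathcal{P}_i}$) has shape $N \times d/H$ ($P \times d/H$). The subsequent concat operation then gives $\text{Attn}_{\mathcal{N}}$ ($\text{Attn}_{\mathcal{P}}$) a shape of $N \times d$ ($P \times d$), and the subsequent up-sampling can proceed as the paper describes. So at first I thought the problem was simply that the authors forgot to mention the V of QKV. However, when I later tried to reproduce the code and read Figure 2 of the paper carefully, I found that the attention output has shape $H \times T \times T$ with $T = NP$. Judging from your post as well, what the authors use in the code is exactly the approach you described (please allow me to call it method b). I therefore have the following two hypotheses:

1. The authors' intention matches method a), but they forgot to mention the multiplication by the matrix V in the attention computation. Once this is added, the multiplication by $W^O$ after concat and the summation in up-sampling both fit the paper's description very well. However, this is inconsistent with the released code and with Figure 2 of the paper.
2. The authors' intention matches method b). This is consistent with the code and with Figure 2, but, as you pointed out for method b, the multiplication by $W^O$ after concat described in the paper does not exist, and the up-sampling operation is also problematic.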
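
The shape argument behind methods a) and b) can be checked with a small bookkeeping sketch. This is only an illustration under assumed toy sizes (`B`, `H`, `N`, `d` are hypothetical values, and `matmul_compatible` is a helper invented here); it does not reproduce the repository's code.

```python
def matmul_compatible(x_shape, w_shape):
    """True if a tensor of shape x_shape can be right-multiplied by a 2-D weight of shape w_shape."""
    return x_shape[-1] == w_shape[0]

# Toy sizes, assumed for illustration only.
B, H, N, d = 2, 4, 10, 64
w_o = (d, d)  # shape of the output projection W^O described in the paper

# Method a): per head, softmax(QK^T / scale) @ V, then concatenate the heads.
per_head_a = (B, H, N, d // H)
concat_a = (B, N, H * (d // H))   # heads concatenated on the feature axis -> (B, N, d)

# Method b): keep only the attention maps, with no multiplication by V.
attn_b = (B, H, N, N)
concat_b = (B, N, H * N)          # heads concatenated on the last axis

a_ok = matmul_compatible(concat_a, w_o)
b_ok_concat = matmul_compatible(concat_b, w_o)
b_ok_raw = matmul_compatible(attn_b, w_o)

print("method a, concat @ W^O possible:", a_ok)
print("method b, concat @ W^O possible:", b_ok_concat)
print("method b, raw maps @ W^O possible:", b_ok_raw)
```

With these sizes only method a) yields a last dimension equal to `d`. Under method b) the product is defined only if `N` or `H * N` happens to equal `d`, which would be a coincidence of the chosen sizes rather than a property of the design.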

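I hope the authors can confirm how the inconsistency between the paper and the code came about. Is it, as in hypothesis 1, that the paper reflects the authors' real idea and the implementation deviated from it? Or is it, as in hypothesis 2, that the code reflects the team's real idea and whoever wrote the paper misunderstood it, which led to incorrect statements in the paper?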

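Which of the two methods is actually the effective one? Moreover, as with Anomaly Transformer, all of the authors' experiments use the PA adjustment (point adjustment) trick. Many researchers have questioned this trick, arguing that the reported effectiveness comes mainly from PA adjustment rather than from the model itself. The inconsistency between the paper and the code seems to deepen these doubts.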
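
For readers unfamiliar with the trick, the following is a minimal sketch of point adjustment as it is commonly described in the time series anomaly detection literature: if any point inside a contiguous ground-truth anomaly segment is flagged, the whole segment is counted as detected. The function names and the toy labels are assumptions for illustration, not code from DCdetector or Anomaly Transformer.

```python
def point_adjust(pred, label):
    """Mark a whole ground-truth anomaly segment as detected if any point in it was flagged."""
    adjusted = list(pred)
    i, n = 0, len(label)
    while i < n:
        if label[i] == 1:
            j = i
            while j < n and label[j] == 1:   # find the end of this anomaly segment
                j += 1
            if any(pred[i:j]):               # at least one hit inside the segment
                adjusted[i:j] = [1] * (j - i)
            i = j
        else:
            i += 1
    return adjusted

def recall(pred, label):
    """Fraction of truly anomalous points that are flagged."""
    hits = sum(1 for p, l in zip(pred, label) if p == 1 and l == 1)
    return hits / sum(label)

# Toy example: two anomaly segments, only one point of the first is detected.
label = [0, 1, 1, 1, 1, 0, 0, 1, 1, 0]
pred = [0, 0, 1, 0, 0, 0, 0, 0, 0, 1]
adjusted = point_adjust(pred, label)

print("recall before adjustment:", recall(pred, label))
print("recall after adjustment: ", recall(adjusted, label))
```

In this toy case a single hit in a four-point segment raises the number of true positives from one to four, which shows how the adjustment alone can lift point-wise scores.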

I hope the authors can respond to this!

ShinoMashiru commented 1 day ago

Hello to the author team. Could you please reply to the questions I raised?

tianzhou2011 commented 23 hours ago

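"considering that the code truly reflects our final actions and decisions." Please treat the code as authoritative. The results in the main table reported in the paper were all produced by the final version of the code. Because the experiments went through several rounds of design, the architecture was adjusted along the way, and the writing and the coding were done by different students, the paper does contain some statements that differ from the code.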

ShinoMashiru commented 23 hours ago

Thank you for your reply.

DAMO-DI-ML / KDD2023-DCdetector

Some differences in the code and the paper. #10