DAMO-DI-ML / KDD2023-DCdetector


Is the [v2] version of your paper wrong, while [v1] is right? #23

Closed Leopold2333 closed 11 months ago

Leopold2333 commented 11 months ago

I noticed that the authors uploaded a new version of the DCdetector paper to arXiv on 11 Oct. Have the authors really confirmed that this version of the paper is πŸ†—? From the perspective of model design, the [v1] version makes more sense logically, especially the loss-function part! It's just that this so-called "Official Code" does not match the paper well. All the changes in paper version [v2] appear to be responses to questions raised by readers before 18 Sept. I am still confused by the loss function in this project: it is still the same as Anomaly Transformer's, and there are no other branches of this repository, so NO SUBSTANTIAL CHANGES have been made to the current code at all ⁉️ 😨 Shouldn't the authors change the code for the sake of the paper, rather than change the paper for the sake of the code❓

tianzhou2011 commented 11 months ago

We have made revisions to the draft in order to address misunderstandings regarding the methodology. The code provided reflects the actual implementation used in our experiments, and it is the code that produced the reported results. The draft changed because version 1 contained some inaccuracies in the method description. Rest assured, our code remains unchanged — yes, there have been no changes to the code. It is important to note that any modification could compromise the reproducibility of our experiments. Moreover, we did not employ the same loss as Anomaly Transformer; we solely use a contrastive loss. We encourage you to examine the code carefully.


Leopold2333 commented 11 months ago


The values "series_loss" and "prior_loss" do come from a different calculation process than Anomaly Transformer's. However, the loss function in the [v1] version seems to have the right logic, because it is reasonable to make the KL divergence between two representations of the same normal series points as small as possible, while the loss in [v2] does not. What confuses me is this: for a deep learning model, it is not reasonable for the loss value of every epoch to always be "0.", right? The code still gets a zero loss during training. If this is true, what is the insight behind the minus operation? Even though the author explained this in #7, it is still confusing: KL divergence itself already measures the similarity of the two representations, so why add another "minus" operation that makes the loss always zero?

Leopold2333 commented 11 months ago

[screenshot: training log where the loss prints 0.0 every epoch]

Even the vali_loss stays at 0.0; any difference between runs seems to come only from the random seed not being fixed.

tianzhou2011 commented 11 months ago

It is important to clarify that the value is not exactly zero. We encourage you to look into the concept of "stop gradient": this operation allows training to continue even though the loss appears to be zero, because gradients still flow through the non-detached branch. Alternatively, you can test the effect of stop gradient in a small sample script. In all honesty, during our study we did not closely monitor the training loss; we focused solely on the valid/test F1 metric. So we too found it perplexing when someone pointed out that the training loss was 0, but a knowledgeable reader suggested that this behavior is due to the stop-gradient operation in this repository. Regarding the use of minus: we conducted extensive testing with various approaches, including using plus and using only one term, but minus consistently yielded better results. We encourage you to test the plus and single-term variants for comparison. Our assumption is that the minus helps bring the two representations closer to each other, although there may be other explanations for this phenomenon.
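For readers following this thread: the zero-loss-with-nonzero-gradient behavior described above can be reproduced with a tiny NumPy sketch. This uses hypothetical toy vectors and an element-wise symmetric KL, not the repository's actual patch-wise representations; "stop gradient" is emulated by holding a detached copy constant during differentiation.

```python
import numpy as np

def kl(p, q):
    # KL divergence of two discrete distributions
    return np.sum(p * (np.log(p) - np.log(q)))

def sym_kl(p, q):
    # symmetric KL, as in the two-branch contrastive terms
    return kl(p, q) + kl(q, p)

rng = np.random.default_rng(0)
p = rng.random(4); p /= p.sum()   # toy "patch-wise" representation
n = rng.random(4); n /= n.sum()   # toy "in-patch" representation

p_sg, n_sg = p.copy(), n.copy()   # stop-gradient (detached) copies

# DCdetector-style loss: both terms are numerically identical,
# so the printed loss is exactly zero every step.
loss = sym_kl(p, n_sg) - sym_kl(p_sg, n)
print(loss)  # 0.0

# Yet the finite-difference gradient w.r.t. p is nonzero, because the
# detached copies are held fixed while p is perturbed.
eps = 1e-6
grad = np.zeros_like(p)
for i in range(len(p)):
    pp = p.copy(); pp[i] += eps
    grad[i] = (sym_kl(pp, n_sg) - sym_kl(p, n_sg)) / eps
print(np.abs(grad).max() > 0)  # True
```

This is why the log can show a loss of 0.0 while the parameters still receive meaningful updates.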


Leopold2333 commented 11 months ago

Thank you for the answer, I will read up on the principle behind the stop_grad part!

Leopold2333 commented 11 months ago

I recently read some relevant material on loss functions and how models get updated; here is my understanding (and guessing). When computing $\mathcal{L_P}$ and $\mathcal{L_N}$, the KL-divergence formula already generates gradients, namely one pointing from $\mathcal{P}$ toward $\mathcal{N}$ and one from $\mathcal{N}$ toward $\mathcal{P}$. These gradient directions naturally reduce the gap between the two representations, and the "minus" operation does not change that much, so either "plus" or "minus" could work.

However, with the "plus" operation the model has to update its parameters using two gradients with completely opposite directions, which adds extra complexity, because it also has to learn to balance the two learning directions. With the "minus" operation, things change: the gradient direction of $\mathcal{L_N}$ is from $\mathcal{N}$ to $\mathcal{P}$, while the minus sign flips the direction of $\mathcal{L_P}$ so that it also points from $\mathcal{N}$ to $\mathcal{P}$, the inverse of its original direction. This (unexpectedly) unifies the gradient directions of the two losses to some extent, which further assists convergence. So the choice between "plus" and "minus" really does make a difference, and the claim seems πŸ†—. I used to think a model only needs to be trained with the goal of reducing the loss value, but now it seems just as important to analyze the gradient directions. This paper is indeed interesting. Thanks to the authors for their contribution🀠🀠🀠
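This gradient-direction intuition can be probed numerically. Below is a hedged toy check (NumPy, small random distributions treated as free positive parameters — not the repository's actual association computation): compare the update direction of the $\mathcal{P}$-branch term against the $\mathcal{N}$-branch term with and without the minus sign.

```python
import numpy as np

def kl(p, q):
    return np.sum(p * (np.log(p) - np.log(q)))

def sym_kl(p, q):
    return kl(p, q) + kl(q, p)

def num_grad(f, x, eps=1e-6):
    # numerical gradient of a scalar function at x
    g = np.zeros_like(x)
    for i in range(len(x)):
        xp = x.copy(); xp[i] += eps
        g[i] = (f(xp) - f(x)) / eps
    return g

rng = np.random.default_rng(1)
p = rng.random(4); p /= p.sum()
n = rng.random(4); n /= n.sum()

# gradient of the P-branch term w.r.t. p (n held constant, i.e. detached)
g_p = num_grad(lambda x: sym_kl(x, n), p)
# gradient of the N-branch term w.r.t. n (p held constant),
# with and without the minus sign
g_n_plus = num_grad(lambda x: sym_kl(p, x), n)
g_n_minus = -g_n_plus

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print("plus :", cos(g_p, g_n_plus))   # negative: the two updates oppose
print("minus:", cos(g_p, g_n_minus))  # positive: the two updates align
```

In this toy setting the element-wise gradients of the two symmetric-KL terms have opposite signs, so "plus" produces opposing update directions while "minus" aligns them, which is consistent with the argument above. Whether this is the full explanation for DCdetector's behavior is, of course, only a conjecture.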

tianzhou2011 commented 11 months ago

Thank you for sharing your insights into the problem ~~ By the way, I am curious whether you happen to be a student interested in pursuing an internship at DAMO Academy. The way you explained the gradient directions is truly fascinating, and your enthusiasm for this field is contagious. If you have any further questions, please don't hesitate to reach out to me at @.***

Tian
