Yes, this is correct. As we point out in Section 4.1 of our paper, this is a choice that amounts to applying an additional consistency regularisation objective. This is key to getting better performance, not only for DER, but also for all other rehearsal-based methods (especially ER).
Thanks a lot for the lightning-fast response. I will have a look at the papers you cite in Sec. 4.1. Although I haven't read them yet, intuitively what you said makes a lot of sense. However, when mathematically writing down the loss term with the following notation:
Let f(x, \theta) be the network producing logits in R^d, let x' denote the transformed (augmented) input, and let \theta' denote the parameters after some gradient steps. The loss being optimized at that moment can then be written as
Err(x, x', \theta, \theta') = || f(x', \theta') - f(x, \theta) ||
By the triangle inequality, this can be upper bounded as
Err(x, x', \theta, \theta') \leq || f(x', \theta') - f(x, \theta') || + || f(x, \theta') - f(x, \theta) ||
= (consistency loss at \theta') + (dark experience distillation of the logits)
Therefore, is there a way of mathematically showing that minimizing Err also leads to the minimization of the sum on the RHS?
Once again, thanks for the quick response!
I'm not sure I understand the question correctly: are you asking how one can show that minimising the first expression (what we apply) also drives down the RHS of the second expression (the sum of the consistency and DER losses)?
I do not really think that we should see the combination of the two losses as a straightforward sum, but rather as a kind of composition: we are computing DER loss on top of the application of consistency regularisation.
Since we are assuming that the model should learn to disregard augmentations when producing a response, DER loss is all the more valid when distinct augmentations are at play.
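To make this composition concrete, here is a minimal sketch of the replay term under these assumptions. The names (`der_replay_term`, `buffer.sample`) are illustrative rather than the actual der.py API: the stored logits were recorded for one view of a sample, while the current network is queried on a freshly augmented view, so matching the two enforces consistency across augmentations on top of the distillation.

```python
import torch
import torch.nn.functional as F

# Minimal sketch of the replay term discussed above (illustrative, not the
# exact der.py code). The buffer is assumed to return raw (non-augmented)
# images together with the logits recorded when they were inserted.
def der_replay_term(net, buffer, transform, alpha, batch_size):
    raw_imgs, stored_logits = buffer.sample(batch_size)           # hypothetical API
    aug_imgs = torch.stack([transform(img) for img in raw_imgs])  # fresh stochastic views
    cur_logits = net(aug_imgs)
    # Matching current logits on an augmented view against logits recorded for a
    # (generally different) view is what induces the implicit consistency loss.
    return alpha * F.mse_loss(cur_logits, stored_logits)
```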
Thanks for the response. I might not have properly conveyed my question. Quoting the line from Sec. 4.1:
"It is worth noting that combining data augmentation with our regularization objective enforces an implicit consistency loss, which aligns predictions for the same example subjected to small data transformations."
The first expression, Err(x, x', \theta, \theta') = || f(x', \theta') - f(x, \theta) ||, is optimized during training (in addition to the cross-entropy loss on the current examples), while the RHS below is an upper bound on || f(x', \theta') - f(x, \theta) ||. That is,
Err(x, x', \theta, \theta') \leq || f(x', \theta') - f(x, \theta') || + || f(x, \theta') - f(x, \theta) ||
"Since we are assuming that the model should learn to disregard augmentations when producing a response"
My question is from a mathematical point of view: while in practice the model does learn to disregard the augmentation, is there a mathematical way to show that optimizing Err (in addition to the cross-entropy loss on the current task) also leads to the minimization of the consistency loss?
Hello,
I have a question related to der.py and buffer.py, specifically about the transforms applied for data augmentation. The following are the transformations used in DER for split CIFAR-10.
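(The snippet below sketches the training transform in its usual form; the exact normalization statistics are my assumption rather than a verbatim quote of the repository.)

```python
from torchvision import transforms

# Typical split CIFAR-10 training transform (the exact normalization values
# in the repository may differ).
TRANSFORM = transforms.Compose([
    transforms.RandomCrop(32, padding=4),   # stochastic: a different crop every call
    transforms.RandomHorizontalFlip(),      # stochastic: flips with probability 0.5
    transforms.ToTensor(),
    transforms.Normalize((0.4914, 0.4822, 0.4465),
                         (0.2470, 0.2435, 0.2615)),
])
```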
When we store samples in the buffer, we always save the non-augmented inputs and the corresponding logits, as shown in the following snippet.
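(The snippet below is a simplified sketch of that step, with hypothetical variable names, rather than the verbatim der.py code: the logits are computed on the augmented batch used for the cross-entropy loss, while the images saved alongside them are the raw, non-augmented ones.)

```python
import torch.nn.functional as F

# Simplified sketch of the training/storage step (not the exact der.py code).
def observe_step(net, buffer, inputs, labels, not_aug_inputs):
    outputs = net(inputs)                     # `inputs` is the augmented batch
    loss = F.cross_entropy(outputs, labels)   # CE loss on the current task
    buffer.add_data(examples=not_aug_inputs,  # raw images, no augmentation applied
                    logits=outputs.data)      # logits frozen at insertion time
    return loss
```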
Now, when we call get_all_elements or get_data, the set of transformations is applied to the non_augmented examples.
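(A conceptual sketch of that retrieval step, with simplified names rather than the verbatim buffer.py code:)

```python
import torch

# Conceptual sketch of buffer retrieval (simplified, not the exact buffer.py
# code): the stored raw examples are re-augmented on the fly, so every call
# draws a fresh random crop/flip.
def get_data(examples, logits, size, transform, device="cpu"):
    idx = torch.randperm(examples.shape[0])[:size]
    # The stochastic transform is applied per sample to the *raw* stored image;
    # in the real code the tensor may first be converted back to a PIL image.
    batch = torch.stack([transform(ee.cpu()) for ee in examples[idx]])
    return batch.to(device), logits[idx].to(device)
```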
Now my question is the following: when we request elements from the buffer, will the transformation applied on top of the non_augmented images still be the same one that generated the corresponding logits? Since the transforms are stochastic (crop/flip), it seems to give a different example than the original transformed input that generated the logits.
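For concreteness, a tiny illustrative check on a dummy image (not the actual buffer contents) shows that two applications of the same stochastic transform almost never produce the same view:

```python
import numpy as np
import torch
from PIL import Image
from torchvision import transforms

# Two calls to the same stochastic transform on the same image almost always
# yield different tensors, so the replayed view need not match the view that
# originally produced the stored logits.
aug = transforms.Compose([transforms.RandomCrop(32, padding=4),
                          transforms.RandomHorizontalFlip(),
                          transforms.ToTensor()])
img = Image.fromarray(np.random.randint(0, 256, (32, 32, 3), dtype=np.uint8))
print(torch.equal(aug(img), aug(img)))  # almost always False
```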
It would be great if you could answer my query, thanks!!