Closed SirRob1997 closed 4 years ago
Hi, thanks for your question. The `nonzero` function in L.138 inverts the mask again, so it means masking the Top-p percentage of the samples. Does that explain it clearly? About the purpose of subtracting 1e-5: I just try to avoid some corner cases, for example when `change_vector`'s elements are all zero.
The implementation of the batching part seems quite unintuitive to me; maybe you can clear up some of my understanding:
We calculate the `before_vector` and `after_vector`, which represent the class probabilities for the correct class before and after applying the masking for certain samples inside each batch. Next, we subtract the `before_vector` from the `after_vector`, which means the entries in `change_vector` represent whether the masking makes our classifier more / less certain about the correct class for that specific sample. This is represented by negative (more) and positive (less) values inside `change_vector`.

We are only interested in the positive values, i.e. cases where masking decreases confidence, hence we calculate the threshold for Top-p according to only the positive values, as done in L.134 and L.135. Next, we check which entries are greater than our threshold in L.136, which yields a binary mask.
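To make my reading of L.134-L.136 concrete, here is a rough sketch (a paraphrase under my assumptions, not the actual repo code; the toy values and `top_p` are made up):

```python
import torch

# Made-up change_vector for 8 samples; positive entries = masking decreased
# the confidence in the correct class, as described above.
change_vector = torch.tensor([0.30, -0.10, 0.05, 0.00, 0.20, -0.02, 0.12, 0.01])
top_p = 0.25  # hypothetical fraction of the batch, not a value from the repo

# Keep only the positive changes (my reading of L.134).
positive_changes = torch.where(change_vector > 0, change_vector,
                               torch.zeros_like(change_vector))

# Top-p threshold: the value at position round(N * top_p) of the descending
# sort over the positive changes (my reading of L.135).
num_samples = change_vector.size(0)
k = int(round(num_samples * top_p))
threshold = torch.sort(positive_changes, descending=True)[0][k]

# Binary mask of samples whose confidence drop exceeds the threshold (L.136).
drop_mask = positive_changes.gt(threshold)
# drop_mask -> tensor([ True, False, False, False,  True, False, False, False])
```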
This is where my question comes in:
L.137 basically inverts the mask. So instead of reverting the masking for the Top-p percentage of samples where it decreases confidence, we are now reverting it for all samples besides the Top-p?
Am I correct on this? Why was this done? For self-challenging, applying the masking for the Top-p percentage of samples with negative values seems more intuitive.
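To make the inversion I am asking about concrete, here is a toy continuation of the sketch above (again my paraphrase, not the repo code):

```python
import torch

# drop_mask marks the Top-p samples from the sketch above.
drop_mask = torch.tensor([True, False, False, False, True, False, False, False])

# My reading of L.137: invert the binary mask (the 1 - mask step).
keep_mask = ~drop_mask

# My reading of L.138: nonzero turns the inverted mask into sample indices.
keep_idx = keep_mask.nonzero(as_tuple=False).squeeze(1)
# keep_idx -> tensor([1, 2, 3, 5, 6, 7]), i.e. every sample *except* the Top-p
# ones, which is what I mean by "reverting it for all samples besides Top-p".
```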
Also, while you're at it:
What is the purpose of subtracting 1e-5 in L.133? To me, this seems like a "threshold" (epsilon), i.e. the minimum confidence change required to keep the masking. How did the performance change without it? In theory, this would be another hyperparameter.
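For reference, the reading I have in mind (toy numbers, not the repo code) is that subtracting 1e-5 before checking for positive values is the same as requiring the confidence change to exceed 1e-5:

```python
import torch

# Made-up confidence drops; eps stands in for the 1e-5 in L.133.
change = torch.tensor([2e-6, 5e-5, -1e-3])
eps = 1e-5

with_eps = (change - eps) > 0   # tensor([False,  True, False])
as_threshold = change > eps     # identical: the drop has to exceed eps to count
```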