Closed SirRob1997 closed 4 years ago
Hi, thanks for your question. The `nonzero` function in L.138 inverts the mask again, so it means masking the Top-p percentage of the samples. Does that explain it clearly? About the purpose of subtracting 1e-5: I just try to avoid some corner cases, for example when `change_vector`'s elements are all zero.
The implementation of the batching part seems quite unintuitive to me; maybe you can clear up some of my understanding:
We calculate the `before_vector` and `after_vector`, which represent the class probabilities for the correct class before and after applying the masking for certain samples inside each batch. Next, we subtract the `before_vector` from the `after_vector`, which means the entries in `change_vector` represent whether the masking makes our classifier more / less certain about the correct class for that specific sample. This is represented by negative (more) and positive (less) values inside `change_vector`.

We are only interested in the positive values, i.e. cases where masking decreases confidence, hence we calculate the threshold for Top-p according to only the positive values, as done in L.134 and L.135. Next, we check which entries are greater than our threshold in L.136, which yields a binary mask.
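To make my reading of L.134-L.136 concrete, here is a rough sketch (a paraphrase under my assumptions, not the actual repo code; the toy values and `top_p` are made up):

```python
import torch

# Made-up change_vector for 8 samples; positive entries = masking decreased
# the confidence in the correct class, as described above.
change_vector = torch.tensor([0.30, -0.10, 0.05, 0.00, 0.20, -0.02, 0.12, 0.01])
top_p = 0.25  # hypothetical fraction of the batch, not a value from the repo

# Keep only the positive changes (my reading of L.134).
positive_changes = torch.where(change_vector > 0, change_vector,
                               torch.zeros_like(change_vector))

# Top-p threshold: the value at position round(N * top_p) of the descending
# sort over the positive changes (my reading of L.135).
num_samples = change_vector.size(0)
k = int(round(num_samples * top_p))
threshold = torch.sort(positive_changes, descending=True)[0][k]

# Binary mask of samples whose confidence drop exceeds the threshold (L.136).
drop_mask = positive_changes.gt(threshold)
# drop_mask -> tensor([ True, False, False, False,  True, False, False, False])
```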
This is where my question comes in:
L.137 basically inverts the mask. So instead of reverting the masking for the Top-p percentage of samples where it decreases confidence, we are now reverting it for all samples besides the Top-p?
Am I correct on this? Why was this done? For self-challenging, applying the masking for the Top-p percentage of samples with negative values seems more intuitive.
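To make the inversion I am asking about concrete, here is a toy continuation of the sketch above (again my paraphrase, not the repo code):

```python
import torch

# drop_mask marks the Top-p samples from the sketch above.
drop_mask = torch.tensor([True, False, False, False, True, False, False, False])

# My reading of L.137: invert the binary mask (the 1 - mask step).
keep_mask = ~drop_mask

# My reading of L.138: nonzero turns the inverted mask into sample indices.
keep_idx = keep_mask.nonzero(as_tuple=False).squeeze(1)
# keep_idx -> tensor([1, 2, 3, 5, 6, 7]), i.e. every sample *except* the Top-p
# ones, which is what I mean by "reverting it for all samples besides Top-p".
```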
Also, while you're at it:
What is the purpose of subtracting 1e-5 in L.133? To me, this seems like a "threshold" (epsilon), i.e. the minimum confidence change required to keep the masking. How did the performance change without it? In theory, this would be another hyperparameter.
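For reference, the reading I have in mind (toy numbers, not the repo code) is that subtracting 1e-5 before checking for positive values is the same as requiring the confidence change to exceed 1e-5:

```python
import torch

# Made-up confidence drops; eps stands in for the 1e-5 in L.133.
change = torch.tensor([2e-6, 5e-5, -1e-3])
eps = 1e-5

with_eps = (change - eps) > 0   # tensor([False,  True, False])
as_threshold = change > eps     # identical: the drop has to exceed eps to count
```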