Open · tom68-ll opened this issue 1 year ago
Hello, thank you very much for raising these questions. The reproduction project I worked on used the Adult dataset instead of the GLUE dataset, so the results cannot be compared directly with those reported in the original paper. The accuracy I achieved on the Adult dataset was around 80%, which does not seem to be a very good result.
Furthermore, I have just revisited the figures in the paper, and indeed, according to the diagrams, it appears that only the cross-entropy loss should be used to back-propagate gradients. However, I still have some doubts, because based on the description in the paper it seems more appropriate to compute the gradients from the total loss.
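Just to make the question concrete, here is a minimal PyTorch sketch of the two options; the model and the auxiliary losses are illustrative stand-ins, not the project's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy stand-ins for the real model and the two auxiliary losses.
model = nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
x, labels = torch.randn(8, 16), torch.randint(0, 2, (8,))

logits = model(x)
ce_loss = F.cross_entropy(logits, labels)
aux_loss_1 = logits.pow(2).mean()   # placeholder for, e.g., a contrastive term
aux_loss_2 = logits.abs().mean()    # placeholder for, e.g., an adversarial term
total_loss = ce_loss + aux_loss_1 + aux_loss_2

optimizer.zero_grad()
ce_loss.backward()        # Option A: only cross-entropy drives the update (what the figure suggests)
# total_loss.backward()   # Option B: the total loss drives the update (what the text seems to suggest)
optimizer.step()
```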
Regarding whether the gradient should be detached, I think it is a test worth trying, because different papers are not very consistent on this point; a small sketch of the stop-gradient option follows below. I am also interested in another question: may I ask how the results changed in your experiments when the two additional loss functions were added on top of the original cross-entropy loss, compared with using cross-entropy alone?
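A minimal PyTorch sketch of what detaching one branch looks like; the embeddings here are illustrative placeholders, not the project's variables:

```python
import torch
import torch.nn.functional as F

# Illustrative stand-ins for the two embeddings compared by an auxiliary term,
# e.g. a clean embedding and its perturbed counterpart.
z_clean = torch.randn(8, 32, requires_grad=True)
z_other = torch.randn(8, 32, requires_grad=True)

# Without detach: this term sends gradient into both embeddings.
loss_both = -F.cosine_similarity(z_clean, z_other).mean()

# With detach (stop-gradient on one branch, as in BYOL/MoCo-style setups):
# only z_clean receives gradient from this term.
loss_detached = -F.cosine_similarity(z_clean, z_other.detach()).mean()
loss_detached.backward()
print(z_clean.grad is not None)   # True: gradient reaches the non-detached branch
print(z_other.grad is None)       # True: the detached branch gets no gradient
```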
Yes, whether the gradient should be detached is a test worth trying. Could you recommend the papers you have seen so that I can refer to them? In my experiments I added the two additional loss functions, but I haven't run an experiment using only the cross-entropy loss. I've been busy with other tasks recently, but I will try to run that experiment as soon as possible to see how the results change; I'm also very curious about it.
Author
The experimental results using only one loss function and using three loss functions are similar.
Sorry for only just seeing your reply. I would recommend some good work on contrastive learning; it mostly comes from the CV community, e.g. the MoCo series, BYOL, SimCLR, etc., and their papers are easy to find. Regarding your observation that there is no significant improvement in the experimental results, that may not be the core objective of this paper: its experiments make clear that the method brings a significant performance improvement in the OOD scenario.
Besides that, I am still a bit confused about the gradient mentioned before. As you can see from Figure 1 of the article, the gradient is passed back to another embedding layer via the gradient on the perturbation R. Since I am not very familiar with FGSM, I am not sure how this step is implemented.
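For reference, a minimal sketch of how an FGSM-style perturbation R on an embedding layer is often implemented in PyTorch. This is only my reading of the figure; all module and variable names are illustrative rather than the paper's or project's actual code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative modules standing in for the real embedding layer and classifier.
embedding = nn.Embedding(1000, 32)
classifier = nn.Linear(32, 2)
params = list(embedding.parameters()) + list(classifier.parameters())
optimizer = torch.optim.SGD(params, lr=0.1)

token_ids = torch.randint(0, 1000, (8,))
labels = torch.randint(0, 2, (8,))
epsilon = 1e-2

# 1) Forward pass on the clean embeddings; keep the gradient w.r.t. them.
emb = embedding(token_ids)
emb.retain_grad()                      # emb is not a leaf, so ask PyTorch to keep its grad
clean_loss = F.cross_entropy(classifier(emb), labels)
clean_loss.backward(retain_graph=True)

# 2) Build the perturbation R from the sign of that gradient (the FGSM step).
r = epsilon * emb.grad.sign()

# 3) Forward pass on the perturbed embeddings; back-propagating this loss
#    sends gradient into the embedding layer through emb (R itself is detached).
adv_loss = F.cross_entropy(classifier(emb + r.detach()), labels)
optimizer.zero_grad()                  # drop the gradients accumulated in step 1
adv_loss.backward()
optimizer.step()
```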
Sorry about that; actually, I don't have a very thorough understanding of the FGSM algorithm. All the code in the project was generated with ChatGPT, so there might be some issues.
Hi author, thank you very much for the reproduction project. Could you please briefly share the results of your experiments for our reference? Besides, I think the gradient settings of the three loss functions may differ slightly from the original. As you can see from the structure diagram of the original article, the gradient seems to be back-propagated only from the original cross-entropy loss.