WindVChen / DiffAttack

An unrestricted attack based on diffusion models that can achieve both good transferability and imperceptibility.
Apache License 2.0

Question about the text perturbation. #21

Open caosip opened 4 months ago

caosip commented 4 months ago

Hello, I'm really interested in your work! However, I have some questions about the adversarial attack with text perturbation. In Table 5, the adversarial attack with perturbation on the text alone can still work, although it is not the most effective variant. Could you provide the reproduction code for this part? I'm curious about how such results are achieved, because it seems that text encoders (such as CLIP) don't inherently include semantics about adversarial noise. I would greatly appreciate your reply!

WindVChen commented 4 months ago

Hi @caosip,

Sorry for the late reply; I've been a bit busy lately.

Regarding the code for the adversarial text perturbation in Appendix C, I'll try to locate it, but please don't get your hopes up too much. It's been a while, and I may not have it anymore due to some file mismanagement. :(

However, reproducing the results should be straightforward by applying some modifications to the code available in this repository. The mechanisms behind the text perturbation are explained in Appendix C. The key point is to make the text embeddings optimizable, using something like nn.Parameter to backpropagate gradients and update the embeddings. If you need any specific details for your reproduction, feel free to ask.
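To make that concrete, here is a rough, untested sketch of the idea. The `generate` and `classifier` callables are placeholders (not functions from this repository): `generate` stands in for a differentiable pass through the diffusion sampling that conditions on the text embeddings, and `classifier` is the target model.

```python
import torch
import torch.nn as nn

def attack_text_embeddings(text_embeddings, generate, classifier, true_label,
                           num_steps=50, lr=1e-2):
    """Sketch: treat the prompt embeddings as the only trainable tensor and
    ascend the classifier's loss through the diffusion pipeline.

    `generate` and `classifier` are placeholders for the differentiable
    diffusion sampling (conditioned on encoder_hidden_states) and the
    model being attacked.
    """
    # Wrap the frozen CLIP text embeddings so they receive gradients.
    adv_text = nn.Parameter(text_embeddings.detach().clone())
    optimizer = torch.optim.Adam([adv_text], lr=lr)

    for _ in range(num_steps):
        optimizer.zero_grad()
        image = generate(adv_text)   # image conditioned on the perturbed text
        logits = classifier(image)
        # Minimizing the negative cross-entropy pushes the prediction away
        # from the true label (an untargeted attack).
        loss = -nn.functional.cross_entropy(logits, true_label)
        loss.backward()
        optimizer.step()

    return adv_text.detach()
```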

As for the comment about "text encoders (such as CLIP) not inherently including semantics about adversarial noise", I'm not entirely sure what is meant by "semantics about adversarial noise". My understanding is that since the prompt text guides the generation direction of the diffusion model, perturbing the prompt text can bias the generation direction and create images that might deceive the classifier.

Hope this helps!

caosip commented 4 months ago

Thank you @WindVChen,

I initially thought that the text perturbation was performed at the token level. I tried some discrete optimization methods, such as Greedy Coordinate Gradient (GCG)-based search, to find suitable prompts for generating adversarial examples, but they performed poorly (a sketch of what I tried is below). That is why I suspected the text encoder may not have learned a representation of "adversarial noise."
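For context, what I tried was roughly a simplified (exhaustive) variant of the GCG substitution step; `loss_fn` is a placeholder for the attack objective evaluated from token embeddings:

```python
import torch

def gcg_step(embedding_matrix, token_ids, loss_fn, top_k=64):
    """One simplified Greedy Coordinate Gradient step: rank token swaps by
    the gradient of the loss w.r.t. a one-hot token encoding, then keep the
    single substitution that lowers the loss the most."""
    vocab_size = embedding_matrix.shape[0]
    one_hot = torch.nn.functional.one_hot(token_ids, vocab_size).float()
    one_hot.requires_grad_(True)
    # Differentiable embedding lookup: (seq, vocab) @ (vocab, dim).
    loss = loss_fn(one_hot @ embedding_matrix)
    loss.backward()
    # A large negative gradient suggests swapping in that token lowers the loss.
    candidates = (-one_hot.grad).topk(top_k, dim=1).indices  # (seq, top_k)
    best_ids, best_loss = token_ids.clone(), loss.item()
    for pos in range(token_ids.shape[0]):
        for cand in candidates[pos]:
            trial = token_ids.clone()
            trial[pos] = cand
            with torch.no_grad():
                trial_loss = loss_fn(embedding_matrix[trial]).item()
            if trial_loss < best_loss:
                best_ids, best_loss = trial, trial_loss
    return best_ids, best_loss
```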

It seems that you are instead conducting the adversarial attack on the word embeddings. I'll try that approach following your paper :)

Once again, thanks for your reply!

WindVChen commented 4 months ago

Yes, the discrete nature of prompt text offers a much more limited search space for adversarial perturbations, which can result in poorer performance. If you want to perturb discrete text, one approach is to optimize the perturbation in the continuous embedding space and then project it onto the nearest hard prompt, as in the sketch below. You might find this work useful as a reference. Still, it may fail to generate effective attacks.
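A rough, untested illustration of that projection step (here `tokenizer` would be the text encoder's tokenizer):

```python
import torch

def nearest_hard_prompt(soft_embeds, embedding_matrix, tokenizer):
    """Project optimized continuous embeddings back to the nearest discrete
    tokens (by cosine similarity) to recover a readable hard prompt.

    soft_embeds:      (seq_len, dim) adversarially optimized embeddings
    embedding_matrix: (vocab_size, dim) the text encoder's token embedding table
    """
    soft = torch.nn.functional.normalize(soft_embeds, dim=-1)
    vocab = torch.nn.functional.normalize(embedding_matrix, dim=-1)
    # Cosine similarity against every vocabulary entry; take the argmax.
    token_ids = (soft @ vocab.T).argmax(dim=-1)
    return tokenizer.decode(token_ids.tolist())
```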