Closed casillas1111 closed 1 year ago
Hi @casillas1111 ,
Thanks for your attention.
The figures below display the intermediate attention maps at different resolutions (14x14, 7x7, 4x4) for a 224x224 image. For the 14x14 and 4x4 (mid-layer) resolutions, the attention maps are inaccurate: they either focus on other objects (14x14) or concentrate on nothing at all (4x4). We therefore apply our attack to the 7x7 attention map, which is more accurate and can roughly be taken as a "pseudo" recognition head of the diffusion model.
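A minimal sketch of how attention maps at a chosen resolution can be selected and averaged (the function name `aggregate_attention` follows the discussion here, but the shapes and exact signature are assumptions, not the repo's actual code):

```python
import numpy as np

def aggregate_attention(attention_maps, res=7):
    """Keep only cross-attention maps at the target resolution and
    average them over heads/layers.

    attention_maps: list of arrays with shape (heads, num_pixels, num_tokens).
    Maps at other resolutions (e.g. 14x14 or 4x4) are filtered out by
    checking num_pixels == res * res, as discussed above.
    """
    num_pixels = res ** 2
    selected = [m for m in attention_maps if m.shape[1] == num_pixels]
    stacked = np.concatenate(selected, axis=0)   # gather all matching heads
    avg = stacked.mean(axis=0)                   # average over heads/layers
    return avg.reshape(res, res, -1)             # (res, res, num_tokens)
```

With this sketch, maps whose spatial size is 14x14 (196 pixels) or 4x4 (16 pixels) are simply skipped, so only the 7x7 maps contribute to the loss.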
The division operation in get_average_attention() computes the average over multiple U-Net steps, since the attention maps are accumulated (summed) across the diffusion denoising process.
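The accumulate-then-divide pattern can be sketched as follows (a simplified illustration; the dictionary layout and function signatures are assumptions, not the repo's exact implementation):

```python
import numpy as np

def accumulate_attention(store, step_maps):
    """Sum each U-Net step's attention maps into a running store.
    Called once per denoising step with that step's maps."""
    for key, maps in step_maps.items():
        if key not in store:
            store[key] = [m.copy() for m in maps]
        else:
            store[key] = [s + m for s, m in zip(store[key], maps)]
    return store

def get_average_attention(store, cur_step):
    """Divide the accumulated sums by the step count to recover the
    mean attention map over all denoising steps so far."""
    return {key: [m / cur_step for m in maps] for key, maps in store.items()}
```

Because the store holds a running sum, dividing by `cur_step` is what turns the accumulated maps back into an average.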
Hope these can help.
Thank you for your detailed reply; it fully resolved our doubts.
Thank you for your excellent work.
We are confused about why the cross-attention loss (Ltransfer) is calculated only when item.shape[1] == num_pixels with res=7 in the aggregate_attention() function; it seems the mid-layer attention is not considered when computing the cross-attention. Also, why does item need to be divided by cur_step in the get_average_attention() function?
Looking forward to your reply, thank you very much.