The original DAAM divides the attention maps into 2 chunks, one for the text condition and one unconditional. It then uses the text-conditioned chunk to calculate the cross-attention maps.
InstructPix2Pix has 3 chunks: one for the text condition, one for the image condition, and one unconditional.
The modified DAAM checks whether the size of map_ is 24 instead of the original 16. If it is, it divides the maps into 3 chunks instead of 2, which allows the attention maps to be generated correctly.
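The size check described above can be sketched roughly as follows. This is a minimal illustration in plain Python rather than the actual DAAM code: the helper names `num_chunks_for` and `select_chunk` are hypothetical, the maps are modelled as a plain list, and which chunk index holds the text-conditioned maps is left to the caller because the ordering is not stated here.

```python
def num_chunks_for(size):
    """Return how many guidance chunks a batch of attention maps holds.

    Assumption taken from the description above: a size of 24 means the
    three-chunk InstructPix2Pix layout, anything else (originally 16)
    means the two-chunk layout.
    """
    return 3 if size == 24 else 2


def select_chunk(maps, index):
    """Split `maps` along its first axis into equal chunks and return one.

    `maps` is any sliceable sequence (a list here, a tensor in practice).
    `index` is the position of the wanted chunk; the position of the
    text-conditioned chunk is an assumption the caller has to supply.
    """
    chunks = num_chunks_for(len(maps))
    if len(maps) % chunks != 0:
        raise ValueError("map count is not divisible by the chunk count")
    size = len(maps) // chunks
    return maps[index * size:(index + 1) * size]


# Stand-in data: integers in place of real attention maps.
two_way = list(range(16))    # original layout, 2 chunks of 8
three_way = list(range(24))  # InstructPix2Pix layout, 3 chunks of 8

second_of_two = select_chunk(two_way, 1)
first_of_three = select_chunk(three_way, 0)
```

With the two-chunk assumption applied to 24 maps, each half would contain 12 maps and mix entries from different conditions, which is consistent with the incorrect maps the original code produced for InstructPix2Pix.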
Old attention maps (all maps have the same output):