Hi. How can I compute the cross-attention map for an existing image? For example, I already have an image of a person, and my text is "a man is jumping". I want to get the heat map for the word "jump".
In your code, the heat map seems to be collected during the generation process.
Thanks.