Hello, I recently implemented a cross-attention module for multi-modal fusion, but because the image resolution is very large, a CUDA OOM occurs when computing Q and K. I found your paper and hope to use your method to reduce the compute and memory consumption. Can your approach be applied to cross-attention? Is it equivalent to computing K and V of input2 in advance, and then using a weight matrix to compute Q from input1? Thank you!
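For context, here is a minimal NumPy sketch of the cross-attention pattern I have in mind, where K and V come from input2 and can be computed once and cached, while Q comes from input1 (all names and shapes are illustrative, not from your paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: input1 = query tokens, input2 = image patches
n1, n2, d_model, d_head = 4, 16, 8, 8

x1 = rng.standard_normal((n1, d_model))   # input1 (e.g. text features)
x2 = rng.standard_normal((n2, d_model))   # input2 (e.g. image features)

Wq = rng.standard_normal((d_model, d_head))
Wk = rng.standard_normal((d_model, d_head))
Wv = rng.standard_normal((d_model, d_head))

# K and V depend only on input2, so they could be precomputed and reused
K = x2 @ Wk
V = x2 @ Wv

def cross_attention(x1):
    Q = x1 @ Wq                                      # queries from input1
    scores = Q @ K.T / np.sqrt(d_head)               # (n1, n2) attention logits
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n1, d_head) fused output

out = cross_attention(x1)
print(out.shape)  # (4, 8)
```

Note the (n1, n2) score matrix is what blows up at high image resolution in my case, which is why I am asking whether your formulation avoids materializing it.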