How can I implement a function like joint image filtering?

Hi, thanks for your great work and open-source code!

In this code, the attention weights are computed from q and k of the input feature, and the feature then aggregates itself within a local window; q, k, and v are different projections of the same feature. I want to implement something like joint image filtering. I have a reference feature and a target feature: I want to compute the weights from the reference feature and then use those weights to aggregate the target feature. Namely, q and k would be different projections of the reference feature, while v would be a projection of the target feature.

How can I achieve this with the existing code? Thanks!
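
To make the request concrete, here is a minimal sketch of the behaviour described above, written in plain PyTorch rather than with this repository's CUDA kernels. Everything in it is an assumption for illustration: the class name `JointNeighborhoodAttention`, the split into a `qk` projection and a separate `v` projection, and the channels-last `(B, H, W, C)` layout are hypothetical and not taken from the repository. It also simplifies two things: there is no relative position bias, and border pixels use a centered window with out-of-image positions masked out, which may not match how the repository handles borders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointNeighborhoodAttention(nn.Module):
    """Hypothetical sketch: q, k from a reference feature, v from a target feature."""

    def __init__(self, dim, num_heads=4, kernel_size=7):
        super().__init__()
        assert dim % num_heads == 0, "dim must be divisible by num_heads"
        assert kernel_size % 2 == 1, "kernel_size must be odd"
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.kernel_size = kernel_size
        self.scale = self.head_dim ** -0.5
        self.qk = nn.Linear(dim, dim * 2)  # applied to the reference feature
        self.v = nn.Linear(dim, dim)       # applied to the target feature
        self.proj = nn.Linear(dim, dim)

    def _windows(self, x):
        # (B, H, W, C) -> (B, heads, H*W, k*k, head_dim): the k*k zero-padded
        # neighbors of every pixel, gathered with F.unfold.
        B, H, W, C = x.shape
        k = self.kernel_size
        cols = F.unfold(x.permute(0, 3, 1, 2).contiguous(), k, padding=k // 2)
        cols = cols.reshape(B, self.num_heads, self.head_dim, k * k, H * W)
        return cols.permute(0, 1, 4, 3, 2)

    def forward(self, reference, target):
        # reference, target: (B, H, W, C) with identical shapes.
        B, H, W, C = reference.shape
        k = self.kernel_size
        q, key = self.qk(reference).chunk(2, dim=-1)
        v = self.v(target)

        q = q.reshape(B, H * W, self.num_heads, self.head_dim).transpose(1, 2)
        key_w = self._windows(key)  # (B, heads, HW, k*k, d)
        v_w = self._windows(v)      # (B, heads, HW, k*k, d)

        # Weights come only from the reference feature: (B, heads, HW, 1, k*k).
        attn = (q.unsqueeze(3) * self.scale) @ key_w.transpose(-2, -1)

        # Exclude neighbors that fall outside the image (the zero padding).
        ones = torch.ones(1, 1, H, W, device=attn.device, dtype=attn.dtype)
        valid = F.unfold(ones, k, padding=k // 2)              # (1, k*k, HW)
        valid = valid.transpose(1, 2).reshape(1, 1, H * W, 1, k * k)
        attn = attn.masked_fill(valid == 0, float("-inf")).softmax(dim=-1)

        # The weights aggregate the target feature's values.
        out = (attn @ v_w).squeeze(3)                          # (B, heads, HW, d)
        out = out.transpose(1, 2).reshape(B, H, W, C)
        return self.proj(out)
```

Usage would look like `out = JointNeighborhoodAttention(dim=64)(reference, target)` with both inputs shaped `(B, H, W, 64)`. Note that `F.unfold` materializes every window, so memory grows with `kernel_size ** 2`; what I am really looking for is the equivalent using the repository's efficient kernels.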

SHI-Labs / Neighborhood-Attention-Transformer
