LeapLabTHU / Agent-Attention

Official repository of Agent Attention (ECCV2024)

application in cross transformer #12

Open XCZhou520 opened 8 months ago

XCZhou520 commented 8 months ago

Hello,

I've been exploring your work on the Cross Transformer, and I'm intrigued by the potential of integrating Agent Attention into this architecture. Agent Attention, as a method for balancing computational efficiency and representational power, seems like it could complement the Cross Transformer's design quite well.

I'm particularly interested in understanding how Agent Attention might be incorporated into the Cross Transformer framework. Specifically, my questions are:

  1. How many agent tokens would be optimal in the context of Cross Transformer, and how should they be initialized?
  2. In the existing architecture, what would be the best way to integrate agent tokens into the attention mechanism: should they replace or complement the current attention queries and keys? (A rough sketch of my current understanding follows this list.)
  3. Are there any specific considerations or potential challenges you foresee in adapting Agent Attention to this context?
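
To make the questions more concrete, here is a rough sketch of how I currently picture agent tokens fitting into a cross-attention block. All names, shapes, and the pooling-based agent initialization are my own assumptions for discussion, not taken from your code:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class AgentCrossAttention(nn.Module):
    """Hypothetical agent-style cross-attention block (my own sketch,
    not taken from the Agent-Attention codebase)."""

    def __init__(self, dim, num_heads=8, num_agents=49):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.num_agents = num_agents
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, dim * 2)
        self.out_proj = nn.Linear(dim, dim)

    def forward(self, x, context):
        # x:       (B, N, C) query tokens from one branch
        # context: (B, M, C) key/value tokens from the other branch
        B, N, C = x.shape
        M = context.shape[1]
        H, D, n = self.num_heads, self.head_dim, self.num_agents

        q = self.q_proj(x).reshape(B, N, H, D).transpose(1, 2)  # (B, H, N, D)
        k, v = self.kv_proj(context).reshape(B, M, 2, H, D).permute(2, 0, 3, 1, 4)  # each (B, H, M, D)

        # Agent tokens pooled from the queries (one possible initialization;
        # learnable agent tokens would be another option -- this relates to question 1).
        a = F.adaptive_avg_pool1d(
            q.reshape(B * H, N, D).transpose(1, 2), n
        ).transpose(1, 2).reshape(B, H, n, D)  # (B, H, n, D)

        # Step 1: agents act as queries and aggregate the context, cost O(n*M).
        agent_v = ((a @ k.transpose(-2, -1)) * self.scale).softmax(dim=-1) @ v  # (B, H, n, D)

        # Step 2: agents act as keys/values and broadcast back to the queries, cost O(N*n).
        out = ((q @ a.transpose(-2, -1)) * self.scale).softmax(dim=-1) @ agent_v  # (B, H, N, D)

        # As I understand it, the paper also adds a depthwise-conv term on v and
        # agent/position biases to restore diversity; I omit them here for brevity.
        return self.out_proj(out.transpose(1, 2).reshape(B, N, C))
```

Does something along these lines match what you would recommend, or would learnable agent tokens and a different agent count be preferable in the cross-attention setting?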

Any insights or suggestions you could provide would be greatly appreciated. I believe such an integration could offer a promising direction for further research and application.

Thank you for your time and for the impactful work you've shared with the community.

Best regards,

tian-qing001 commented 8 months ago

Hi @XCZhou520, thank you for your interest in our work. Could you clarify whether by "Cross Transformer" you mean cross attention in general or a specific work named Cross Transformer? Knowing this will let me give you more accurate and helpful information.