dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Extract context relevancy #86

Open IgnacioSan22 opened 2 months ago

IgnacioSan22 commented 2 months ago

Hi, first of all I want to congratulate you on this work. The model performs quite well considering the nature of the tasks.

I want to use the model to create video summaries. For that purpose, I think the best approach would be to determine which parts of the input video receive the highest attention or context scores, since that is what the LLM will rely on to produce the textual summary. However, I'm struggling to do so. Right now I'm working on the token generation function, but I'm unsure about my code. Could someone give me some help?

This is my current piece of code: [screenshot not preserved]
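For reference, here is a minimal sketch of the kind of thing I'm attempting, assuming a HuggingFace-style `generate` that accepts `output_attentions=True` and `return_dict_in_generate=True`, and that the video's visual tokens occupy a known contiguous slice `[vis_start, vis_end)` of the prompt (both indices are placeholders, not names from the LLaMA-VID codebase):

```python
import torch

@torch.no_grad()
def score_video_segments(model, input_ids, vis_start, vis_end,
                         num_segments=8, max_new_tokens=64):
    """Rank video segments by how much attention the generated summary
    tokens pay to their visual tokens. vis_start/vis_end are hypothetical
    markers for the slice of the prompt occupied by the video tokens."""
    out = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        output_attentions=True,
        return_dict_in_generate=True,
    )
    # out.attentions: one tuple per generated token, each holding one
    # tensor per layer of shape (batch, heads, q_len, kv_len).
    scores = torch.zeros(vis_end - vis_start)
    for step_attns in out.attentions:
        last_layer = step_attns[-1]            # (batch, heads, q, kv)
        attn = last_layer.mean(dim=1)[0, -1]   # avg heads, last query row
        scores += attn[vis_start:vis_end].float().cpu()
    # Pool per-token scores into coarse per-segment relevance and normalize.
    segments = torch.stack([c.mean() for c in scores.chunk(num_segments)])
    return segments / segments.sum()
```

The idea would be to pick the top-scoring segments and use their timestamps to drive the summary. Does reading the attentions this way look reasonable, or is there a better hook inside LLaMA-VID's generation code for this?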