dvlab-research / LLaMA-VID

Official Implementation for LLaMA-VID: An Image is Worth 2 Tokens in Large Language Models

Extract context relevancy #86

Open IgnacioSan22 opened 2 months ago

IgnacioSan22 commented 2 months ago

Hi, first of all I want to congratulate you on this work. The model performs quite well considering the nature of the tasks.

I want to use the model to create video summaries. For that purpose, I think the best approach would be to determine which parts of the input video receive the highest attention or context scores, since that is what the LLM will rely on to produce the textual summary. However, I'm struggling to do so. Right now I'm working on the token generation function, but I'm unsure about my code. Could someone give me some help?

This is my current piece of code: [screenshot not preserved]
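For reference, here is a minimal sketch of the kind of thing I'm attempting, assuming a HuggingFace-style `generate` that accepts `output_attentions=True` and `return_dict_in_generate=True`, and that the video's visual tokens occupy a known contiguous slice `[vis_start, vis_end)` of the prompt (both indices are placeholders, not names from the LLaMA-VID codebase):

```python
import torch

@torch.no_grad()
def score_video_segments(model, input_ids, vis_start, vis_end,
                         num_segments=8, max_new_tokens=64):
    """Rank video segments by how much attention the generated summary
    tokens pay to their visual tokens. vis_start/vis_end are hypothetical
    markers for the slice of the prompt occupied by the video tokens."""
    out = model.generate(
        input_ids,
        max_new_tokens=max_new_tokens,
        output_attentions=True,
        return_dict_in_generate=True,
    )
    # out.attentions: one tuple per generated token, each holding one
    # tensor per layer of shape (batch, heads, q_len, kv_len).
    scores = torch.zeros(vis_end - vis_start)
    for step_attns in out.attentions:
        last_layer = step_attns[-1]            # (batch, heads, q, kv)
        attn = last_layer.mean(dim=1)[0, -1]   # avg heads, last query row
        scores += attn[vis_start:vis_end].float().cpu()
    # Pool per-token scores into coarse per-segment relevance and normalize.
    segments = torch.stack([c.mean() for c in scores.chunk(num_segments)])
    return segments / segments.sum()
```

The idea would be to pick the top-scoring segments and use their timestamps to drive the summary. Does reading the attentions this way look reasonable, or is there a better hook inside LLaMA-VID's generation code for this?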