Gumpest / SparseVLMs

Official implementation of the paper "SparseVLM: Visual Token Sparsification for Efficient Vision-Language Model Inference", proposed by Peking University and UC Berkeley.
https://arxiv.org/pdf/2410.04417
Apache License 2.0

Intuitive! #1

Closed by Yxxxb 1 month ago

Yxxxb commented 1 month ago

It's intuitive! Have you tried it on video? The more video frames are added and the stronger the pruning, the more interesting the text guidance becomes.

Gumpest commented 1 month ago

@Yxxxb Thanks for your comments 😊.

Our paper currently reports only basic experiments under the Video-LLaVA framework. We could check whether pruning becomes more effective as the number of frames increases.

By the way, your Voco-LLaMA is great, and we have cited it in our paper.

Yxxxb commented 1 month ago

Congrats and thanks!