Closed by mvsoom 2 months ago
Hi, we have released the fine-tuned checkpoints for each video dataset: ActivityNet, Breakfast, COIN, LVU, MSRVTT, MSVD, and YouCook2. Our model can also use the pre-trained InstructBLIP weights without any fine-tuning for zero-shot evaluation on video datasets. See README.md for more details.
Thanks for the reply, I see. Amazing that this just works in plug-and-play fashion.
Are the InstructBLIP weights downloaded automatically, so that only the Vicuna weights need to be downloaded manually?
Maybe a quick example on a test video (like a captioning task) would be beneficial. If I can get it to work in the coming days (new to LAVIS), I'll submit a PR.
Hello, the demo.ipynb is now released. You can test it with your example video.
Much thanks!
Hi @boheumd, very interesting work. For off-the-shelf testing (zero-shot evaluation), do we need to load any weights other than the checkpoint from https://storage.googleapis.com/sfr-vision-language-research/LAVIS/models/InstructBLIP/instruct_blip_vicuna7b_trimmed.pth (which is loaded automatically)?
If that is correct, can you confirm the results for the LVU relationship task in the off-the-shelf setting? I get
top1: 0.00 top5: 0.00, so I might be doing something wrong.
Hi. For the LVU off-the-shelf setting, you need to include all the candidate labels in the prompt at inference time. For example, for the genre classification task, set the prompt like this: "please select the genre type of the input video from the list [action, comedy, romance, thriller]". Without the candidate answers, the task is hard and the output accuracy might be 0.
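For anyone trying this, here is a minimal sketch of how such a prompt could be built and how free-form output could be mapped back to a label before computing accuracy. The helper names `build_label_prompt` and `match_label` are hypothetical and not part of this repo or LAVIS, and the whole-word matching rule is only an assumption about how one might score the generated text.

```python
import re
from typing import List, Optional


def build_label_prompt(task: str, labels: List[str]) -> str:
    """Build a closed-set prompt that lists every candidate label."""
    return (
        f"please select the {task} of the input video "
        f"from the list [{', '.join(labels)}]"
    )


def match_label(generated: str, labels: List[str]) -> Optional[str]:
    """Return the first label that appears as a whole word in the
    generated text (case-insensitive), or None if no label appears."""
    text = generated.lower()
    for label in labels:
        if re.search(r"\b" + re.escape(label.lower()) + r"\b", text):
            return label
    return None


# Genre labels taken from the example in the reply above.
genres = ["action", "comedy", "romance", "thriller"]
prompt = build_label_prompt("genre type", genres)
print(prompt)
```

The resulting `prompt` string would then be passed as the text input wherever the evaluation currently supplies its prompt. Other LVU tasks, such as relationship, would need their own label lists.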
Any chance of releasing the weights? I currently lack the compute to train this myself. Thanks!