Closed · Yikai1Wang closed this issue 1 month ago
Hi, the code for LLaViLo is not publicly available at the moment. The anchor tokens serve the same purpose as the input query tokens of the DETR decoder: you concatenate them with your language instruction/query and video embeddings to form the LLM input, then apply the moment-localization loss to those tokens' output representations.
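Since the official code is unreleased, here is a minimal sketch of the idea described above: learnable anchor tokens (analogous to DETR decoder queries) are concatenated with text and video embeddings, the sequence is run through the backbone, and a moment head predicts a span from each anchor token's output. All names, dimensions, the toy transformer standing in for the LLM, and the plain L1 loss (real DETR-style training uses Hungarian matching) are assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class AnchorMomentSketch(nn.Module):
    """Illustrative only: anchor tokens + moment head, not LLaViLo's code."""

    def __init__(self, d_model=256, num_anchors=10):
        super().__init__()
        # Learnable anchor tokens, playing the role of DETR decoder queries.
        self.anchors = nn.Parameter(torch.randn(num_anchors, d_model))
        # Small transformer as a stand-in for the (adapter-tuned) LLM backbone.
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True),
            num_layers=2,
        )
        # Predict a normalized (center, width) moment per anchor token.
        self.moment_head = nn.Linear(d_model, 2)

    def forward(self, text_emb, video_emb):
        # text_emb: (B, L_text, D); video_emb: (B, L_video, D)
        b = text_emb.size(0)
        anchors = self.anchors.unsqueeze(0).expand(b, -1, -1)
        # Concatenate anchor tokens with language and video embeddings.
        x = torch.cat([anchors, text_emb, video_emb], dim=1)
        h = self.backbone(x)
        # Read off the backbone outputs at the anchor positions only.
        anchor_out = h[:, : self.anchors.size(0)]
        return self.moment_head(anchor_out).sigmoid()  # (B, num_anchors, 2)

# Toy usage: one sample, 5 text tokens, 20 video frame tokens.
model = AnchorMomentSketch()
pred = model(torch.randn(1, 5, 256), torch.randn(1, 20, 256))
target = torch.tensor([[[0.3, 0.2]]])  # one ground-truth (center, width)
# Stand-in loss: L1 against the first anchor's prediction. A faithful
# DETR-style setup would Hungarian-match anchors to ground-truth moments.
loss = nn.functional.l1_loss(pred[:, :1], target)
```

The key point the sketch conveys is that only the output states at the anchor positions feed the localization loss; the text and video positions just provide context through attention.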
(Original question, translated from Chinese:) Hello, I recently read your paper LLaViLo: Boosting Video Moment Retrieval via Adapter-Based Multimodal Modeling. Could you explain what the "additional anchor tokens" refer to? Is there implementation code I could consult?