When I show Video-LLaVA a short video, I give it inp = 'Could you please provide a detailed description for this video? Your comprehensive video caption should allow listeners to visualize the scene without actually watching the video. Note that the generated text tokens should not exceed 77!'
However, the caption it generates is always longer than 77 tokens. How should I change inp, or adjust the model's generation settings, so that the output meets this requirement?
(I want to feed the generated caption into CLIP afterwards, so I need to keep it within CLIP's 77-token limit.)
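For context on what I have tried: asking for a length limit in the prompt alone does not seem to impose any hard constraint on the model. Two options I am considering (this is only a sketch, assuming a Hugging Face-style `generate` API; I have not verified it against Video-LLaVA itself): cap decoding with `max_new_tokens=77`, or let the model write freely and truncate on the CLIP side with `truncation=True, max_length=77` in the CLIP tokenizer. Note that Video-LLaVA's tokenizer and CLIP's tokenizer differ, so 77 tokens in one is not 77 in the other; truncating with CLIP's own tokenizer is the only way to guarantee the limit CLIP sees. The helper below (`clip_safe_truncate` is a name I made up for illustration) shows the idea of truncating a token-id list while keeping the final EOS token, which CLIP's text encoder uses for pooling:

```python
# Sketch only: truncate a token-id sequence to CLIP's 77-token window,
# making sure the EOS token survives truncation (CLIP pools on EOS).
# In practice you would instead call the CLIP tokenizer directly, e.g.:
#   tok = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
#   ids = tok(caption, truncation=True, max_length=77).input_ids
# The function below just illustrates the truncation logic in isolation.

def clip_safe_truncate(token_ids, eos_id, max_len=77):
    """Return at most max_len token ids, ending in eos_id if one was dropped."""
    if len(token_ids) <= max_len:
        return token_ids
    truncated = list(token_ids[:max_len])
    if truncated[-1] != eos_id:
        truncated[-1] = eos_id  # overwrite the last slot so EOS is kept
    return truncated


if __name__ == "__main__":
    # Fake token ids for demonstration; 999 stands in for CLIP's EOS id.
    long_ids = list(range(100)) + [999]
    short_ids = [1, 2, 3, 999]
    print(len(clip_safe_truncate(long_ids, eos_id=999)))   # capped at 77
    print(clip_safe_truncate(short_ids, eos_id=999))       # unchanged
```

On the generation side, passing `max_new_tokens=77` to `model.generate(...)` would bound the number of tokens Video-LLaVA produces, but since those are counted in its own tokenizer, the CLIP-side truncation above is still the safer guarantee.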