THUDM / CogVLM2

GPT4V-level open-source multi-modal model based on Llama3-8B
Apache License 2.0

Question about Caption Model #195

Open zhiyuanyou opened 2 hours ago

zhiyuanyou commented 2 hours ago

Hello,

Thanks for your great work! I am trying to caption some videos with your caption model, THUDM/cogvlm2-llama3-caption. However, I get the following warning:

This is a friendly reminder - the current text generation call will exceed the model's predefined maximum length (2048). Depending on the model, you may observe exceptions, performance degradation, or nothing at all.

I know I could raise the max_position_embeddings parameter in THUDM/cogvlm2-llama3-caption/config.json to increase the predefined maximum length.

However, I am not sure whether directly changing max_position_embeddings would degrade performance.

Thanks for your time.
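For reference, here is a minimal sketch of the config.json edit described above. The path is a temporary stand-in for a locally downloaded copy of the checkpoint, used here only so the snippet is self-contained:

```python
import json
import os
import tempfile

# Demonstration only: a stand-in for a locally downloaded
# THUDM/cogvlm2-llama3-caption/config.json (hypothetical path).
cfg_path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(cfg_path, "w") as f:
    json.dump({"max_position_embeddings": 2048}, f)

# Read the config, raise the limit, and write it back.
with open(cfg_path) as f:
    cfg = json.load(f)
cfg["max_position_embeddings"] = 4096
with open(cfg_path, "w") as f:
    json.dump(cfg, f, indent=2)
```

Whether the model tolerates this edit depends on how its positions are encoded: sinusoidal/rotary frequencies can extrapolate to some degree, while a learned embedding table has no entries beyond the trained length.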

zhiyuanyou commented 2 hours ago

Thanks again for your great work! I also have three questions.

  1. How many frames per video should I input to get the best performance?
  2. Given the name max_position_embeddings, are the position embeddings sinusoidal or learned?
  3. If I change max_position_embeddings from 2048 to 4096, is any interpolation applied to obtain 4096 embeddings from the predefined 2048?

Thanks in advance.
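On question 3: editing the config does not trigger any interpolation by itself, but "position interpolation" (rescaling new positions into the trained range rather than extrapolating past it) is a common technique for extending context in rotary/sinusoidal-position models. A toy sketch with classic fixed sinusoidal embeddings, assuming sinusoidal positions, which this model may or may not actually use:

```python
import math

def sinusoidal_embedding(pos, dim=8, base=10000.0):
    """Classic fixed sine/cosine position embedding (interleaved sin, cos)."""
    emb = []
    for i in range(0, dim, 2):
        freq = 1.0 / (base ** (i / dim))
        emb.append(math.sin(pos * freq))
        emb.append(math.cos(pos * freq))
    return emb

# Position interpolation: instead of evaluating unseen positions past the
# trained limit, rescale target positions back into the trained range.
trained_len, target_len = 2048, 4096
scale = trained_len / target_len  # 0.5

pos = 3000  # a position beyond the trained 2048 limit
interpolated = sinusoidal_embedding(pos * scale)  # evaluated at position 1500.0
```

With a learned embedding table, by contrast, extending from 2048 to 4096 would require explicitly interpolating or retraining the table, since rows for the new positions simply do not exist.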