CeeZh / LLoVi

Official implementation for "A Simple LLM Framework for Long-Range Video Question-Answering"

Please provide full narration hyper-parameters #5

Open maximotus opened 2 months ago

maximotus commented 2 months ago

Hi ;)

For comparability, it would be beneficial for the community to have insight into the full hyper-parameter setups. I am especially interested in the LaViLa captioning config to use with the fair checkpoint you provide for EgoSchema. In detail, I need the following information:

  1. You say you use nucleus sampling with top_p=0.95 and keep k=5 candidates. What temperature do you use?
  2. Moreover, in the paper you say that you take the narration with the largest confidence score as the final caption. Do you mean taking the candidate with the lowest perplexity (since the LaViLa caption model returns output token ids together with perplexity values)?
  3. Besides, the paper reports temperature=0.0 for the LLMs, but the README's example commands for the summarization task use temperature=1.0. So do you use temperature=1.0 for the LLM in the summarization task and temperature=0.0 in the QA task?

Clarification would be much appreciated! :)

Cheers, Maximotus

CeeZh commented 2 months ago

Hi,

Sorry for the late reply. I have been quite overwhelmed these days.

For your questions:

  1. "You use nucleus sampling with top_p=0.95 and keep k=5 candidates. What temperature do you use?" We used temperature=0.7. Basically, we follow LaViLa's default hyper-parameter settings, except that we change caption-num-return-sequences to 5 to speed things up. (See the sketch right after this list.)

  2. "In the paper you say that you take the narration with the largest confidence score as the final caption. Do you mean the candidate with the lowest perplexity?" Thanks for pointing this out! I double-checked our captioning code and found that we were actually picking the caption with the largest perplexity. The comparisons in our paper are still fair, but this is definitely a suboptimal choice.

  3. "Do you use temperature=1.0 for the LLM in the summarization task and temperature=0.0 in the QA task?" Yes, we use temperature=1.0 only in the summarization task because we found it works better in practice. For the QA task we use temperature=0.0. (See the P.S. at the end of this comment.)
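
To make the captioning setup concrete, here is a rough sketch of the decoding and selection logic from points 1 and 2. This is not the actual LaViLa/LLoVi code; caption_model.generate and its (token_ids, perplexity) return format are assumptions for illustration only.

```python
# Illustrative sketch only -- not the actual LaViLa / LLoVi implementation.
# `caption_model.generate` is a stand-in that is assumed to return a list of
# (token_ids, perplexity) pairs, one per sampled candidate.

def caption_clip(caption_model, tokenizer, clip_features):
    candidates = caption_model.generate(
        clip_features,
        do_sample=True,            # nucleus sampling
        top_p=0.95,
        temperature=0.7,
        num_return_sequences=5,    # caption-num-return-sequences = 5
    )
    # Keep the candidate with the LOWEST perplexity (highest confidence).
    # Note: as mentioned above, the released results accidentally selected
    # the candidate with the largest perplexity instead.
    best_ids, _ = min(candidates, key=lambda cand: cand[1])
    return tokenizer.decode(best_ids)
```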

Thanks for your questions, and good luck with your future research!
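
P.S. For anyone else reading this thread, here is a rough sketch of how the two LLM temperature settings from point 3 could look as OpenAI chat completion calls. The model name, prompts, and helper names are placeholders, not our exact code.

```python
# Rough sketch of the two temperature settings, assuming the OpenAI Python
# SDK (v1+). Model name and prompts are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def summarize(captions: str) -> str:
    # Summarization stage: temperature=1.0 worked better in practice.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=1.0,
        messages=[{"role": "user",
                   "content": f"Summarize the following video captions:\n{captions}"}],
    )
    return resp.choices[0].message.content

def answer(summary: str, question: str, options: str) -> str:
    # QA stage: temperature=0.0 for (near-)deterministic answers.
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0.0,
        messages=[{"role": "user",
                   "content": f"{summary}\n\nQuestion: {question}\n"
                              f"Options:\n{options}\nAnswer with the option letter."}],
    )
    return resp.choices[0].message.content
```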


maximotus commented 1 month ago

Thank you very much for the information; it is very helpful!

May I ask why you do not provide GPT-4 results using your sum + qa pipeline (or, if you have tested it, could you share the results)? I also wonder why you used GPT-4 for NextQA and IntentQA, but without your proposed sum + qa pipeline, only with plain qa. Did you also test sum + qa on these datasets?

More insights would be much appreciated! :)

Cheers, Maximotus

CeeZh commented 1 week ago

Hi Maximotus,

Sorry for the late reply.

I tried sum + qa with GPT-4 on the EgoSchema subset and did not observe a large improvement from the sum + qa pipeline. This is probably because: 1) hparams such as num_words need more tuning, and 2) GPT-4 itself is strong enough that the extra prompting no longer brings improvements. Since GPT-4 is much more expensive than GPT-3.5 (it would cost a lot of money to make sum + qa work), we decided to focus on the GPT-3.5 experiments.

For NextQA, I tried sum + qa but did not see an improvement, so I just used the standard qa prompt. In our latest submission, we report both GPT-3.5 and GPT-4 results on NextQA. Hope this information helps you.