RLHF-V / RLAIF-V

RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

Self feedback data generation pipeline & reference model #6

Closed charismaticchiu closed 2 months ago

charismaticchiu commented 2 months ago

Hi 2 quick questions,

  1. From Algorithm 1 in the paper, I get the sense that the algorithm can run in an online divide-and-conquer manner with the updated model. I'm curious when the self-feedback code will be released.

  2. In this line the reference model is initialized identically to the policy model (same weights and dtype), so wouldn't the log-probabilities be the same and thus the loss be 0 all the time?

Thank you!

Haoye17 commented 2 months ago

Hi @charismaticchiu,

Thank you for your interest in our work! To address your questions:

  1. Self-feedback Code: We have updated our code for generating LLaVA 1.5 feedback data using OmniLMM and MiniCPM-Llama3-V 2.5. You can find detailed instructions for data generation here.
  2. DPO Optimization: Regarding the DPO optimization, the loss wraps the log-probability ratios in a log-sigmoid. When the policy and reference models are identical, the two log-ratios cancel, so the margin is 0 and the loss is -log σ(0) = log 2, which is nonzero. The gradient is also nonzero, so training proceeds normally.
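To make point 2 concrete, here is a minimal sketch of the standard DPO objective on a single preference pair (not the repo's exact implementation; the log-probability values below are made up for illustration):

```python
import math

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss for one preference pair, given sequence log-probs."""
    # Margin between policy/reference log-ratios of chosen vs. rejected
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # -log sigmoid(margin); sigmoid(0) = 0.5, so identical models give log 2
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy initialized from the reference: the ratios cancel, margin = 0
loss = dpo_loss(-12.3, -15.7, -12.3, -15.7)
print(round(loss, 4))  # 0.6931, i.e. log 2, not 0
```

As soon as the policy drifts toward the chosen response, the margin becomes positive and the loss drops below log 2, which is exactly the signal DPO optimizes.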

I hope this helps! If you have any other questions, feel free to ask.

charismaticchiu commented 2 months ago

Thanks for the quick reply! Makes sense!

Btw, where can I find the question file in this line?

Haoye17 commented 2 months ago

Hi @charismaticchiu,

You can find the question file here~

charismaticchiu commented 2 months ago

Awesome, thanks!

Also, the paper says the model is trained for 4 epochs, but the script trains for 10 epochs; was that a typo? I'm also a bit confused by the max_steps setting, which overrides num_train_epochs. Does that match the implementation details right before Sec. 3.2?

And do you mind sharing the script for using LLava-Next as supervision model?

yiranyyu commented 2 months ago
  1. We use max_steps to control the number of training/optimization steps, rather than num_train_epochs. We also ran some further experiments with this codebase before releasing it, so the default hyper-parameters in the scripts might not all match what is shown in the paper. For reproduction, we refer users to the detailed description in the paper.
  2. The scripts using LLaVA-NeXT as the labeler model are still being prepared for release; please stay tuned for updates. In the meantime, you can simply swap the inference code in the provided scripts for LLaVA-NeXT's for a quick adaptation.
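For readers unfamiliar with how max_steps relates to epochs: when max_steps is set, the trainer stops after that many optimizer steps regardless of num_train_epochs. A rough sketch of the arithmetic, with all numbers (dataset size, batch settings, GPU count) purely hypothetical:

```python
import math

# Hypothetical configuration, for illustration only
dataset_size = 85000       # number of preference pairs (assumed)
per_device_batch = 8
grad_accum = 4
num_gpus = 8
effective_batch = per_device_batch * grad_accum * num_gpus  # 256

# One optimizer step consumes one effective batch
steps_per_epoch = math.ceil(dataset_size / effective_batch)
epochs = 4
max_steps = steps_per_epoch * epochs
print(steps_per_epoch, max_steps)  # 333 1332
```

Because steps_per_epoch depends on the dataset and batch sizes, the resulting max_steps is usually not a round number.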
darkpromise98 commented 2 months ago

I want to know why max_steps is set to 2672. Does it have any special meaning?

yiranyyu commented 2 months ago

No, it has no special meaning; the number of steps per epoch just isn't a round number like 1024 or 1500. Feel free to open a new issue if you have any other questions.