RLHF-V / RLAIF-V

RLAIF-V: Aligning MLLMs through Open-Source AI Feedback for Super GPT-4V Trustworthiness

Questions Regarding the Training Data and Settings for LLaVA as Used in the Paper #15

Closed · Timsty1 closed this issue 1 month ago

Timsty1 commented 1 month ago

Hello,

Thank you for your excellent work in the multi-modal alignment area!

I’m currently trying to reproduce the results of LLaVA 1.5 7B as reported in your paper and have a few questions regarding the training data and settings.

Q1: Which part of the data is used for training LLaVA? What is the amount of training data?

The released RLAIF dataset contains a total of 83,132 samples and can be split into three parts based on the type of inference model (LLaVA, OmniLMM, MiniCPM-LLama3-V). Is all the data used for training LLaVA, or just the corresponding part?

Additionally, the global training batch size in the default training config is set to 8. According to the “trainer_state.json” file in the released weights of RLAIF-7B, an epoch contains 750 steps, suggesting the amount of training data should be around 6K, which doesn’t match the amount of the open-sourced data. Could you clarify this discrepancy?
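For reference, this is roughly how I arrived at the ~6K figure; it assumes the released checkpoint's "trainer_state.json" follows the standard Hugging Face Trainer layout (fields "max_steps" and "num_train_epochs") and uses the global batch size of 8 from the default config:

```python
import json

# Back-of-the-envelope check of the training-set size implied by the released
# checkpoint. Assumes the standard Hugging Face Trainer trainer_state.json layout.
with open("trainer_state.json") as f:
    state = json.load(f)

steps_per_epoch = state["max_steps"] / state["num_train_epochs"]  # ~750 per epoch
global_batch_size = 8  # from the default training config

print(f"Implied samples per epoch: {steps_per_epoch * global_batch_size:.0f}")  # ~6000
```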

Q2: Performance Evaluation

I evaluated the released RLAIF-7B model on Object HalBench and obtained scores of (9.6, 5.0), which are worse than the (8.5, 4.3) reported in the paper (lower is better on this benchmark). Is this a normal margin of error for the ChatGPT-based evaluation?

Looking forward to your response. Thank you!

Spring24ch commented 1 month ago

I'd like to ask: how are the logps in the data obtained?

Haoye17 commented 1 month ago

Hello @Timsty1, thank you for your interest in our work!

The open-source dataset includes the data collections used to train the different models with the RLAIF-V method. For training the LLaVA model, we used the portion of the dataset whose policy model is LLaVA-1.5-7B, i.e., parquet-006 through parquet-009. Each parquet file corresponds to one iteration of data, with approximately 5.4k data points per iteration.
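As a rough illustration of collecting that portion (the glob pattern below is illustrative; match it against the actual shard file names in the released dataset):

```python
import glob
import pandas as pd

# Collect the LLaVA-1.5-7B portion of the released dataset
# (shards parquet-006 through parquet-009, one shard per iteration).
# Adjust the pattern to the actual shard file names.
shards = sorted(glob.glob("RLAIF-V-Dataset/*-00[6-9]-*.parquet"))

for path in shards:
    df = pd.read_parquet(path)
    print(f"{path}: {len(df)} preference pairs")  # roughly 5.4k per shard
```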

Regarding the Object HalBench evaluation, we think a discrepancy of this size can occur. As you mentioned, the ChatGPT-based evaluation introduces variability, and factors such as different GPU hardware can also affect the model's inference results, leading to differences in the evaluation scores.

If you have any further questions, we are happy to assist!

Timsty1 commented 1 month ago

Hello @Haoye17 , thank you very much for your detailed response! It addressed my questions :)

Molly-3000 commented 1 week ago

Hi!

I'm having some trouble reproducing the results from the paper. Have you managed to do so successfully? @Timsty1

I trained LLaVA 1.5-7B for 4 iterations, using the parquet files sequentially from parquet-006 to parquet-009, one per iteration. However, the evaluation results of the final model are far from those of the released model.

For each iteration, I used 8xA100 GPUs with num_train_epochs=4, per_device_train_batch_size=1, gradient_accumulation_steps=1, learning_rate=5e-7, and dpo_beta=0.1, as specified in the paper.
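For completeness, the effective global batch size under this setup works out to 8, matching the global batch size in the default config mentioned earlier in the thread (this is just the standard data-parallel arithmetic, nothing RLAIF-V specific):

```python
# Effective global batch size under the setup above (standard DDP arithmetic).
num_gpus = 8
per_device_train_batch_size = 1
gradient_accumulation_steps = 1

global_batch_size = num_gpus * per_device_train_batch_size * gradient_accumulation_steps
print(global_batch_size)  # 8, matching the default training config
```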

Any additional details would be greatly appreciated. Thank you! @Haoye17