LostXine / LLaRA

LLaRA: Large Language and Robotics Assistant
Apache License 2.0

Problem about replicating results #3

Closed RobinWitch closed 3 months ago

RobinWitch commented 4 months ago

What a nice job! And the code is easy to run. However, I have some problems replicating the eval results. Here is my process.

  1. I downloaded the checkpoint llava-1.5-7b-D-inBC + Aux(B) trained on VIMA-80k from Hugging Face;

  2. Then I created a new empty directory myresults and ran the command cd eval && python3 eval-llara.py D-inBC-AuxB-VIMA-80k --model-path ../checkpoints/llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80k --prompt-mode hso --output-path ../myresults/;

  3. I also copied results/llara-result.ipynb to ./myresults;

  4. In the ./myresults directory, I ran llara-result.ipynb to get the final result, but the result is very poor; image

Steps 2 and 3 are done to generate a new JSON result.

What mistake did I make in my process? Could anyone point it out for me?

Besides, thanks to the authors for sharing the training logs. I found that the learning rate changes during training according to ./checkpoints/llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80k/trainer_state.json. Which schedule is used in training? Following your guide, the learning rate should always be 2e-05 except during the warm-up stage.

LostXine commented 4 months ago

Hi @RobinWitch

Thank you for trying out our code! I realize that my instructions in the README were incorrect and I apologize for the confusion. Given your current environment, could you navigate to the train-llava directory and run pip install -e .? Basically:

cd train-llava && pip install -e .

Could you start eval-llara.py again after this package update? The reason is that the current evaluation code is meant to work with a modified version of llava that can take multiple visual inputs, although we currently only have one image per conversation in this setting.
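(A quick way to confirm the editable install took effect, assuming the modified package keeps the upstream llava module name, is to check where Python resolves it from:)

# Sanity check (assumption: the modified package is still importable as `llava`)
import llava
print(llava.__file__)  # should point into your local LLaRA/train-llava/ tree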

Again, really sorry for the previous misleading instructions.

Regarding the learning rate, I'm checking with a co-author who conducts experiments on a different cluster and will get back to you as soon as possible. In the meantime, I can confirm that other experiments trained on VIMA-0.8k and VIMA-8k have the desired learning rate.

Thank you for your understanding.

Best regards, Xiang

LostXine commented 4 months ago

Hi @RobinWitch

I can confirm that the cosine schedule causes the learning rate behavior you observed (see the figure below). We will further clarify this in the next version of the paper and double-check all the existing experiments to ensure they have the same behavior. Thanks for pointing it out.

image

# Code to make this figure
import json
import matplotlib.pyplot as plt

with open('checkpoints/llava-1.5-7b-D-inBC-Aux-B-train-80k/trainer_state.json', 'r') as f:
    data = json.load(f)
# the final log_history entry is the end-of-training summary and has no learning_rate
lr = [i['learning_rate'] for i in data['log_history'][:-1]]
plt.plot(lr)
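For reference, here is a minimal sketch of the warmup-plus-cosine curve this corresponds to (assuming a peak LR of 2e-5 with a short linear warmup; the step counts below are illustrative, not the actual training configuration):

# Illustrative warmup + cosine decay, mirroring transformers'
# get_cosine_schedule_with_warmup (step counts are made up)
import math
import matplotlib.pyplot as plt

base_lr = 2e-5       # peak learning rate after warmup
total_steps = 1000   # illustrative
warmup_steps = 30    # illustrative

def lr_at(step):
    if step < warmup_steps:                      # linear warmup
        return base_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return base_lr * 0.5 * (1 + math.cos(math.pi * progress))  # cosine decay to 0

plt.plot([lr_at(s) for s in range(total_steps)])
plt.xlabel('step')
plt.ylabel('learning rate')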

Please let me know if you have further questions. Otherwise, feel free to close this issue.

Best regards, Xiang

RobinWitch commented 3 months ago

Hello @LostXine, thank you very much for carefully replying to my issue. After following your guidance to install train-llava, I successfully ran the eval code! The result is shown below. image There is a little decrease on L1 compared to your result (83.5% vs. 90%). Is this gap acceptable, or might I still have something wrong?

Also, I used D-inBC-text-multi-train-80k-front.json as data_path and trained for 8 epochs; it took about 160 hours on four RTX 4090s and 270 GB of RAM. Is that normal? (I'm not sure because I did not find the hardware requirements in either the paper or the git repository.)

LostXine commented 3 months ago

Hi @RobinWitch

Thanks for the follow-up!

This error is larger than I expected. Could you share your result file here so I can take a look as well? Meanwhile, as you may already know, the result JSON file records all the raw questions and answers in lm_prompt_hist and lm_answer_hist respectively. You can compare your results with my results at [hso]D-inBC-AuxB-VIMA-80k.json and see the difference. It would be great if we could fix the configuration so that all the results can be replicated perfectly.
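For instance, a minimal sketch of such a comparison (the first file name is a placeholder for your own result file; the only other assumption is that the entries of interest sit under the lm_prompt_hist / lm_answer_hist keys somewhere in the JSON structure):

# Compare two result JSON files on the recorded prompts/answers
import json

def collect(node, key, out):
    # recursively gather every value stored under `key`, so we don't
    # have to assume the exact nesting of the result file
    if isinstance(node, dict):
        for k, v in node.items():
            if k == key:
                out.append(v)
            collect(v, key, out)
    elif isinstance(node, list):
        for v in node:
            collect(v, key, out)
    return out

with open('your_result.json') as f:       # placeholder for your result file
    mine = json.load(f)
with open('[hso]D-inBC-AuxB-VIMA-80k.json') as f:
    ref = json.load(f)

for key in ('lm_prompt_hist', 'lm_answer_hist'):
    a, b = collect(mine, key, []), collect(ref, key, [])
    n_diff = sum(x != y for x, y in zip(a, b))
    print(f'{key}: {n_diff} of {min(len(a), len(b))} aligned entries differ')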

The minimum hardware requirements for training are listed at https://github.com/LostXine/LLaRA/blob/abf04533e057ad51d3fd176f14507ef601237412/train-llava/README.md?plain=1#L18-L19 and we will report them in the next version of the paper as well.

The training time is a little bit longer than I expected but it's still reasonable considering the offload overhead.

Thanks,

RobinWitch commented 3 months ago

Hello @LostXine, thanks for your quick reply. Here are my result files. I repeated the run three times: [hso]D-inBC-AuxB-VIMA-80k_1.json [hso]D-inBC-AuxB-VIMA-80k_2.json [hso]D-inBC-AuxB-VIMA-80k_3.json Only a few cases differ among the three runs, but they differ a lot from [hso]D-inBC-AuxB-VIMA-80k.json. The checkpoint I used was downloaded from Hugging Face: llava-1.5-7b-D-inBC + Aux(B)

And I apologize for overlooking the hardware requirements in your repository.

LostXine commented 3 months ago

Hi @RobinWitch ,

Thanks for the detailed info. After a quick glance, I tentatively conclude that the difference is caused by the random user_prompt used when preparing the prompt (see https://github.com/LostXine/LLaRA/blob/abf04533e057ad51d3fd176f14507ef601237412/eval/vima_utils.py#L241-L252C63). Frankly speaking, I'm very surprised to learn how big a difference it can cause, and we will definitely investigate it further later. Could you email me (xiangli8@cs.stonybrook.edu) your information so that we can acknowledge your contribution in the paper (only if you want to)?
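(As a quick check, one could pin the global RNGs before running the evaluation and see whether the run-to-run variance disappears; this is only a sketch and assumes the user_prompt sampling draws from Python's random / NumPy generators:)

# Hypothetical reproducibility check: seed the global RNGs before evaluation
# (assumes the prompt sampling in eval/vima_utils.py uses these generators)
import random
import numpy as np

SEED = 0
random.seed(SEED)
np.random.seed(SEED)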

I will be travelling this weekend and I'll try my best to get back to you with more observations next week.

Thanks,

LostXine commented 3 months ago

Hi @RobinWitch and future visitors,

I realized that I mistakenly uploaded the Aux-D model instead of the Aux-B model to the repository llava-1.5-7b-llara-D-inBC-Aux-B-VIMA-80k on Hugging Face. This issue has now been corrected, and you can download the correct files from this link.

Again sorry for the confusion. I will update the README files shortly.

Meanwhile, in the coming update, I will include an ablation study on user_prompt. Please stay tuned!

Best regards,

RobinWitch commented 3 months ago

Hello @LostXine, thank you so much! After using the new checkpoint, I can get the same results as you provided! I also used the 0.8k and 8k data to train the model from scratch, surprisingly getting even better results than in the paper! Thanks a lot again!