ChenYi99 / EgoPlan


Training Configuration and Model Performance #1

Closed. ShogoAkiyama closed this issue 6 months ago.

ShogoAkiyama commented 6 months ago

@ChenYi99 Hello,

First and foremost, I want to express my gratitude for providing such an excellent paper and code.

I am reaching out because I encountered an issue while using your provided codebase. Specifically, I have concerns regarding the correctness of the training configuration parameters.

I followed your publicly available train_config and trained the model for 10 epochs with 3000 iterations per epoch. However, when evaluating the model on the EpicKitchen dataset, the achieved accuracy was only around 44%. I also experimented with the pretrained EgoPlanVideoLLaMA weights that you have shared, which yielded approximately 54% accuracy, consistent with the results reported in the paper.

I am keen on achieving similar results to those reported in the paper. Therefore, I am reaching out to inquire whether there might be any discrepancies in the parameters or if there are any additional considerations I should take into account.

Best regards

ChenYi99 commented 6 months ago

Thanks for your interest in our work. I will check the code as soon as possible.

ChenYi99 commented 6 months ago

Hi,

I have checked my code again and rerun the experiments. The results are consistent with our paper. Could you describe your settings in more detail?

ShogoAkiyama commented 6 months ago

@ChenYi99

Thank you for taking the time to review. I appreciate your effort and time spent on this.

Upon checking on my end, it turned out that I was running torch.distributed with a world size of 1 due to computational constraints, whereas the released config assumes 8 processes, so the effective batch size differed. Therefore, I am currently re-running and verifying with gradient accumulation set to 8 (see the sketch below).
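
For anyone else hitting the same mismatch, here is a minimal sketch (my own, not code from the repo) of the relationship I mean, assuming the config's batch_size is per process and equals 2, the value I ended up using below:

```python
# Minimal sketch (mine, not from the repo): how the effective batch size
# depends on the torch.distributed world size and gradient accumulation.
def effective_batch_size(batch_per_gpu: int, world_size: int, accum_grad_iters: int) -> int:
    return batch_per_gpu * world_size * accum_grad_iters

print(effective_batch_size(2, 8, 1))  # released 8-GPU setting -> 16
print(effective_batch_size(2, 1, 1))  # my single-GPU run      -> 2
print(effective_batch_size(2, 1, 8))  # with accumulation of 8 -> 16
```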

Once the results are available, I will communicate them again.

Thank you.

ShogoAkiyama commented 6 months ago

@ChenYi99

I'm sorry to ask so many questions.

Since I only have one GPU (an A6000), I executed the code with accum_grad_iters set to 8, but the evaluation at 10 epochs still reached only about 45%. Because of this hardware limitation, I plan to try running it with 8 GPUs and nproc=8 around next week. 🙇

I'm getting the following message during execution; is this something I should be concerned about? (My guess at what triggers it is sketched after the log.)

```
Failed to load examples with video: P04_120. Will randomly sample an example as a replacement.
Failed to extract key frames: vlen(=4) < n_frms !!!
sample_id: 7097, video_id: P08_01
start_frame_idx: 34921, stop_frame_idx: 34924
current_observation_frame_idx: 35337
action_metadata: {'narration_text': 'open drawer', 'start_frame': 34921, 'stop_frame': 35088}
most_recent_actions_metadata: [{'narration_text': 'open drawer', 'start_frame': 34921, 'stop_frame': 35088}, {'narration_text': 'open dishwasher', 'start_frame': 34924, 'stop_frame': 34994}, {'narration_text': 'place chopping board', 'start_frame': 35124, 'stop_frame': 35260}, {'narration_text': 'close drawer', 'start_frame': 35267, 'stop_frame': 35337}]
```
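
From a quick read of the message, it looks like the clip between start_frame_idx and stop_frame_idx contains fewer frames than the number of key frames to sample. A hypothetical sketch of the check, based only on the log above (the function name and signature are mine, not the repo's):

```python
# Hypothetical reconstruction of the failing check, inferred from the log;
# a clip of vlen frames cannot supply n_frms key frames when vlen < n_frms.
def sample_key_frame_indices(start_idx: int, stop_idx: int, n_frms: int) -> list[int]:
    vlen = stop_idx - start_idx + 1  # number of frames in the clip
    if vlen < n_frms:
        # Too few frames; the data loader then "randomly samples an example
        # as a replacement", as the first log line says.
        raise ValueError(f"Failed to extract key frames: vlen(={vlen}) < n_frms !!!")
    # Otherwise spread n_frms indices roughly evenly over the clip.
    step = vlen / n_frms
    return [start_idx + int(i * step) for i in range(n_frms)]

# The logged case: frames 34921..34924 give vlen=4, too few when n_frms is, say, 8.
```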

Also, for reference, the training loss over the 10 epochs came out as follows. Should it go lower?

{"train_lr": "0.000", "train_loss": "1.402"}
{"train_lr": "0.000", "train_loss": "1.198"}
{"train_lr": "0.000", "train_loss": "1.199"}
{"train_lr": "0.000", "train_loss": "1.159"}
{"train_lr": "0.000", "train_loss": "1.153"}
{"train_lr": "0.000", "train_loss": "1.145"}
{"train_lr": "0.000", "train_loss": "1.127"}
{"train_lr": "0.000", "train_loss": "1.109"}
{"train_lr": "0.000", "train_loss": "1.090"}
{"train_lr": "0.000", "train_loss": "1.091"}
ChenYi99 commented 6 months ago

Hi, when you set "accum_grad_iters" to 8, did you also increase "iters_per_epoch" from 3000 to 24000? Additionally, you can ignore the message; the loss you presented is within the expected range.
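
To spell out the arithmetic behind this (my assumptions: iters_per_epoch counts data-loader iterations, accum_grad_iters groups consecutive iterations into one optimizer step, and batch_size is 2 per GPU, per the numbers in this thread):

```python
# Hedged arithmetic for matching the 8-GPU schedule on a single GPU.
samples_per_epoch = 3000 * 8 * 2           # 8-GPU config: 48000 samples per epoch

# On one GPU each iteration contributes only 2 samples, so matching the
# data seen per epoch needs 8x the iterations:
iters_per_epoch = samples_per_epoch // 2   # 24000
# accum_grad_iters=8 then reproduces the same optimizer schedule:
optimizer_steps = iters_per_epoch // 8     # 3000 steps of effective batch 16
print(iters_per_epoch, optimizer_steps)
```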

ShogoAkiyama commented 6 months ago

@ChenYi99 Thank you for the advice.

With your help, by setting batch_size=2, accum_grad_iters=8, and iters_per_epoch=24000, I was able to achieve accuracy on the EpicKitchen evaluation comparable to that reported in the paper. Thank you very much.

I have another question. Since a full run takes about 2 days, I'm considering changing to batch_size=4 and accum_grad_iters=4, and reducing iters_per_epoch to 12000. Do you think this will yield the same results? My arithmetic below suggests the two schedules are equivalent. I would appreciate any insights you might have. Apologies for the repeated questions, but thank you in advance for your assistance.
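
If my arithmetic is right (same assumptions as the sketch above), the two schedules should match exactly:

```python
# Quick equivalence check for the two single-GPU schedules discussed here.
for bs, accum, iters in [(2, 8, 24000), (4, 4, 12000)]:
    print(f"effective batch {bs * accum}, "
          f"samples/epoch {bs * iters}, "
          f"optimizer steps/epoch {iters // accum}")
# Both lines print: effective batch 16, samples/epoch 48000, steps/epoch 3000.
```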

ChenYi99 commented 6 months ago

I guess the results should be similar; you can give it a try.

ShogoAkiyama commented 6 months ago

@ChenYi99

Thank you for your reply. I will try changing the parameters and check again.

I was able to reproduce the results as described in the paper, so I would like to close this issue. Thank you very much for your prompt and courteous responses to my questions.