NVlabs / VILA

VILA - a multi-image visual language model with training, inference and evaluation recipe, deployable from cloud to edge (Jetson Orin and laptops)
Apache License 2.0
970 stars 68 forks

About VILADistributedSampler and gradient_accumulation_steps #69

Open dreamerlin opened 1 month ago

dreamerlin commented 1 month ago

If we use the VILADistributedSampler (https://github.com/Efficient-Large-Model/VILA/blob/main/llava/train/llava_trainer.py#L274-L281) for distributed training, should gradient_accumulation_steps be hardcoded to 1? I notice that when I train on 4 nodes (8 GPUs per node) with gradient_accumulation_steps set to 8, training is fast, but I suspect this behavior is abnormal.
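For context, here is a minimal sketch (not the VILA trainer itself) of how a per-rank distributed sampler and gradient_accumulation_steps typically interact in a plain PyTorch loop; `model`, `dataset`, and the batch-size values are illustrative assumptions. Accumulation only changes how often the optimizer steps, not how many samples each rank sees per epoch, so with 4 nodes × 8 GPUs = 32 ranks, a per-device batch of 4, and accumulation of 8, each optimizer step would cover 32 × 4 × 8 = 1024 samples.

```python
# Minimal sketch, assuming `model` and `dataset` are provided; not VILA's actual trainer.
import torch
from torch.utils.data import DataLoader, DistributedSampler


def train_one_epoch(model, dataset, rank, world_size, epoch,
                    per_device_batch_size=4, gradient_accumulation_steps=8, lr=1e-4):
    # Each rank sees len(dataset) / world_size samples per epoch regardless of
    # gradient_accumulation_steps; accumulation only reduces how often the
    # optimizer steps.
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
    sampler.set_epoch(epoch)
    loader = DataLoader(dataset, batch_size=per_device_batch_size, sampler=sampler)

    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    optimizer.zero_grad()

    for step, (x, y) in enumerate(loader):
        loss = torch.nn.functional.mse_loss(model(x), y)
        # Scale the loss so the accumulated gradient matches one large-batch step.
        (loss / gradient_accumulation_steps).backward()
        if (step + 1) % gradient_accumulation_steps == 0:
            optimizer.step()
            optimizer.zero_grad()
```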

yaolug commented 1 month ago

We have tried gradient_accumulation_steps values of 2 and 4, and the results seem reasonable.