NVIDIA / Megatron-LM

Ongoing research training transformer models at scale
https://docs.nvidia.com/megatron-core/developer-guide/latest/user-guide/index.html#quick-start

[QUESTION] tensor_parallel.broadcast_data and train_valid_test_datasets_provider.is_distributed = True #1125

Open KookHoiKim opened 2 months ago

KookHoiKim commented 2 months ago

In my understanding, the pretraining code broadcasts the data from TP rank 0 to the other GPUs in the same tensor-parallel group.
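For reference, this is roughly the pattern I mean (a simplified sketch of a `get_batch`-style function, not the exact Megatron-LM source; the function name and the `'text'` key are illustrative, but `tensor_parallel.broadcast_data` and `mpu.get_tensor_model_parallel_rank` are the actual APIs):

```python
import torch
from megatron.core import mpu, tensor_parallel


def get_batch_sketch(data_iterator):
    """Simplified sketch of a get_batch-style function (illustrative only).

    Only TP rank 0 is expected to pull from the data iterator; the other
    ranks in the tensor-parallel group receive the batch via broadcast.
    """
    keys = ['text']          # fields to broadcast
    datatype = torch.int64   # all broadcast fields must share this dtype

    if mpu.get_tensor_model_parallel_rank() == 0:
        data = next(data_iterator)   # dict of CPU tensors, e.g. {'text': ...}
    else:
        data = None                  # non-zero TP ranks pass None

    # broadcast_data moves the tensors to GPU on TP rank 0 and broadcasts
    # them within the tensor-model-parallel group, so every TP rank ends up
    # with an identical batch.
    data_b = tensor_parallel.broadcast_data(keys, data, datatype)
    return data_b['text']
```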

However, if I activate the option `train_valid_test_datasets_provider.is_distributed = True` while building the dataloader, the dataloader is initialized on every GPU, and the dataloaders appear to return the same data on every iteration. What does `tensor_parallel.broadcast_data` do in that case?
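For context, the dataloader-construction branching I am referring to looks roughly like this (a paraphrased sketch of the logic in the training setup code, not the exact Megatron-LM source; the function name and arguments are made up for illustration):

```python
from megatron.core import mpu


def build_data_loaders_sketch(datasets_provider, build_fn):
    """Illustrative sketch of how is_distributed changes dataset building."""
    # When the provider is marked as distributed, every rank builds its own
    # datasets; otherwise only TP rank 0 builds them, and the batch is later
    # shared with the other TP ranks via tensor_parallel.broadcast_data.
    is_distributed = getattr(datasets_provider, "is_distributed", False)

    if is_distributed or mpu.get_tensor_model_parallel_rank() == 0:
        train_ds, valid_ds, test_ds = build_fn(datasets_provider)
    else:
        train_ds, valid_ds, test_ds = None, None, None
    return train_ds, valid_ds, test_ds
```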

I am not sure I fully understand the data-broadcasting procedure, so I would be very grateful for any information about this. Thanks.

github-actions[bot] commented 3 weeks ago

Marking as stale. No activity in 60 days.