Closed Kunlun-Zhu closed 2 years ago
DistributedDataloader is in ModelCenter; it is not a necessary part of BMTrain's design. It just makes fine-tuning more convenient.
Thanks for the response. May I ask what you mean by "more convenient"? Wouldn't it affect the training results or time in one way or another?
Like torch.utils.data.distributed.DistributedSampler: different processes read different parts of the data for data-parallel training (ZeRO-3 can be viewed as a form of data parallelism). The DistributedDataloader class just wraps it.
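To make the sharding concrete, here is a minimal pure-Python sketch of the idea behind `torch.utils.data.distributed.DistributedSampler` (the helper name `shard_indices` is illustrative, not a real PyTorch or BMTrain API): each rank takes every `world_size`-th index, so the ranks cover disjoint parts of the dataset.

```python
# Illustrative sketch only: shard_indices is a made-up helper, not an API.
def shard_indices(num_samples, rank, world_size):
    """Return the dataset indices assigned to one data-parallel rank."""
    return list(range(rank, num_samples, world_size))

# With 8 samples and 2 ranks, each process reads a disjoint half.
print(shard_indices(8, rank=0, world_size=2))  # [0, 2, 4, 6]
print(shard_indices(8, rank=1, world_size=2))  # [1, 3, 5, 7]
```

Together the ranks see every sample exactly once per epoch, which is what makes data-parallel training equivalent to single-GPU training on the full dataset.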
Thanks for the further information. So, without the DistributedSampler, the dataloader will send the same batch of data to each process or GPU? Then using the DistributedSampler should be much faster? I am currently facing the problem that after using BMTrain, training is 3x or 4x slower with the same batch size. Is that normal, or is something wrong in my program?
So, without the DistributedSampler, the dataloader will send the same batch of data to each process or GPU?
yes.
I am currently facing the problem that after using BMTrain, training is 3x or 4x slower with the same batch size.
What is the "3x or 4x slower" being compared to? In our ModelCenter/examples
folder, there are bash scripts to run the examples; the batch size specified in each script is the per-GPU batch size. With a small number of GPUs, switching from AdamOffloadOptimizer to AdamOptimizer gives a large speedup. With a larger number of GPUs, AdamOffloadOptimizer consumes less GPU memory than AdamOptimizer without costing much time. It is a tradeoff.
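Since the scripts set a per-GPU batch size, the effective batch per optimizer step scales with the number of data-parallel GPUs. This tiny sketch (the function name is made up for illustration) shows the arithmetic that makes a single-GPU baseline with "the same batch size" not an apples-to-apples comparison:

```python
# Illustrative helper, not a BMTrain API.
def global_batch_size(per_gpu_batch_size, num_gpus):
    """Effective samples per optimizer step across all data-parallel ranks."""
    return per_gpu_batch_size * num_gpus

# A per-GPU batch of 32 on 4 GPUs is an effective batch of 128.
print(global_batch_size(32, 4))  # 128
```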
Comparing with the program without BMTrain, which trains on one GPU only. So does only the AdamOffloadOptimizer use the ZeRO-3 algorithm? May I also ask whether DistributedParameter itself lowers GPU memory or accelerates the program?
So does only the AdamOffloadOptimizer use the ZeRO-3 algorithm?
Not only. DistributedModule and DistributedParameter are both for ZeRO-3, but they behave like a normal model when there is only one GPU. CheckpointBlock is another component that lowers GPU memory at a slight cost in speed. I am just saying that, in my experience, AdamOffloadOptimizer can cause the big slowdown with only one GPU.
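For intuition, here is a toy pure-Python sketch of the ZeRO-3 idea behind DistributedParameter (the `partition` and `all_gather` helpers are stand-ins, not real BMTrain or torch.distributed calls): each rank stores only a slice of every parameter, which lowers per-GPU memory, and a gather step reassembles the full tensor right before it is used.

```python
# Illustrative helpers only; real ZeRO-3 uses collective communication.
def partition(param, rank, world_size):
    """Slice a flat parameter so each rank stores ~1/world_size of it."""
    chunk = (len(param) + world_size - 1) // world_size
    return param[rank * chunk:(rank + 1) * chunk]

def all_gather(shards):
    """Stand-in for the communication step that rebuilds the parameter."""
    full = []
    for shard in shards:
        full.extend(shard)
    return full

param = list(range(10))
shards = [partition(param, r, 2) for r in range(2)]
print([len(s) for s in shards])     # [5, 5]
print(all_gather(shards) == param)  # True
```

The memory saving comes from each rank holding only its shard between uses; the gather is extra communication, which is part of why such techniques trade speed for memory.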
The 4x training time compares the program with BMTrain on 4 GPUs against the program without BMTrain on one GPU. We were expecting a speedup here. The model itself is quite simple ("TransE"); would a more complicated model such as BERT see the speedup from using more GPUs with BMTrain?
TransE is too simple to benefit from BMTrain. BERT would have comparable speed but lower GPU requirements. Even larger models would benefit more from BMTrain.
Thanks a lot for the information; that's all I needed to know.
Hi, is a DistributedDataloader design necessary for BMTrain to deliver its acceleration, or does BMTrain itself handle the optimization of both memory and speed?