OpenBMB / BMTrain

Efficient Training (including pre-training and fine-tuning) for Big Models
Apache License 2.0

About DistributedDataloader #27

Closed · Kunlun-Zhu closed this issue 2 years ago

Kunlun-Zhu commented 2 years ago

Hi, is a DistributedDataloader necessary to work with BMTrain for acceleration, or does BMTrain itself handle the optimization of both memory and speed?

Achazwl commented 2 years ago

DistributedDataloader is in ModelCenter, and it is not a required part of BMTrain. It just makes fine-tuning more convenient.

Kunlun-Zhu commented 2 years ago

> DistributedDataloader is in ModelCenter, and it is not a required part of BMTrain. It just makes fine-tuning more convenient.

Thanks for the response. May I further ask what you mean by "more convenient"? Wouldn't it affect the training results or time in one way or another?

Achazwl commented 2 years ago

Like torch.utils.data.distributed.DistributedSampler, different processes read different parts of the data for data-parallel training (ZeRO-3 can be viewed as a form of data parallelism). The DistributedDataloader class just wraps it.
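For context, here is a minimal sketch of that per-rank sharding using plain PyTorch; the toy dataset, batch size, and environment-variable handling are placeholders, and ModelCenter's DistributedDataloader wraps the same idea for you:

```python
import os
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.distributed import DistributedSampler

# Toy dataset; in practice this would be your training set.
dataset = TensorDataset(torch.arange(1024, dtype=torch.float32))

# Rank/world size taken from the launcher's environment (torchrun sets these);
# passing them explicitly avoids requiring torch.distributed to be initialized here.
rank = int(os.environ.get("RANK", 0))
world_size = int(os.environ.get("WORLD_SIZE", 1))

sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank, shuffle=True)
loader = DataLoader(dataset, batch_size=32, sampler=sampler)

for epoch in range(2):
    sampler.set_epoch(epoch)  # different shuffle each epoch
    for (batch,) in loader:
        pass                  # each rank only sees its own 1/world_size shard of the data
```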

Kunlun-Zhu commented 2 years ago

> Like torch.utils.data.distributed.DistributedSampler, different processes read different parts of the data for data-parallel training (ZeRO-3 can be viewed as a form of data parallelism). The DistributedDataloader class just wraps it.

Thanks for the further information. So, without the DistributedSampler, the dataloader would send the same batch of data to every process or GPU? Then using the DistributedSampler should be much faster? I am currently facing the problem that, after switching to BMTrain, training takes 3x or 4x longer with the same batch size. Is that normal, or is something wrong in my program?

Achazwl commented 2 years ago

> So, without the DistributedSampler, the dataloader would send the same batch of data to every process or GPU?

Yes.

> I am currently facing the problem that, after switching to BMTrain, training takes 3x or 4x longer with the same batch size.

What is the "3x or 4x slower" being compared against? In our ModelCenter/examples folders there are bash scripts to run the examples; the batch size specified in each script is the per-GPU batch size. With a small number of GPUs, switching AdamOffloadOptimizer to AdamOptimizer gives a large speedup. With more GPUs, AdamOffloadOptimizer consumes less GPU memory than AdamOptimizer while not costing much extra time. It is a tradeoff.
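As a hedged sketch of that choice (the `bmtrain.optim` module path is assumed from the BMTrain codebase, and the learning rate is a placeholder):

```python
import bmtrain as bmt

def build_optimizer(model, offload: bool):
    """Pick between the two Adam variants discussed above.

    `model` is assumed to be a BMTrain model (e.g. a bmt.DistributedModule)
    built elsewhere; the lr value is a placeholder.
    """
    if offload:
        # Optimizer states live on the CPU: lower GPU memory, but extra
        # CPU<->GPU traffic each step (most noticeable with few GPUs).
        return bmt.optim.AdamOffloadOptimizer(model.parameters(), lr=1e-4)
    # Optimizer states stay on the GPU: faster steps, more GPU memory.
    return bmt.optim.AdamOptimizer(model.parameters(), lr=1e-4)
```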

Kunlun-Zhu commented 2 years ago

> > So, without the DistributedSampler, the dataloader would send the same batch of data to every process or GPU?
>
> Yes.
>
> > I am currently facing the problem that, after switching to BMTrain, training takes 3x or 4x longer with the same batch size.
>
> What is the "3x or 4x slower" being compared against? In our ModelCenter/examples folders there are bash scripts to run the examples; the batch size specified in each script is the per-GPU batch size. With a small number of GPUs, switching AdamOffloadOptimizer to AdamOptimizer gives a large speedup. With more GPUs, AdamOffloadOptimizer consumes less GPU memory than AdamOptimizer while not costing much extra time. It is a tradeoff.

I'm comparing with the program without BMTrain, which trains on one GPU only. So does only AdamOffloadOptimizer use the ZeRO-3 algorithm? May I also ask whether DistributedParameter itself lowers GPU memory or accelerates the program?

Achazwl commented 2 years ago

> So does only AdamOffloadOptimizer use the ZeRO-3 algorithm?

Not only. DistributedModule and DistributedParameter are both for ZeRO-3, but they act like a normal model when there is only one GPU. CheckpointBlock is another component; it lowers GPU memory and slightly lowers speed. I am just saying that, in my experience, AdamOffloadOptimizer can cause a big slowdown when there is only one GPU.
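For reference, a rough sketch of where each component sits; the class names follow BMTrain's public API, but the layer sizes and init method are placeholders, not code from this thread:

```python
import torch
import torch.nn.functional as F
import bmtrain as bmt

bmt.init_distributed()  # set up the communication needed for ZeRO-3 sharding

class FeedForward(bmt.DistributedModule):
    """Parameters declared as DistributedParameter are sharded across ranks."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_in = bmt.DistributedParameter(
            torch.empty(hidden, dim), init_method=torch.nn.init.xavier_normal_)
        self.w_out = bmt.DistributedParameter(
            torch.empty(dim, hidden), init_method=torch.nn.init.xavier_normal_)

    def forward(self, x):
        return F.linear(torch.relu(F.linear(x, self.w_in)), self.w_out)

# CheckpointBlock adds activation checkpointing on top of the sharding:
# it saves GPU memory at the cost of recomputing activations in backward.
block = bmt.CheckpointBlock(FeedForward(1024, 4096))
bmt.init_parameters(block)  # materialize and initialize the sharded parameters
```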

Kunlun-Zhu commented 2 years ago

> > So does only AdamOffloadOptimizer use the ZeRO-3 algorithm?
>
> Not only. DistributedModule and DistributedParameter are both for ZeRO-3, but they act like a normal model when there is only one GPU. CheckpointBlock is another component; it lowers GPU memory and slightly lowers speed. I am just saying that, in my experience, AdamOffloadOptimizer can cause a big slowdown when there is only one GPU.

The 4x training time compares the program with BMTrain on 4 GPUs against the program without BMTrain on one GPU. We were expecting a speedup here. The model itself is pretty simple ('TransE'); would a more complicated model such as BERT see a speedup when using more GPUs with BMTrain?

Achazwl commented 2 years ago

TransE is too simple to benefit from BMTrain. BERT would have comparable speed but lower GPU requirements. Even larger models would benefit more from BMTrain.

Kunlun-Zhu commented 2 years ago

> TransE is too simple to benefit from BMTrain. BERT would have comparable speed but lower GPU requirements. Even larger models would benefit more from BMTrain.

Thanks a lot for the information, that's all I need to know.