ShadowTeamCN opened this issue 3 months ago
This seems unrelated to quantization or ModelOpt. To test this hypothesis, can you try calling `calibrate_loop()` before the `mtq.quantize` step? I think you will see the same error when running `calibrate_loop` without any quantization.
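For reference, the check could look roughly like this (a minimal sketch; `model`, `calib_dataloader`, and the chosen quantization config are placeholders for your own setup):

```python
import modelopt.torch.quantization as mtq

def calibrate_loop(model):
    # Plain forward passes over a small calibration set; no labels or backprop.
    for batch in calib_dataloader:  # placeholder dataloader
        model(**batch)

# 1) Run the loop on the un-quantized model first. If the error already
#    appears here, it is unrelated to modelopt.
calibrate_loop(model)

# 2) Only then hand the same loop to mtq.quantize as the forward_loop.
model = mtq.quantize(model, mtq.INT8_DEFAULT_CFG, forward_loop=calibrate_loop)
```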
Yes, you are right, it truly has no relationship with ModelOpt, so I will close this issue. Besides that, it is also very weird:
https://github.com/huggingface/transformers/issues/32021
I've found the above issue, which is similar to mine, and I'm also using torchrun with a DeepSpeed ZeRO-3 config to launch my script. I'm wondering whether the official QAT (Quantization Aware Training) pipeline is compatible with DeepSpeed ZeRO-3?
@ShadowTeamCN We have not yet tested the QAT example with the DeepSpeed backend. I will test it once I get a chance; however, this might be next week because of some other commitments.
In the meantime, can you please check whether QAT works for you with the accelerate backend?
Certainly. I attempted to use Accelerate with DeepSpeed ZeRO-3, but encountered the same issues. Subsequently, I switched to ZeRO-2 and it worked. Once the training process is complete, I plan to test the remaining QAT pipeline.
And when I switch to ZeRO-2, I cannot manage to train a 70B-sized model due to restricted resources.
Hi @ShadowTeamCN,
This is because DeepSpeed initializes the weights at a late stage, right before training. The solutions we currently have are:
Option 1: Insert the calibration code in the trainer, after the DeepSpeed initialization and before checkpoint loading.
Option 2: Implement a post-completion hook for `accelerator._prepare_deepspeed`, if you don't want to modify HF transformers code (see the sketch below). I will share more details later on.
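A rough illustration of Option 2 (this wraps a private accelerate method, so details may differ across versions; `run_calibration` is a hypothetical user-provided callback):

```python
import accelerate

_orig_prepare_deepspeed = accelerate.Accelerator._prepare_deepspeed

def _prepare_deepspeed_then_calibrate(self, *args):
    # Let DeepSpeed finish its preparation first so the weights are materialized.
    result = _orig_prepare_deepspeed(self, *args)
    # Hypothetical hook: run mtq.quantize / the calibration loop on the
    # prepared model here, before any training step runs.
    run_calibration()
    return result

accelerate.Accelerator._prepare_deepspeed = _prepare_deepspeed_then_calibrate
```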
Thank you for the reply; both options are fine for me. I also found another method myself: calling trainer.evaluate before the calibration loop, because evaluate initializes DeepSpeed. However, I ultimately failed due to OOM. I don't know how much additional memory QAT needs, or whether the OOM was just caused by my incorrect usage.
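Roughly like this (the Trainer setup, quantization config, and calibration loop here are just placeholders):

```python
from transformers import Trainer
import modelopt.torch.quantization as mtq

trainer = Trainer(model=model, args=training_args,            # placeholder setup
                  train_dataset=train_ds, eval_dataset=eval_ds)

trainer.evaluate()   # forces DeepSpeed to initialize/materialize the weights
mtq.quantize(trainer.model, quant_cfg, forward_loop=calibrate_loop)  # placeholder cfg/loop
trainer.train()      # QAT fine-tuning -- this is where the OOM shows up
```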
Using DeepSpeed together with ModelOpt causes a memory leak under some circumstances; we are still investigating.
@ShadowTeamCN Adding `gc.collect()` after every iteration fixes the memory leak. Take the following as a simple example:
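(A minimal sketch of the pattern; the model and dataloader are placeholders.)

```python
import gc
import torch

def calibrate_loop(model, calib_dataloader):
    """Forward-only calibration loop with explicit garbage collection."""
    model.eval()
    with torch.no_grad():
        for batch in calib_dataloader:
            model(**batch)
            # Collect Python-level garbage after every iteration to avoid the
            # DeepSpeed + ModelOpt memory growth discussed above.
            gc.collect()
            torch.cuda.empty_cache()  # optionally also release cached CUDA blocks
```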