Aonai-Lin opened this issue 3 weeks ago
When gradient_checkpointing=True is set, the problem mentioned above occurs.
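For reference, this is roughly how gradient checkpointing is enabled in our run. This is only a minimal sketch, not the actual train.py: the tiny GPT-2 model stands in for the MLLM, and the ds_zero3.json path is a placeholder for our ZeRO-3 config.

from transformers import GPT2Config, GPT2LMHeadModel, TrainingArguments

# Tiny placeholder model standing in for the actual MLLM.
model = GPT2LMHeadModel(GPT2Config(n_layer=2, n_head=2, n_embd=64))
model.gradient_checkpointing_enable()  # same effect as gradient_checkpointing=True below

training_args = TrainingArguments(
    output_dir="./output",
    per_device_train_batch_size=1,
    gradient_checkpointing=True,    # setting this flag is what triggers the failure under ZeRO-3
    # deepspeed="ds_zero3.json",    # placeholder: path to the ZeRO-3 config used for training
)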
Hi, could you provide more information about your environment versions (especially deepspeed) and the contents of the error?
deepspeed version: 0.12.6
transformers version: 4.37.0
The contents of the error (the same traceback is printed by each rank):
Traceback (most recent call last):
  File "/dc-hl/dai.guan/code/mhl_mllm/train/train.py", line 198, in <module>
    main()
  File "/dc-hl/dai.guan/code/mhl_mllm/train/train.py", line 189, in main
    trainer.train(training_args.resume_from_checkpoint)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    ...
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1098, in overlapping_partition_gradients_reduce_epilogue
    self.independent_gradient_partition_epilogue()
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1075, in independent_gradient_partition_epilogue
    self.__reduce_and_partition_ipg_grads()
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1188, in __reduce_and_partition_ipg_grads
    assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)
AssertionError
You can try installing deepspeed==0.12.2 (pip install deepspeed==0.12.2). We have found that deepspeed==0.12.6 is not stable for training.
Alright, I'll go give it a try. Thanks a lot!
When training the MLLM we want to unfreeze the Oryx ViT. We are using DeepSpeed ZeRO-3, and we run into issues with gradient backpropagation.
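To be concrete about the unfreezing step, a minimal sketch; the vision_tower attribute name is a guess and depends on the actual Oryx/MLLM definition:

import torch.nn as nn

def unfreeze(module: nn.Module) -> None:
    # Mark every parameter in the module as trainable so gradients flow through it.
    for p in module.parameters():
        p.requires_grad = True

# Usage (hypothetical attribute name):
# unfreeze(model.vision_tower)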