Oryx-mllm / Oryx

MLLM for On-Demand Spatial-Temporal Understanding at Arbitrary Resolution
https://oryx-mllm.github.io

How to train the ViT? #18

Open Aonai-Lin opened 3 weeks ago

Aonai-Lin commented 3 weeks ago

When training the MLLM, we want to unfreeze the Oryx ViT. We're using DeepSpeed ZeRO-3, and gradient backpropagation fails.

[Screenshot: 20241101-150723]

Aonai-Lin commented 3 weeks ago

The problem above occurs when gradient_checkpointing=True.

liuzuyan commented 2 weeks ago

Hi, could you provide more information about your environment versions (especially DeepSpeed) and the full error message?
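For anyone gathering the details requested above, here is a minimal sketch of one way to collect package versions. It uses only the standard library. The package list is an assumption based on the libraries involved in this kind of training setup, and `collect_versions` is a hypothetical helper name, not part of Oryx.

```python
from importlib import metadata

# Assumed list of relevant packages; adjust it to match your environment.
PACKAGES = ("deepspeed", "transformers", "accelerate", "torch")


def collect_versions(packages=PACKAGES):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in collect_versions().items():
        print(f"{name}: {version or 'not installed'}")
```

Reading the installed metadata avoids importing the packages themselves, so the script also works in an environment where an import would fail.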

Aonai-Lin commented 2 weeks ago

deepspeed version: 0.12.6
transformers version: 4.37.0
The contents of the error (every process prints the same traceback):

Traceback (most recent call last):
  File "/dc-hl/dai.guan/code/mhl_mllm/train/train.py", line 198, in <module>
    main()
  File "/dc-hl/dai.guan/code/mhl_mllm/train/train.py", line 189, in main
    trainer.train(training_args.resume_from_checkpoint)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/transformers/trainer.py", line 1539, in train
    return inner_training_loop(
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/transformers/trainer.py", line 1869, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/dc-hl/dai.guan/code/mhl_mllm/train/trainer.py", line 334, in training_step
    return super().training_step(model, inputs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/transformers/trainer.py", line 2777, in training_step
    self.accelerator.backward(loss)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/accelerate/accelerator.py", line 1847, in backward
    self.deepspeed_engine_wrapped.backward(loss, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/accelerate/utils/deepspeed.py", line 167, in backward
    self.engine.backward(loss, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1981, in backward
    self.allreduce_gradients()
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1902, in allreduce_gradients
    self.optimizer.overlapping_partition_gradients_reduce_epilogue()
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1098, in overlapping_partition_gradients_reduce_epilogue
    self.independent_gradient_partition_epilogue()
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1075, in independent_gradient_partition_epilogue
    self.__reduce_and_partition_ipg_grads()
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 15, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/yisha.chen/.conda/envs/cys1/lib/python3.10/site-packages/deepspeed/runtime/zero/stage3.py", line 1188, in __reduce_and_partition_ipg_grads
    assert len(set(p.ds_id for p in self.params_in_ipg_bucket)) == len(self.params_in_ipg_bucket)
AssertionError
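The assertion at the end of the traceback requires every parameter in `params_in_ipg_bucket` to have a distinct `ds_id`, so it fails when the same parameter is in the bucket more than once. The sketch below reproduces only that condition in plain Python, without DeepSpeed. `FakeParam`, `bucket_has_unique_ids`, and `duplicated_ids` are hypothetical names used for illustration. The thread does not establish why a parameter is queued twice. The only reported trigger is gradient_checkpointing=True with the ViT unfrozen.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FakeParam:
    """Hypothetical stand-in for a ZeRO-3 parameter; only ds_id matters here."""
    ds_id: int


def bucket_has_unique_ids(params_in_ipg_bucket):
    """Same condition as the assert in the traceback, returned as a bool."""
    return len(set(p.ds_id for p in params_in_ipg_bucket)) == len(params_in_ipg_bucket)


def duplicated_ids(params_in_ipg_bucket):
    """List the ds_id values that occur more than once, in first-seen order."""
    seen, dupes = set(), []
    for p in params_in_ipg_bucket:
        if p.ds_id in seen and p.ds_id not in dupes:
            dupes.append(p.ds_id)
        seen.add(p.ds_id)
    return dupes


ok_bucket = [FakeParam(0), FakeParam(1), FakeParam(2)]
bad_bucket = [FakeParam(0), FakeParam(1), FakeParam(1)]  # param 1 queued twice

print(bucket_has_unique_ids(ok_bucket))   # True
print(bucket_has_unique_ids(bad_bucket))  # False
print(duplicated_ids(bad_bucket))         # [1]
```

A helper like `duplicated_ids` could help identify which parameters are affected when debugging, though it would need to be adapted to real DeepSpeed parameter objects.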

liuzuyan commented 2 weeks ago

You can try installing deepspeed==0.12.2. We find that deepspeed==0.12.6 is not stable for training.
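As a rough sketch of how a training script could check for the version discussed above before starting, the snippet below compares the installed DeepSpeed version with the two version numbers from this thread. The helper names are hypothetical, and the check is an illustration rather than part of Oryx or DeepSpeed.

```python
from importlib import metadata

# Versions taken from this thread: 0.12.6 was reported as unstable for
# training by the maintainer, and 0.12.2 was the suggested replacement.
REPORTED_UNSTABLE = "0.12.6"
SUGGESTED = "0.12.2"


def deepspeed_advice(installed):
    """Return a short message for an installed version string (or None)."""
    if installed is None:
        return "deepspeed is not installed"
    if installed == REPORTED_UNSTABLE:
        return (
            f"deepspeed {installed} was reported unstable; "
            f"try pip install deepspeed=={SUGGESTED}"
        )
    return f"deepspeed {installed}: no report in this thread"


def installed_deepspeed_version():
    """Look up the installed version without importing deepspeed itself."""
    try:
        return metadata.version("deepspeed")
    except metadata.PackageNotFoundError:
        return None


print(deepspeed_advice(installed_deepspeed_version()))
```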

Aonai-Lin commented 2 weeks ago

Alright, I'll go give it a try. Thanks a lot!