JunZhan2000 opened this issue 1 year ago
Hi. You could check the Accelerate docs on Hugging Face.
For single-GPU: python train.py
For multi-GPU: accelerate launch --multi_gpu train.py
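In case it helps, here is a minimal sketch of what an Accelerate training script usually looks like; the model, optimizer, and data below are placeholders, not the actual train.py:

# Minimal Accelerate training sketch (placeholder model/data, not the repo's code).
import torch
from accelerate import Accelerator

accelerator = Accelerator()  # picks up single- vs multi-GPU setup from the launcher

model = torch.nn.Linear(10, 1)                       # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
dataset = torch.utils.data.TensorDataset(torch.randn(64, 10), torch.randn(64, 1))
dataloader = torch.utils.data.DataLoader(dataset, batch_size=8)

# prepare() wraps everything for DDP / mixed precision as configured at launch time.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for x, y in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    accelerator.backward(loss)  # use this instead of loss.backward()
    optimizer.step()

Launched with python train.py this runs on a single device; the same script run with accelerate launch --multi_gpu train.py becomes data-parallel.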
Thanks, I will try it. Could you help me with this issue? https://github.com/Qiyuan-Ge/PaintMind/issues/8
Hi, I used the following command to train on 4 GPUs:
python -m torch.distributed.launch --nproc_per_node 4 --use_env train_vit_vqgan.py
At first it worked fine, but every time the last batch of an epoch was reached I got the following error. Changing the size of the dataset always gave the same result. Have you trained on multiple GPUs before?
/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/autograd/__init__.py:200: UserWarning: Error detected in NativeBatchNormBackward0. Traceback of forward call that caused the error:
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/train_vit_vqgan.py", line 36, in <module>
    trainer.train()
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/utils/trainer.py", line 192, in train
    real_pred = self.discr(img)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1156, in forward
    output = self._run_ddp_forward(*inputs, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1110, in _run_ddp_forward
    return module_to_run(*inputs[0], **kwargs[0])  # type: ignore[index]
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 636, in forward
    return model_forward(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/utils/operations.py", line 624, in __call__
    return convert_to_fp32(self.model_forward(*args, **kwargs))
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/amp/autocast_mode.py", line 14, in decorate_autocast
    return func(*args, **kwargs)
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/stage1/discriminator.py", line 60, in forward
    return self.model(input)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/container.py", line 217, in forward
    input = module(input)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/modules/batchnorm.py", line 171, in forward
    return F.batch_norm(
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/nn/functional.py", line 2450, in batch_norm
    return torch.batch_norm(
 (Triggered internally at ../torch/csrc/autograd/python_anomaly_mode.cpp:114.)
  Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
Traceback (most recent call last):
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/train_vit_vqgan.py", line 36, in <module>
    trainer.train()
  File "/cpfs01/projects-SSD/cfff-173661e84712_SSD/public/zhanjun/visiontokenizer/PaintMind/paintmind/utils/trainer.py", line 197, in train
    self.accelerator.backward(d_loss)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/accelerate/accelerator.py", line 1985, in backward
    loss.backward(**kwargs)
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/_tensor.py", line 487, in backward
    torch.autograd.backward(
  File "/home/stx_19110240005/miniconda3/envs/vit-vqgan2/lib/python3.9/site-packages/torch/autograd/__init__.py", line 200, in backward
    Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: one of the variables needed for gradient computation has been modified by an inplace operation: [torch.cuda.FloatTensor [512]] is at version 8; expected version 7 instead. Hint: the backtrace further above shows the operation that failed to compute its gradient. The variable in question was changed in there or anywhere later. Good luck!
Hi. I'm really sorry for replying so late. First, yes, I have trained on multiple GPUs before. For your second question, here is the reply from GPT-4: The error message you're seeing is related to PyTorch's autograd engine, which is responsible for performing the backward pass and computing gradients. The error specifically indicates that one of the variables required for gradient computation has been modified by an in-place operation after its creation, which is not allowed in PyTorch because it interferes with the tracking of operations for gradient computation.
The error message provides a hint: "The variable in question was changed in there or anywhere later. Good luck!" This suggests that the problematic variable is being modified somewhere after its creation, either within the operation that failed to compute its gradient or somewhere later in your code.
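For illustration only (this is not taken from the PaintMind code), a tiny generic snippet reproduces the same class of error:

import torch

x = torch.randn(4, requires_grad=True)
y = torch.exp(x)    # exp's backward pass needs its own output y
y.add_(1)           # in-place edit bumps y's version counter
y.sum().backward()  # RuntimeError: ... modified by an inplace operation ...

Replacing y.add_(1) with y = y + 1 makes the error go away, which is the general shape of the fix described below.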
Here are some steps you could take to troubleshoot this issue:
Search for inplace operations: In your code, search for any inplace operations that might be modifying variables after their creation. Inplace operations in PyTorch are usually denoted by an underscore at the end of the method name, like add_(), zero_(), copy_(), etc.
Disable inplace operations for debugging: As a debugging step, you could try temporarily disabling inplace operations in your code to see if the error goes away. If it does, this confirms that an inplace operation is the problem, and you can then focus on figuring out which one it is and how to avoid it (see the sketch after this list).
Ensure that all operations are part of the computational graph: If you're using operations that are not part of PyTorch's computational graph, like operations from NumPy or Python's standard library, ensure that these are not modifying any PyTorch tensors in place.
Use torchviz to visualize the computation graph: The torchviz library provides a way to visualize the computation graph, which can be helpful for understanding the flow of data and operations in your model. This might help you identify where the problematic inplace operation is occurring.
Upgrade your PyTorch version: Sometimes, this kind of problem can be caused by bugs in the PyTorch autograd engine itself. If you're not using the latest version of PyTorch, consider upgrading to see if the problem goes away. Make sure to check the PyTorch release notes to see if any relevant bugs were fixed in more recent versions.
Remember, when modifying your code, the goal is to ensure that any variable that is part of the computational graph is not modified in place after its creation. If this is not possible due to the requirements of your model, you might need to rethink your model's architecture to avoid the need for inplace operations.
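To make the anomaly-detection and torchviz suggestions concrete, here is a small generic sketch; it assumes the torchviz package (and the Graphviz binary) are installed, and none of it is the repository's actual training code:

import torch
from torchviz import make_dot  # pip install torchviz; also needs the Graphviz binary

# Anomaly mode records the forward-pass traceback of the op whose backward fails,
# which points directly at the problematic in-place modification.
torch.autograd.set_detect_anomaly(True)

model = torch.nn.Sequential(torch.nn.Linear(8, 8), torch.nn.BatchNorm1d(8))
x = torch.randn(4, 8)
loss = model(x).sum()

# Render the computation graph to inspect how tensors flow through the model.
make_dot(loss, params=dict(model.named_parameters())).render("graph", format="png")

loss.backward()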
You could also add my contact (I guess you use WeChat) if you want to keep in touch with me.
I also had this error. I passed track_running_stats=False to the norm layers in the discriminator and it seems to run fine. However, I think this will adversely impact model performance, so ideally some other fix is found. Also, I noticed there is no conversion to SyncBatchNorm. Is that intentional? I may be wrong, but I believe this will yield incorrect batch statistics (or at least not compute them across all GPU processes).
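For what it's worth, a sketch of how that conversion could be done before accelerator.prepare; the discriminator below is just a placeholder standing in for the real one, and I have not tested how this interacts with the track_running_stats workaround:

import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder discriminator containing BatchNorm (stands in for the real model).
discriminator = torch.nn.Sequential(
    torch.nn.Conv2d(3, 64, 4, stride=2, padding=1),
    torch.nn.BatchNorm2d(64),
    torch.nn.LeakyReLU(0.2),
)

# Swap every BatchNorm*d for SyncBatchNorm so running statistics are
# synchronized across all DDP processes instead of being computed per GPU.
discriminator = torch.nn.SyncBatchNorm.convert_sync_batchnorm(discriminator)

discriminator = accelerator.prepare(discriminator)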
Hello, thank you very much for your work. Could you provide example code for multi-GPU or multi-node training?