alvinliu0 / HumanGaussian

[CVPR 2024 Highlight] Code for "HumanGaussian: Text-Driven 3D Human Generation with Gaussian Splatting"
https://alvinliu0.github.io/projects/HumanGaussian
MIT License

An error is raised when training the model with two 3090s. #23

Open gushengbo opened 6 months ago

gushengbo commented 6 months ago

Initializing distributed: GLOBAL_RANK: 1, MEMBER: 2/2
[INFO] ----------------------------------------------------------------------------------------------------
distributed_backend=nccl
All distributed processes registered. Starting with 2 processes

[INFO] LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0,1]
[INFO] LOCAL_RANK: 1 - CUDA_VISIBLE_DEVICES: [0,1]

Both ranks then print the same traceback:

Traceback (most recent call last):
  File "/home/shengbo/HumanGaussian-main/launch.py", line 239, in <module>
    main(args, extras)
  File "/home/shengbo/HumanGaussian-main/launch.py", line 182, in main
    trainer.fit(system, datamodule=dm, ckpt_path=cfg.resume)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 544, in fit
    call._call_and_handle_interrupt(
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/call.py", line 43, in _call_and_handle_interrupt
    return trainer.strategy.launcher.launch(trainer_fn, *args, trainer=trainer, **kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 105, in launch
    return function(*args, **kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 580, in _fit_impl
    self._run(model, ckpt_path=ckpt_path)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/trainer/trainer.py", line 963, in _run
    self.strategy.setup(self)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 171, in setup
    self.configure_ddp()
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 283, in configure_ddp
    self.model = self._setup_model(self.model)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/pytorch_lightning/strategies/ddp.py", line 195, in _setup_model
    return DistributedDataParallel(module=model, device_ids=device_ids, **self._ddp_kwargs)
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 678, in __init__
    self._log_and_throw(
  File "/home/shengbo/anaconda3/envs/humangs/lib/python3.9/site-packages/torch/nn/parallel/distributed.py", line 1037, in _log_and_throw
    raise err_type(err_msg)
RuntimeError: DistributedDataParallel is not needed when a module doesn't have any parameter that requires a gradient.
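For context on what the error itself means: DistributedDataParallel refuses to wrap a module in which no registered parameter has requires_grad=True, which is what Lightning runs into here when it wraps the system for two GPUs. Below is a minimal, self-contained sketch that reproduces the same RuntimeError; the single-process gloo group and the dummy nn.Linear model are illustrative assumptions only, not part of HumanGaussian or threestudio.

```python
# Minimal sketch: DDP raises the same RuntimeError when the wrapped module
# has no parameter with requires_grad=True. Everything below is a toy repro,
# not HumanGaussian code.
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # Single-process process group, just so DDP can be constructed.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("gloo", rank=0, world_size=1)

    model = torch.nn.Linear(4, 4)
    for p in model.parameters():
        p.requires_grad_(False)  # leave no trainable parameters

    # Raises: RuntimeError: DistributedDataParallel is not needed when a module
    # doesn't have any parameter that requires a gradient.
    DDP(model)

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

So the crash is not specific to the 3090s: as soon as Lightning's DDP strategy wraps a system whose trainable state is optimized outside the module's registered nn.Parameters (as appears to be the case for the Gaussian attributes here), this check fires. If multi-GPU training is not strictly needed, restricting the run to one GPU (e.g. CUDA_VISIBLE_DEVICES=0) avoids the DDP wrapper entirely.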