Open Haobo-Liu opened 2 days ago
In the file depthswin train_run_config.py, you may need to modify the function get_pl_config so that it targets a single GPU. Here is my version:

def get_pl_config():
    from heal_swin.training.train_config import PLConfig

    return PLConfig(
        max_epochs=1000,
        gpus=1,
        accelerator="",
        gradient_clip_val=0,
        gradient_clip_algorithm="norm",
    )
Hi,
I couldn't reproduce your error; it works for me. For training on a single GPU, you should set gpus=1 in PLConfig, as @ZhuYi2000 wrote. Did that change things? Otherwise, could you check which tensor is on which device?
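One way to answer the "which tensor is on which device" question is to group the model's registered parameters and buffers by device. The sketch below is only an illustration: `tensors_by_device` and `report_devices` are hypothetical helper names that do not exist in HEAL-SWIN, and the code assumes the object passed in behaves like a `torch.nn.Module` (it only relies on `named_parameters()` and `named_buffers()`).

```python
from collections import defaultdict


def tensors_by_device(module):
    """Group the names of registered parameters and buffers by device.

    `module` is assumed to be a torch.nn.Module; only its
    named_parameters() and named_buffers() methods are used.
    """
    groups = defaultdict(list)
    for name, param in module.named_parameters():
        groups[str(param.device)].append(name)
    for name, buf in module.named_buffers():
        groups[str(buf.device)].append(name)
    return dict(groups)


def report_devices(module):
    """Print one summary line per device found in `module`."""
    for device, names in sorted(tensors_by_device(module).items()):
        print(f"{device}: {len(names)} tensors, e.g. {names[:3]}")
```

Calling `report_devices(self)` from inside the failing module's `forward` would show whether any registered tensor was left on the CPU. Tensors stored as plain attributes, or created inside `forward` itself, are not registered and will not appear here; for those, printing `.device` on each argument just before the failing call is the more direct check.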
Hello, when I run "python3 run.py --env local train --config_path=heal_swin/run_configs/depth_estimation/depthswin train_run_config.py" on a single 3090 GPU, I get the following error:
41.3 M    Trainable params
0         Non-trainable params
41.3 M    Total params
165.115   Total estimated model params size (MB)
Validation sanity check: 0%| | 0/2 [00:00<?, ?it/s]/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
(the UserWarning above is printed four more times)
Exception detected, logging run as killed in MLFlow...
Traceback (most recent call last):
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/train.py", line 297, in <module>
    main()
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/train.py", line 285, in main
    train_model(
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/train.py", line 228, in train_model
    trainer.fit(model, dm)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
    self.run_evaluation()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 322, in validation_step
    return self.model(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 57, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_lightning/depth_estimation/model_lightning_depth_swin.py", line 127, in validation_step
    loss, preds = self.shared_step(batch, self.val_metrics)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_lightning/depth_estimation/model_lightning_depth_swin.py", line 144, in shared_step
    outputs = self(imgs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_lightning/depth_estimation/model_lightning_depth_swin.py", line 91, in forward
    outputs = self.model(x.float())
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 1126, in forward
    x, x_downsample = self.forward_features(x)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 1076, in forward_features
    x = layer(x)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 626, in forward
    x = blk(x)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 381, in forward
    attn_windows = self.attn(x_windows, mask=self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 167, in forward
    logit_scale = torch.clamp(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)
Could you give me some advice? Thanks a lot!
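The last frame of the traceback fails inside torch.clamp, with the max argument on the CPU while the input is on cuda:0. The arguments of that call are not visible in the traceback, so the following is only a guess at the pattern: if the bound is built with something like torch.tensor(...), it is created on the CPU by default. A minimal sketch of a device-independent variant passes the bound as a plain Python float instead. The name `clamped_logit_scale` and the bound log(1 / 0.01) are assumptions for illustration (the value is borrowed from the Swin Transformer V2 reference code), not taken from HEAL-SWIN.

```python
import math

import torch

# Assumed upper bound on the log-scale, log(1 / 0.01); not read from HEAL-SWIN.
MAX_LOG_SCALE = math.log(1.0 / 0.01)


def clamped_logit_scale(logit_scale: torch.Tensor) -> torch.Tensor:
    """Clamp a learnable log-scale from above and exponentiate it.

    The bound is a Python float, so no second tensor is created and the
    result stays on whatever device `logit_scale` lives on.
    """
    return torch.clamp(logit_scale, max=MAX_LOG_SCALE).exp()


# Usage: runs on the GPU when one is available, otherwise on the CPU.
device = "cuda" if torch.cuda.is_available() else "cpu"
scale = clamped_logit_scale(torch.full((4, 1, 1), 10.0, device=device))
print(scale.device, scale.shape)
```

If the bound has to remain a tensor, creating it with `device=logit_scale.device` (or registering it as a buffer so it moves with the module) would serve the same purpose.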