JanEGerken / HEAL-SWIN

Reference implementation of the spherical vision transformer HEAL-SWIN

Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! #7

Open · Haobo-Liu opened this issue 2 days ago

Haobo-Liu commented 2 days ago

Hello, when I run `python3 run.py --env local train --config_path=heal_swin/run_configs/depth_estimation/depthswin train_run_config.py` on a single 3090 GPU, I get the following error:

```
41.3 M    Trainable params
0         Non-trainable params
41.3 M    Total params
165.115   Total estimated model params size (MB)
Validation sanity check:   0%|          | 0/2 [00:00<?, ?it/s]
/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torchvision/transforms/functional.py:1603: UserWarning: The default value of the antialias parameter of all the resizing transforms (Resize(), RandomResizedCrop(), etc.) will change from None to True in v0.17, in order to be consistent across the PIL and Tensor backends. To suppress this warning, directly pass antialias=True (recommended, future default), antialias=None (current default, which means False for Tensors and True for PIL), or antialias=False (only works on Tensors - PIL will still use antialiasing). This also applies if you are using the inference transforms from the models weights: update the call to weights.transforms(antialias=True).
  warnings.warn(
[the same torchvision antialias UserWarning is printed four more times]
Exception detected, logging run as killed in MLFlow...
Traceback (most recent call last):
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/train.py", line 297, in <module>
    main()
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/train.py", line 285, in main
    train_model(
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/train.py", line 228, in train_model
    trainer.fit(model, dm)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 458, in fit
    self._run(model)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 756, in _run
    self.dispatch()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 797, in dispatch
    self.accelerator.start_training(self)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 96, in start_training
    self.training_type_plugin.start_training(trainer)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 144, in start_training
    self._results = trainer.run_stage()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 807, in run_stage
    return self.run_train()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 842, in run_train
    self.run_sanity_check(self.lightning_module)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1107, in run_sanity_check
    self.run_evaluation()
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 962, in run_evaluation
    output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 174, in evaluation_step
    output = self.trainer.accelerator.validation_step(args)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 226, in validation_step
    return self.training_type_plugin.validation_step(*args)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/ddp.py", line 322, in validation_step
    return self.model(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1519, in forward
    else self._run_ddp_forward(*inputs, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1355, in _run_ddp_forward
    return self.module(*inputs, **kwargs)  # type: ignore[index]
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/pytorch_lightning/overrides/base.py", line 57, in forward
    output = self.module.validation_step(*inputs, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_lightning/depth_estimation/model_lightning_depth_swin.py", line 127, in validation_step
    loss, preds = self.shared_step(batch, self.val_metrics)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_lightning/depth_estimation/model_lightning_depth_swin.py", line 144, in shared_step
    outputs = self(imgs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_lightning/depth_estimation/model_lightning_depth_swin.py", line 91, in forward
    outputs = self.model(x.float())
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 1126, in forward
    x, x_downsample = self.forward_features(x)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 1076, in forward_features
    x = layer(x)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 626, in forward
    x = blk(x)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 381, in forward
    attn_windows = self.attn(x_windows, mask=self.attn_mask)  # nW*B, window_size*window_size, C
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/liusu/anaconda3/envs/heal_swin/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/liusu/projects/depth_estimation/HEAL-SWIN/heal_swin/models_torch/swin_transformer.py", line 167, in forward
    logit_scale = torch.clamp(
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu! (when checking argument for argument max in method wrapper_CUDA_clamp_Tensor)
```

Could you give me some advice? Thanks a lot!
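For context: the traceback ends in the `torch.clamp` call on `logit_scale` in `swin_transformer.py`, and the message says the `max` argument is on the CPU while the other tensor is on `cuda:0`. The snippet below is not the repository's code, just a minimal sketch with made-up names of how a Swin-V2-style logit-scale clamp can trigger exactly this error, and two common ways to avoid it; the actual cause in HEAL-SWIN may differ.

```python
import torch
import torch.nn as nn


class WindowAttentionSketch(nn.Module):
    """Hypothetical illustration of the failure mode, not HEAL-SWIN's actual module."""

    def __init__(self, num_heads: int = 4):
        super().__init__()
        # Learnable per-head logit scale; moves to the GPU together with the module.
        self.logit_scale = nn.Parameter(torch.log(10 * torch.ones(num_heads, 1, 1)))
        # BUG: a plain tensor attribute is NOT moved by .to("cuda") and stays on the CPU.
        self.max_logit = torch.log(torch.tensor(1.0 / 0.01))
        # FIX (one option): register it as a buffer so .to()/.cuda() move it along:
        # self.register_buffer("max_logit", torch.log(torch.tensor(1.0 / 0.01)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Raises "Expected all tensors to be on the same device" once the module
        # is on cuda:0 while self.max_logit is still on the CPU.
        scale = torch.clamp(self.logit_scale, max=self.max_logit).exp()
        # FIX (alternative, at the call site): move the bound to the right device:
        # scale = torch.clamp(self.logit_scale, max=self.max_logit.to(x.device)).exp()
        return x * scale.mean()
```

If the upper bound of the clamp is held as a plain tensor attribute rather than a parameter or buffer, it stays on the CPU when the model is moved to the GPU, which matches the error message above.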

ZhuYi2000 commented 1 day ago

In the file `depthswin train_run_config.py`, you may need to modify the function `get_pl_config` for single-GPU training. Here is my version:

```python
def get_pl_config():
    from heal_swin.training.train_config import PLConfig

    return PLConfig(
        max_epochs=1000,
        gpus=1,
        accelerator="",
        gradient_clip_val=0,
        gradient_clip_algorithm="norm",
    )
```
JanEGerken commented 1 day ago

Hi, I couldn't reproduce your error, it works for me... For training on a single GPU you should set `gpus=1` in the `PLConfig`, as @ZhuYi2000 wrote. Did that change things? Otherwise, could you check which tensor is on which device?
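Regarding that last question, a generic way to see which parts of a model ended up on which device is to walk its parameters, buffers, and plain tensor attributes. The snippet below is a general-purpose sketch, not part of the HEAL-SWIN API; the `model` name in the usage comment is a placeholder.

```python
import torch


def report_devices(module: torch.nn.Module) -> None:
    """Print which devices a module's parameters, buffers and plain tensor attributes live on."""
    devices = {p.device for _, p in module.named_parameters()}
    devices |= {b.device for _, b in module.named_buffers()}
    print("parameter/buffer devices:", devices)

    # Plain tensor attributes (assigned as `self.foo = torch.tensor(...)`) are not
    # moved by .to()/.cuda() and are a common source of this error.
    for mod_name, sub in module.named_modules():
        for attr, value in vars(sub).items():
            if isinstance(value, torch.Tensor):
                print(f"{mod_name or '<root>'}.{attr}: plain tensor on {value.device}")


# Hypothetical usage, e.g. right before the failing forward pass:
#   report_devices(model)
```

Any tensor reported on `cpu` while everything else sits on `cuda:0` would be the natural suspect for the `max` argument of the failing `torch.clamp` call.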