RozDavid / LanguageGroundedSemseg

Implementation for ECCV 2022 paper Language-Grounded Indoor 3D Semantic Segmentation in the Wild
98 stars 14 forks

Tensors must be CUDA and dense #3

Closed by yhyang-myron 1 year ago

yhyang-myron commented 1 year ago

When I train with `source scripts/text_representation_train.sh`, I always get a RuntimeError:

RuntimeError: Tensors must be CUDA and dense

RozDavid commented 1 year ago

Hey @believexx,

Thanks for opening the issue. Could you please be more specific about the error? E.g. which module throws it, at which line, whether you are using a pretrained module, what your system context is, and anything else that might help backtrace this problem.

Cheers, David

yhyang-myron commented 1 year ago

Thank you very much for your reply! Here is my output log:

```
Traceback (most recent call last):
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/xhliu/LanguageGroundedSemseg/main.py", line 210, in <module>
    main()
  File "/home/xhliu/LanguageGroundedSemseg/main.py", line 201, in main
    trainer.fit(pl_module, ckpt_path=config.resume)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 696, in fit
    self._call_and_handle_interrupt(
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 648, in _call_and_handle_interrupt
    return self.strategy.launcher.launch(trainer_fn, *args, trainer=self, **kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/strategies/launchers/subprocess_script.py", line 93, in launch
    return function(*args, **kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 735, in _fit_impl
    results = self._run(model, ckpt_path=self.ckpt_path)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1166, in _run
    results = self._run_stage()
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1252, in _run_stage
    return self._run_train()
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1274, in _run_train
    self._run_sanity_check()
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1343, in _run_sanity_check
    val_loop.run()
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/loops/dataloader/evaluation_loop.py", line 155, in advance
    dl_outputs = self.epoch_loop.run(self._data_fetcher, dl_max_batches, kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/loops/loop.py", line 200, in run
    self.advance(*args, **kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 143, in advance
    output = self._evaluation_step(**kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/loops/epoch/evaluation_epoch_loop.py", line 240, in _evaluation_step
    output = self.trainer._call_strategy_hook(hook_name, *kwargs.values())
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 1704, in _call_strategy_hook
    output = fn(*args, **kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/pytorch_lightning/strategies/ddp.py", line 358, in validation_step
    return self.model(*args, **kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1110, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 955, in forward
    self._sync_buffers()
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1602, in _sync_buffers
    self._sync_module_buffers(authoritative_rank)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1606, in _sync_module_buffers
    self._default_broadcast_coalesced(authoritative_rank=authoritative_rank)
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1627, in _default_broadcast_coalesced
    self._distributed_broadcast_coalesced(
  File "/home/xhliu/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/parallel/distributed.py", line 1543, in _distributed_broadcast_coalesced
    dist._broadcast_coalesced(
RuntimeError: Tensors must be CUDA and dense
```

RozDavid commented 1 year ago

Hey,

So the task fails before even touching parts of my code, and is most probably a PyTorch Lightning multi-GPU DDP error. After a quick Google search I found the same error reported here. Please refer to the potential solutions provided in that thread.
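For readers hitting the same error: the traceback ends in DDP's buffer synchronization (`dist._broadcast_coalesced`), which requires every registered buffer to be a dense tensor (and, under DDP, on the GPU). Below is a minimal sketch, not from this repo, of one common cause and fix: a module that registers a sparse buffer, and a helper that densifies buffers before wrapping in DDP. The module and helper names are hypothetical.

```python
import torch
import torch.nn as nn


class BufferedModule(nn.Module):
    """Hypothetical module with a sparse registered buffer.

    A buffer like this would trip DDP's broadcast with
    "RuntimeError: Tensors must be CUDA and dense".
    """

    def __init__(self):
        super().__init__()
        self.register_buffer("anchor", torch.eye(3).to_sparse())


def densify_buffers(module: nn.Module) -> None:
    """Convert any sparse buffers to dense so DDP can broadcast them."""
    for name, buf in module.named_buffers(recurse=False):
        if buf.is_sparse:
            # nn.Module.__setattr__ updates the registered buffer in place.
            setattr(module, name, buf.to_dense())
    for child in module.children():
        densify_buffers(child)


m = BufferedModule()
densify_buffers(m)
assert not m.anchor.is_sparse  # now safe for DDP buffer sync
# In a real multi-GPU run you would also move the model to the device
# (m.cuda()) before the DDP wrapper is created; Lightning normally does
# this for you, so a CPU-resident buffer usually points at a buffer
# created after setup, or at a sparse tensor like the one above.
```

If converting the buffer to dense is not an option, DDP can also be constructed with `broadcast_buffers=False` to skip this synchronization entirely; whether that is acceptable depends on whether the buffers must stay in sync across ranks.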

Regards, David

P.S. Closing the issue now, as it is not related to this project. Feel free to reopen if something still fails locally.