Hi! Are you running with the DP accelerator? You are on 1.1.4, which is quite old, and we have seen this error before. Please upgrade to the latest version if you can; otherwise, please provide a repro example.
Here in your code I see a suspicious line:
total_loss.to(self.args.device)
Please check that this is the correct device.
(The device in Lightning is available with self.device.)
Hey, thanks for the quick response. I added the .to(self.args.device) to make sure it wasn't the loss that was causing the issue (it wouldn't make much sense for it to be... but I still thought it was worth a check, because I'm unsure what the items in torch.stack(items) are).
I'm not using any accelerator; the Trainer is as follows:
trainer = pl.Trainer(max_steps=args.total_training_steps, gradient_clip_val=5,
val_check_interval=250, limit_val_batches=200, callbacks=[early_stop_callback])
OK, so I've updated to 1.2.7. I now get a different error during the validation sanity check (which could perhaps have been the reason the original error in the OP was thrown). However, I'm not sure why it's complaining, since I've put both the model and the data on the GPU:
File "main.py", line 258, in <module>
trainer.fit(trainVQG, data_loader, val_data_loader)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
self.run_sanity_check(self.lightning_module)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in run_sanity_check
_, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 725, in run_evaluation
output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 166, in evaluation_step
output = self.trainer.accelerator.validation_step(args)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 177, in validation_step
return self.training_type_plugin.validation_step(*args)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 131, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
File "main.py", line 85, in validation_step
loss, kld = self(batch)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "main.py", line 48, in forward
loss = self.model(images, question_ids, question_attention_masks, input_ids, input_attention_masks)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/machine_drive/nihir-vqg/model.py", line 33, in forward
images = self.image_projection(images).unsqueeze(1) # [B, D]
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
return F.linear(input, self.weight, self.bias)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: Tensor for 'out' is on CPU, Tensor for argument #1 'self' is on CPU, but expected them to be on GPU (while checking arguments for addmm)
The important lines are:
File "main.py", line 85, in validation_step
loss, kld = self(batch)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "main.py", line 48, in forward
loss = self.model(images, question_ids, question_attention_masks, input_ids, input_attention_masks)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/machine_drive/nihir-vqg/model.py", line 33, in forward
images = self.image_projection(images).unsqueeze(1) # [B, D]
So, as we can see, the model is on the device (which is the GPU):
args = parser.parse_args()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
args.device = device
trainVQG = TrainVQG(args, tokenizer).to(device)
All the data has been put on the device too:
def forward(self, batch):
images, question_ids, question_attention_masks, input_ids, input_attention_masks = batch["images"], batch[
"question_ids"], batch["question_attention_masks"], batch["input_ids"], batch["input_attention_masks"]
images, question_ids, question_attention_masks, input_ids, input_attention_masks = images.to(self.args.device), question_ids.to(
self.args.device), question_attention_masks.to(self.args.device), input_ids.to(self.args.device), input_attention_masks.to(self.args.device)
print(images)
loss = self.model(images, question_ids, question_attention_masks, input_ids, input_attention_masks)
return loss
That print(images) you see there confirms that the device is cuda:0. And from the line images = self.image_projection(images).unsqueeze(1) # [B, D], self.image_projection is just an nn.Sequential() module with a linear layer and a batch norm:
self.image_projection = nn.Sequential(
    nn.Linear(512, 768),
    nn.BatchNorm1d(768, momentum=0.01)
)
You don't need to move the data or model to GPU (and you shouldn't). Lightning takes care of that. Use the gpus Trainer argument :)
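For example, based on the Trainer call shown above, something like this should be enough (a sketch; gpus=1 assumes a single-GPU machine):
trainer = pl.Trainer(max_steps=args.total_training_steps, gradient_clip_val=5,
                     val_check_interval=250, limit_val_batches=200,
                     callbacks=[early_stop_callback],
                     gpus=1)  # Lightning moves the model and each batch to the GPU for you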
What is this magic 🤯. How come explicitly calling .to(device) on the model and data didn't work, but setting gpus in the Trainer did?
Model is training now! Thanks for your rapid help - let's see if it still crashes after it completes an epoch.
Same error at the end of the epoch:
Traceback (most recent call last):
File "main.py", line 256, in <module>
trainer.fit(trainVQG, data_loader, val_data_loader)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
self.train_loop.run_training_epoch()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 557, in run_training_epoch
self.on_train_epoch_end(epoch_output)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 802, in on_train_epoch_end
self.trainer.logger_connector.on_train_epoch_end()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 373, in on_train_epoch_end
self.cached_results.has_batch_loop_finished = True
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 386, in has_batch_loop_finished
self.auto_reduce_results_on_epoch_end()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 376, in auto_reduce_results_on_epoch_end
hook_result.auto_reduce_results_on_epoch_end()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 183, in auto_reduce_results_on_epoch_end
outputs = type(time_reduced_outputs[0]).reduce_on_epoch_end(time_reduced_outputs)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 535, in reduce_on_epoch_end
recursive_stack(result)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 677, in recursive_stack
result[k] = collate_tensors(v)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 699, in collate_tensors
return torch.stack(items)
RuntimeError: All input tensors must be on the same device. Received cpu and cuda:0
@nihirv please have a look at our bug report model. If you can share a minimal example, that would help us fix this faster. And a quick tip: you can set Trainer(limit_train_batches=X) so you don't have to wait 14 hours for the epoch to finish :)
Apologies for the delayed response - I've been busy with EMNLP. But I've figured out where the bug is coming from! This is very weird because I have another codebase which uses exactly the same logic, and it works perfectly fine there.
You can see the code here: https://github.com/nihirv/controllable-vqg/tree/lightning_branch. I've set up the data loader to create dummy examples, so once you clone it, all you need to do is run this command: python3 main.py --num_warmup_steps 0.
Here's how the code works and where the bug is coming from. After a certain number of warmup steps, I want to start training a latent variable model (VAE). I have a flag in model.py (line 33, self.latent_transformer) which controls whether we're going to run the VAE or not. After num_warmup_steps have passed, that is when I want to turn on the variational training. This 'turning on' happens on line 69 in main.py, where I check if num_warmup_steps < training_iterations and run self.model.switch_latent_transformer(True), which simply sets self.latent_transformer in model.py to True.
In the forward() (and decode_greedy()) methods of my model I have an if statement which checks self.latent_transformer and runs the latent module (self.latent) if it is true. This if is where the bug is coming from. In model.py, if you move L77 and L126 up one line (or comment out JUST the if lines on L76 and L125), the model works fine. The presence of the if there seems to cause issues.
This is very weird because, as I said, I have another codebase which has basically the same logic and that runs fine... not sure what the difference is here.
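In case it helps, the mechanism described above looks roughly like this (a sketch reconstructed from the description, not the actual repo code; the class name VQGModel and the placeholder layers are made up, while latent_transformer, switch_latent_transformer, latent, and num_warmup_steps are the real names):
import torch.nn as nn

class VQGModel(nn.Module):  # hypothetical stand-in for the model in model.py
    def __init__(self):
        super().__init__()
        self.latent_transformer = False      # VAE path is off during warmup
        self.latent = nn.Linear(768, 768)    # placeholder for the actual latent module
        self.decoder = nn.Linear(768, 768)   # placeholder for the rest of the model

    def switch_latent_transformer(self, value):
        self.latent_transformer = value      # flips the VAE path on or off

    def forward(self, x):
        if self.latent_transformer:          # this `if` is where the trouble starts
            x = self.latent(x)
        return self.decoder(x)

# In main.py (the LightningModule), once enough steps have passed:
#     if self.args.num_warmup_steps < training_iterations:
#         self.model.switch_latent_transformer(True)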
There seem to be many dependencies I need to install manually; it looks like you don't have a requirements file in that repo. I'm afraid I won't have the time to debug this code unless it's minimal.
Make sure that when you switch to the new model, you have it on the right device. I suggest you put a debugger breakpoint at the points in the code where you get the outputs of the submodels, and inspect the model weights and inputs/outputs to see whether they are on the CUDA device.
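For example, a throwaway helper like this next to the suspicious call tells you immediately where each piece lives (a sketch; report_device is not part of Lightning, and the usage line below is hypothetical):
import torch

def report_device(module: torch.nn.Module, tensor: torch.Tensor, name: str = ""):
    # Print where a submodule's weights and its input tensor currently live.
    print(name, "| weights on", next(module.parameters()).device,
          "| input on", tensor.device)

# e.g. right before the latent module runs:
#     report_device(self.latent, encoded, "latent")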
Do you have any intuition about an if statement causing things to be placed on different devices?
Fixed the bug! Thanks for the advice about the debugger - I wouldn't have been able to solve it without it. This isn't strictly a Lightning issue, but having a useful warning for self.log could be worthwhile. Basically, I had a calculate_losses() method in my pl.LightningModule class:
def calculate_losses(self, loss, kld, r=0.5):
    if kld is None:
        loss_rec = loss  # not relevant to bug
        total_loss = loss  # not relevant to bug
        loss_kl = torch.tensor(0)  # THIS LINE CAUSES THE BUG
    else:
        # some other stuff
        total_loss = loss + beta * kld
    return total_loss, loss_rec, loss_kl
The function checks whether a parameter (in this case kld) is None or not. If it is None, then we have loss_kl = torch.tensor(0). In my training_step, I did a self.log('kl train loss', loss_kl) with the returned loss_kl (whether or not kld was None). Eventually (internal to Lightning), collate_tensors in step_result.py was called (line 699). The stack trace is as follows:
collate_tensors (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\core\step_result.py:699)
recursive_stack (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\core\step_result.py:677)
reduce_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\core\step_result.py:535)
auto_reduce_results_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:183)
auto_reduce_results_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:376)
has_batch_loop_finished (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:386)
on_train_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:373)
on_train_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\training_loop.py:802)
run_training_epoch (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\training_loop.py:557)
run_train (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\trainer.py:637)
start_training (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py:114)
start_training (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\accelerators\accelerator.py:73)
dispatch (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\trainer.py:546)
fit (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\trainer.py:499)
<module> (\data\nv419\machine_drive\nihir-vqg\main.py:266)
_run_code (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:87)
_run_module_code (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:97)
run_path (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:265)
_run_code (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:87)
_run_module_as_main (Current frame) (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:194)
Now, on the most RECENT call to auto_reduce_results_on_epoch_end (i.e. auto_reduce_results_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:183)), I see that there's an attribute called outputs. Inside outputs there's a batch-size's worth of keys which store the logged metrics/losses. For the first batch only (i.e. key 0), I see that my kl train loss isn't placed on the GPU (this is with the gpus argument on the Trainer enabled). I couldn't find this kl train loss for batch item 0 in previous stack frames, but I'm sure it does exist there.
Anyway, my solution was to place the loss_kl = torch.tensor(0) on the GPU. Specifically:
def calculate_losses(self, loss, kld, r=0.5):
    if kld is None:
        loss_rec = loss  # not relevant to bug
        total_loss = loss  # not relevant to bug
        loss_kl = torch.tensor(0).to(self.args.device)  # FIX. self.args.device = "cuda:0"
    else:
        # some other stuff
        total_loss = loss + beta * kld
    return total_loss, loss_rec, loss_kl
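For context, the logging pattern in my training_step that fed these values into Lightning's epoch-end reduction was essentially the following (a sketch; only the self.log call is verbatim, the rest is illustrative):
def training_step(self, batch, batch_idx):
    loss, kld = self(batch)
    total_loss, loss_rec, loss_kl = self.calculate_losses(loss, kld)
    # Before the fix, loss_kl was a CPU tensor whenever kld was None, while the
    # other logged values lived on cuda:0 - exactly the mix that torch.stack()
    # rejects when Lightning reduces the epoch's results.
    self.log('kl train loss', loss_kl)
    return total_loss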
Feel free to close, or let me know if you have any follow-up questions.
Great. Happy you were able to find it yourself with the debugger.
Btw, the LightningModule knows which device it is on. Instead of self.args.device, you can just ask for self.device directly.
... or, more generally (also outside the LightningModule, in pure PyTorch), this is also a solution:
total_loss = loss
loss_kl = torch.tensor(0).to(loss.device)
# or
loss_kl = torch.zeros_like(loss)
FWIW, I noticed the same runtime error in Lightning v1.2.5 under the specific operation order of
trainer.fit()
trainer.test()
model.to_onnx()
If trainer.test() is not run, then I get no error. Upgrading to v1.3.0 solved the issue, so I won't worry about a minimal repro with the BoringModel unless someone thinks that would still be useful. Mostly noting this for the reference of future readers.
🐛 Bug
At the end of the epoch, I get the error mentioned in the title. Here is a full stack-trace:
The error gets thrown at the end of an epoch. The model I've built is a VAE/latent-variable model. Note that this error does NOT get thrown (and the code works perfectly fine) if I do not use the VAE part of the model. The BoringModel doesn't reproduce the error, and I've got some research code in here which I'd rather not make public at this point in time. However, these are the parts of my code which may be contributing to the error:
Interestingly, the OP of issue #5053 also seemed to be using a VAE and had the error thrown (although the solution to his problem isn't the solution to my problem).
Expected behavior
For the model to continue training as normal
Environment
I'm running this code on a V100