Hi! Are you running with the DP accelerator? You are on 1.1.4, which is quite old, and we have seen this error before. Please upgrade to the latest version if you can; otherwise, please provide a repro example.
Here in your code I see a suspicious line:
total_loss.to(self.args.device)
Please check that this is the correct device.
(The device in Lightning is available with self.device.)
Hey, thanks for the quick response. I added the .to(self.args.device) to make sure it wasn't the loss that was causing the issue (it wouldn't make much sense for it to be... but I still thought it was worth a check, because I'm unsure what the items in torch.stack(items) are).
I'm not using any accelerator; the Trainer is as follows:
trainer = pl.Trainer(max_steps=args.total_training_steps, gradient_clip_val=5,
val_check_interval=250, limit_val_batches=200, callbacks=[early_stop_callback])
OK, so I've updated to 1.2.7. I now get a different error during the validation sanity check (which could perhaps have been the reason the original error in the OP was thrown). However, I'm not sure why it's complaining, since I've put both the model and the data on the GPU:
File "main.py", line 258, in <module>
trainer.fit(trainVQG, data_loader, val_data_loader)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 607, in run_train
self.run_sanity_check(self.lightning_module)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 860, in run_sanity_check
_, eval_results = self.run_evaluation(max_batches=self.num_sanity_val_batches)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 725, in run_evaluation
output = self.evaluation_loop.evaluation_step(batch, batch_idx, dataloader_idx)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/evaluation_loop.py", line 166, in evaluation_step
output = self.trainer.accelerator.validation_step(args)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 177, in validation_step
return self.training_type_plugin.validation_step(*args)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 131, in validation_step
return self.lightning_module.validation_step(*args, **kwargs)
File "main.py", line 85, in validation_step
loss, kld = self(batch)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "main.py", line 48, in forward
loss = self.model(images, question_ids, question_attention_masks, input_ids, input_attention_masks)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/machine_drive/nihir-vqg/model.py", line 33, in forward
images = self.image_projection(images).unsqueeze(1) # [B, D]
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/container.py", line 117, in forward
input = module(input)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/linear.py", line 93, in forward
return F.linear(input, self.weight, self.bias)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/functional.py", line 1690, in linear
ret = torch.addmm(bias, input, weight.t())
RuntimeError: Tensor for 'out' is on CPU, Tensor for argument #1 'self' is on CPU, but expected them to be on GPU (while checking arguments for addmm)
The important lines are:
File "main.py", line 85, in validation_step
loss, kld = self(batch)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "main.py", line 48, in forward
loss = self.model(images, question_ids, question_attention_masks, input_ids, input_attention_masks)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/torch/nn/modules/module.py", line 727, in _call_impl
result = self.forward(*input, **kwargs)
File "/data/nv419/machine_drive/nihir-vqg/model.py", line 33, in forward
images = self.image_projection(images).unsqueeze(1) # [B, D]
So, as we can see, the model is on the device (which is the GPU):
args = parser.parse_args()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
args.device = device
trainVQG = TrainVQG(args, tokenizer).to(device)
All the data has been put on the device too:
def forward(self, batch):
images, question_ids, question_attention_masks, input_ids, input_attention_masks = batch["images"], batch[
"question_ids"], batch["question_attention_masks"], batch["input_ids"], batch["input_attention_masks"]
images, question_ids, question_attention_masks, input_ids, input_attention_masks = images.to(self.args.device), question_ids.to(
self.args.device), question_attention_masks.to(self.args.device), input_ids.to(self.args.device), input_attention_masks.to(self.args.device)
print(images)
loss = self.model(images, question_ids, question_attention_masks, input_ids, input_attention_masks)
return loss
That print(images) you see there confirms that the device is cuda:0. And from the line images = self.image_projection(images).unsqueeze(1) # [B, D], self.image_projection is just an nn.Sequential() module with a linear layer and a batch norm:
self.image_projection = nn.Sequential(
    nn.Linear(512, 768),
    nn.BatchNorm1d(768, momentum=0.01)
)
You don't need to move the data or model to GPU (and you shouldn't). Lightning takes care of that. Use the gpus Trainer argument :)
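For example, based on the Trainer call shown above, something like this should be enough (a sketch; gpus=1 assumes a single-GPU machine):
trainer = pl.Trainer(max_steps=args.total_training_steps, gradient_clip_val=5,
                     val_check_interval=250, limit_val_batches=200,
                     callbacks=[early_stop_callback],
                     gpus=1)  # Lightning moves the model and each batch to the GPU for you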
What is this magic 🤯. How come explicitly calling .to(device) on the model and data didn't work, but setting gpus in the Trainer did?
Model is training now! Thanks for your rapid help - let's see if it still crashes after it completes an epoch.
Same error at the end of the epoch:
Traceback (most recent call last):
File "main.py", line 256, in <module>
trainer.fit(trainVQG, data_loader, val_data_loader)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 499, in fit
self.dispatch()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 546, in dispatch
self.accelerator.start_training(self)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/accelerators/accelerator.py", line 73, in start_training
self.training_type_plugin.start_training(trainer)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/plugins/training_type/training_type_plugin.py", line 114, in start_training
self._results = trainer.run_train()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/trainer.py", line 637, in run_train
self.train_loop.run_training_epoch()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 557, in run_training_epoch
self.on_train_epoch_end(epoch_output)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/training_loop.py", line 802, in on_train_epoch_end
self.trainer.logger_connector.on_train_epoch_end()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/logger_connector.py", line 373, in on_train_epoch_end
self.cached_results.has_batch_loop_finished = True
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 386, in has_batch_loop_finished
self.auto_reduce_results_on_epoch_end()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 376, in auto_reduce_results_on_epoch_end
hook_result.auto_reduce_results_on_epoch_end()
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/trainer/connectors/logger_connector/epoch_result_store.py", line 183, in auto_reduce_results_on_epoch_end
outputs = type(time_reduced_outputs[0]).reduce_on_epoch_end(time_reduced_outputs)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 535, in reduce_on_epoch_end
recursive_stack(result)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 677, in recursive_stack
result[k] = collate_tensors(v)
File "/data/nv419/anaconda3/envs/blt-vqg/lib/python3.8/site-packages/pytorch_lightning/core/step_result.py", line 699, in collate_tensors
return torch.stack(items)
RuntimeError: All input tensors must be on the same device. Received cpu and cuda:0
@nihirv please have a look at our bug report model. If you can share a minimal example, that would help us fix this faster. And a quick tip: you can set Trainer(limit_train_batches=X) so you don't have to wait 14 hours for the epoch to finish :)
Apologies for the delayed response - I've been busy with EMNLP. But I've figured out where the bug is coming from! This is very weird because I have another codebase which uses exactly the same logic, and it works perfectly fine there.
You can see the code here: https://github.com/nihirv/controllable-vqg/tree/lightning_branch. I've set up the data loader to create dummy examples, so once you clone it, all you need to do is run this command: python3 main.py --num_warmup_steps 0.
Here's how the code works and where the bug is coming from. After a certain number of warmup steps, I want to start training a latent variable model (VAE). I have a flag in model.py (line 33, self.latent_transformer) which controls whether we're going to run the VAE or not. After num_warmup_steps have passed, that is when I want to turn on the variational training. This 'turning on' happens on line 69 in main.py, where I check if num_warmup_steps < training_iterations and run self.model.switch_latent_transformer(True), which simply sets self.latent_transformer in model.py to True.
In the forward() (and decode_greedy()) methods of my model I have an if statement which checks self.latent_transformer and runs the latent module (self.latent) if it is true. This if is where the bug is coming from. In model.py, if you move L77 and L126 up one line (or comment out JUST the if lines on L76 and L125), the model works fine. The presence of the if there seems to cause issues.
This is very weird because, as I said, I have another codebase which has basically the same logic and that runs fine... not sure what the difference is here.
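In case it helps, the mechanism described above looks roughly like this (a sketch reconstructed from the description, not the actual repo code; the class name VQGModel and the placeholder layers are made up, while latent_transformer, switch_latent_transformer, latent, and num_warmup_steps are the real names):
import torch.nn as nn

class VQGModel(nn.Module):  # hypothetical stand-in for the model in model.py
    def __init__(self):
        super().__init__()
        self.latent_transformer = False      # VAE path is off during warmup
        self.latent = nn.Linear(768, 768)    # placeholder for the actual latent module
        self.decoder = nn.Linear(768, 768)   # placeholder for the rest of the model

    def switch_latent_transformer(self, value):
        self.latent_transformer = value      # flips the VAE path on or off

    def forward(self, x):
        if self.latent_transformer:          # this `if` is where the trouble starts
            x = self.latent(x)
        return self.decoder(x)

# In main.py (the LightningModule), once enough steps have passed:
#     if self.args.num_warmup_steps < training_iterations:
#         self.model.switch_latent_transformer(True)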
There seem to be many dependencies I need to install manually; it looks like you don't have a requirements file in that repo. I'm afraid I won't have the time to debug this code unless it's minimal.
Make sure that when you switch to the new model, you have it on the right device. I suggest you put a debugger breakpoint at the points in the code where you get the outputs of the submodels, and inspect the model weights and inputs/outputs to see whether they are on the CUDA device.
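For example, a throwaway helper like this next to the suspicious call tells you immediately where each piece lives (a sketch; report_device is not part of Lightning, and the usage line below is hypothetical):
import torch

def report_device(module: torch.nn.Module, tensor: torch.Tensor, name: str = ""):
    # Print where a submodule's weights and its input tensor currently live.
    print(name, "| weights on", next(module.parameters()).device,
          "| input on", tensor.device)

# e.g. right before the latent module runs:
#     report_device(self.latent, encoded, "latent")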
Do you have any intuition about an if statement causing things to be placed on different devices?
Fixed the bug! Thanks for the advice about the debugger - I wouldn't have been able to solve it without it. This isn't strictly a Lightning issue, but having a useful warning for self.log could be worthwhile. Basically, I had a calculate_losses() method in my pl.LightningModule class:
def calculate_losses(self, loss, kld, r=0.5):
    if kld is None:
        loss_rec = loss  # not relevant to bug
        total_loss = loss  # not relevant to bug
        loss_kl = torch.tensor(0)  # THIS LINE CAUSES THE BUG
    else:
        # some other stuff
        total_loss = loss + beta * kld
    return total_loss, loss_rec, loss_kl
The function checks whether a parameter (in this case kld) is None or not. If it is None, then we have loss_kl = torch.tensor(0). In my training_step, I did a self.log('kl train loss', loss_kl) with the returned loss_kl (whether or not kld was None). Eventually (internal to Lightning), collate_tensors in step_result.py was called (line 699). The stack trace is as follows:
collate_tensors (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\core\step_result.py:699)
recursive_stack (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\core\step_result.py:677)
reduce_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\core\step_result.py:535)
auto_reduce_results_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:183)
auto_reduce_results_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:376)
has_batch_loop_finished (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:386)
on_train_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\logger_connector.py:373)
on_train_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\training_loop.py:802)
run_training_epoch (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\training_loop.py:557)
run_train (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\trainer.py:637)
start_training (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\plugins\training_type\training_type_plugin.py:114)
start_training (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\accelerators\accelerator.py:73)
dispatch (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\trainer.py:546)
fit (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\trainer.py:499)
<module> (\data\nv419\machine_drive\nihir-vqg\main.py:266)
_run_code (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:87)
_run_module_code (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:97)
run_path (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:265)
_run_code (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:87)
_run_module_as_main (Current frame) (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\runpy.py:194)
Now, on the most RECENT call to auto_reduce_results_on_epoch_end (i.e. auto_reduce_results_on_epoch_end (\data\nv419\anaconda3\envs\blt-vqg\lib\python3.8\site-packages\pytorch_lightning\trainer\connectors\logger_connector\epoch_result_store.py:183)), I see that there's an attribute called outputs. Inside outputs there's a batch-size's worth of keys which store the logged metrics/losses. For the first batch only (i.e. key 0), I see that my kl train loss isn't placed on the GPU (this is with the gpus argument on the Trainer enabled). I couldn't find this kl train loss for batch item 0 in previous stack frames, but I'm sure it does exist there.
Anyway, my solution was to place the loss_kl = torch.tensor(0) on the GPU. Specifically:
def calculate_losses(self, loss, kld, r=0.5):
    if kld is None:
        loss_rec = loss  # not relevant to bug
        total_loss = loss  # not relevant to bug
        loss_kl = torch.tensor(0).to(self.args.device)  # FIX. self.args.device = "cuda:0"
    else:
        # some other stuff
        total_loss = loss + beta * kld
    return total_loss, loss_rec, loss_kl
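For context, the logging pattern in my training_step that fed these values into Lightning's epoch-end reduction was essentially the following (a sketch; only the self.log call is verbatim, the rest is illustrative):
def training_step(self, batch, batch_idx):
    loss, kld = self(batch)
    total_loss, loss_rec, loss_kl = self.calculate_losses(loss, kld)
    # Before the fix, loss_kl was a CPU tensor whenever kld was None, while the
    # other logged values lived on cuda:0 - exactly the mix that torch.stack()
    # rejects when Lightning reduces the epoch's results.
    self.log('kl train loss', loss_kl)
    return total_loss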
Feel free to close, or let me know if you have any follow-up questions.
Great. Happy you were able to find it yourself with the debugger.
Btw, the LightningModule knows which device it is on. Instead of self.args.device, you can just ask for self.device directly.
... or, more generally (also outside the LightningModule, in pure PyTorch), this is also a solution:
total_loss = loss
loss_kl = torch.tensor(0).to(loss.device)
# or
loss_kl = torch.zeros_like(loss)
FWIW, I noticed the same runtime error in Lightning v1.2.5 under the specific operation order of
trainer.fit()
trainer.test()
model.to_onnx()
If trainer.test() is not run, then I get no error. Upgrading to v1.3.0 solved the issue, so I won't worry about a minimal repro with the BoringModel unless someone thinks that would still be useful. Mostly noting this for the reference of future readers.
🐛 Bug
At the end of the epoch, I get the error mentioned in the title. Here is a full stack-trace:
The error gets thrown at the end of an epoch. The model I've built is a VAE/latent-variable model. Note that this error does NOT get thrown (and the code works perfectly fine) if I do not use the VAE part of the model. The BoringModel doesn't reproduce the error, and I've got some research code in here which I'd rather not make public at this point in time. However, these are the parts of my code which may be contributing to the error:
Interestingly, the OP of issue #5053 also seemed to be using a VAE and had the error thrown (although the solution to his problem isn't the solution to my problem).
Expected behavior
For the model to continue training as normal
Environment
I'm running this code on a V100