Lightning-AI / pytorch-lightning

Pretrain, finetune ANY AI model of ANY size on multiple GPUs, TPUs with zero code changes.
https://lightning.ai
Apache License 2.0

Error in Logger on epoch end when using Multiple GPUs #5053

Closed kchuang625 closed 3 years ago

kchuang625 commented 3 years ago

🐛 Bug

When using multiple GPUs with 'dp', the error RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0 occurs. This means the tensors collected at epoch end come from different devices.

Expected behavior

Either the collected items should be moved onto the same device, or the aggregating function should be able to handle items from different devices.

Environment

A quick but unsafe workaround

from typing import List, Tuple, Union

import torch
from torch import Tensor


def collate_tensors(items: Union[List, Tuple]) -> Union[Tensor, List, Tuple]:
    if not items or not isinstance(items, (list, tuple)) or any(not isinstance(item, Tensor) for item in items):
        # items is not a sequence, is empty, or contains non-tensors
        return items

    # add the following line: cast every item to the same type as the first one
    items = [item.type_as(items[0]) for item in items]

    if all(item.ndim == 0 for item in items):
        # all tensors are scalars, we need to stack
        return torch.stack(items)

    if all(item.ndim >= 1 and item.shape[1:] == items[0].shape[1:] for item in items):
        # we can concatenate along the first dimension
        return torch.cat(items)

    return items
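
For context, here is a minimal sketch of the underlying failure (hypothetical; it assumes two visible CUDA devices and uses .to() rather than type_as() purely for illustration): torch.stack refuses to combine tensors that live on different GPUs, so they first have to be moved onto a single device.

import torch

# Two scalar losses collected from different GPUs (illustrative values only).
a = torch.tensor(1.0, device='cuda:0')
b = torch.tensor(2.0, device='cuda:1')

# torch.stack([a, b])  # RuntimeError: All input tensors must be on the same device.

# Moving every item onto the first item's device makes the aggregation work:
items = [t.to(a.device) for t in (a, b)]
print(torch.stack(items))  # tensor([1., 2.], device='cuda:0')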
edenlightning commented 3 years ago

Thanks for the issue! Mind trying to reproduce it with the BoringModel and sharing the code?

kchuang625 commented 3 years ago

Yes, I tried the BoringModel and it worked fine. However, with the same module but a different backbone, it crashed.

First, I set gpus=1 and returned {'loss': loss} in training_step, and the error RuntimeError: grad can be implicitly created only for scalar outputs occurred at the first training step (I also printed the returned item: {'loss': tensor(1755106.8750, device='cuda:0', grad_fn=<AddBackward0>)}). So I returned loss directly instead of the dictionary and it worked fine.

After that, I simply changed gpus to 2, and the error RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0 happened at epoch end.

I think there might be something wrong in how items are collected and how losses are backpropagated in the training steps.

kchuang625 commented 3 years ago

Here is my module:

import torch
import torch.nn as nn
import torch.nn.functional as F
import pytorch_lightning as pl
from torchvision import models


class ResNetVAE(pl.LightningModule):
    def __init__(
        self,
        lr,
        weight_decay,
        fc_hidden1=1024,
        fc_hidden2=1024,
        drop_p=0.2,
        CNN_embed_dim=256
    ):
        super().__init__()

        self.lr = lr
        self.weight_decay = weight_decay
        self.fc_hidden1, self.fc_hidden2, self.CNN_embed_dim = fc_hidden1, fc_hidden2, CNN_embed_dim

        # CNN architecture
        self.ch1, self.ch2, self.ch3, self.ch4 = 16, 32, 64, 128
        self.k1, self.k2, self.k3, self.k4 = (5, 5), (3, 3), (3, 3), (3, 3)      # 2d kernel size
        self.s1, self.s2, self.s3, self.s4 = (2, 2), (2, 2), (2, 2), (2, 2)      # 2d strides
        self.pd1, self.pd2, self.pd3, self.pd4 = (0, 0), (0, 0), (0, 0), (0, 0)  # 2d padding

        # encoding components
        resnet = models.resnet152(pretrained=True)
        modules = list(resnet.children())[:-1]      # delete the last fc layer.
        self.resnet = nn.Sequential(*modules)
        self.fc1 = nn.Linear(resnet.fc.in_features, self.fc_hidden1)
        self.bn1 = nn.BatchNorm1d(self.fc_hidden1, momentum=0.01)
        self.fc2 = nn.Linear(self.fc_hidden1, self.fc_hidden2)
        self.bn2 = nn.BatchNorm1d(self.fc_hidden2, momentum=0.01)
        # Latent vectors mu and sigma
        self.fc3_mu = nn.Linear(self.fc_hidden2, self.CNN_embed_dim)      # output = CNN embedding latent variables
        self.fc3_logvar = nn.Linear(self.fc_hidden2, self.CNN_embed_dim)  # output = CNN embedding latent variables

        # Sampling vector
        self.fc4 = nn.Linear(self.CNN_embed_dim, self.fc_hidden2)
        self.fc_bn4 = nn.BatchNorm1d(self.fc_hidden2)
        self.fc5 = nn.Linear(self.fc_hidden2, 64 * 4 * 4)
        self.fc_bn5 = nn.BatchNorm1d(64 * 4 * 4)
        self.relu = nn.ReLU(inplace=True)

        # Decoder
        self.convTrans6 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=self.k4, stride=self.s4,
                               padding=self.pd4),
            nn.BatchNorm2d(32, momentum=0.01),
            nn.ReLU(inplace=True),
        )
        self.convTrans7 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=32, out_channels=8, kernel_size=self.k3, stride=self.s3,
                               padding=self.pd3),
            nn.BatchNorm2d(8, momentum=0.01),
            nn.ReLU(inplace=True),
        )

        self.convTrans8 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=8, out_channels=3, kernel_size=self.k2, stride=self.s2,
                               padding=self.pd2),
            nn.BatchNorm2d(3, momentum=0.01),
            nn.Sigmoid()
        )

    def loss_function(self, recon_x, x, mu, logvar):
        MSE = F.binary_cross_entropy(recon_x, x, reduction='sum')
        KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        return MSE + KLD*weight_kld

    def training_step(self, batch, batch_idx):
        return self._step(batch, 'train_loss') # ['loss']

    def validation_step(self, batch, batch_idx):
        return self._step(batch, 'valid_loss')

    def _step(self, batch, name):
        x_reconst, z, mu, logvar = self._forward(batch)
        loss = self.loss_function(x_reconst, batch, mu, logvar)
        self.log(
            name,
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(
            self.parameters(),
            lr=self.lr,
            weight_decay=self.weight_decay
        )

    def _encode(self, x):
        # ResNet
        x = self.resnet(x)
        x = x.view(x.size(0), -1)

        # FC layers
        x = self.bn1(self.fc1(x))
        x = self.relu(x)
        x = self.bn2(self.fc2(x))
        x = self.relu(x)
        mu, logvar = self.fc3_mu(x), self.fc3_logvar(x)
        return mu, logvar

    def _reparameterize(self, mu, logvar):
        std = logvar.mul(0.5).exp_()
        eps = std.data.new(std.size()).normal_()
        return eps.mul(std).add_(mu)

    def _decode(self, z):
        x = self.relu(self.fc_bn4(self.fc4(z)))
        x = self.relu(self.fc_bn5(self.fc5(x))).view(-1, 64, 4, 4)
        x = self.convTrans6(x)
        x = self.convTrans7(x)
        x = self.convTrans8(x)
        x = F.interpolate(x, size=(224, 224), mode='bilinear')
        return x

    def _forward(self, x):
        mu, logvar = self._encode(x)
        z = self._reparameterize(mu, logvar)
        x_reconst = self._decode(z)
        return x_reconst, z, mu, logvar
carmocca commented 3 years ago

Hi @kchuang625!

I am trying to replicate it in Colab (https://colab.research.google.com/drive/1nRhiaMFPdc8vh7hAX-u3bYGrpkGVowcb?usp=sharing) but I'm getting NameError: name 'weight_kld' is not defined

Can you update the snippet?

kchuang625 commented 3 years ago

Hi @carmocca, thanks for the reply!

Here is the updated notebook: https://colab.research.google.com/drive/1Ra1T6Jdqq8U0GNDrhZc37JgSzR8jYvFc?usp=sharing


carmocca commented 3 years ago

I'm getting the following error (not the reported error) when I try to run it on DP.

ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 1024])

I'm assuming it's due to the model architecture. Can you take a look, @kchuang625?

kchuang625 commented 3 years ago

@carmocca Oh! That's because there are BatchNorm layers in the model. Simply modify the dataset and dataloader with the following code and it can be tested on a 2-GPU setup:

class RandomDataset(torch.utils.data.Dataset):
    def __getitem__(self, index):
        return torch.rand(3, 224, 224)

    def __len__(self):
        return 32

train_loader = torch.utils.data.DataLoader(
    RandomDataset(),
    batch_size=4,
)

valid_loader = torch.utils.data.DataLoader(
    RandomDataset(),
    batch_size=4,
)
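
For completeness, a sketch of how everything above might be wired together on two GPUs (hypothetical values for lr, weight_decay, and max_epochs; the accelerator argument follows the 1.1-era Trainer API, and it assumes weight_kld is defined):

# Hypothetical 2-GPU reproduction sketch; argument values are illustrative only.
import pytorch_lightning as pl

model = ResNetVAE(lr=1e-3, weight_decay=1e-5)

trainer = pl.Trainer(
    gpus=2,
    accelerator='dp',  # the distributed mode under which the error appears
    max_epochs=1,
)
trainer.fit(model, train_loader, valid_loader)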
carmocca commented 3 years ago

Looks like this has been fixed already since it works with current master.

Note that you will have to update your step function to:

    def _step(self, batch, name):
        ...
        return loss  # was {'loss': loss}

Otherwise you'll get RuntimeError: grad can be implicitly created only for scalar outputs
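
Applied to the module above, the adjusted _step would look something like this sketch:

    def _step(self, batch, name):
        x_reconst, z, mu, logvar = self._forward(batch)
        loss = self.loss_function(x_reconst, batch, mu, logvar)
        self.log(name, loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
        return loss  # return the scalar tensor directly instead of {'loss': loss}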

Please try yourself and close this issue if it works for you 😄

kchuang625 commented 3 years ago

@carmocca thanks for the reply!

I did test the script when PL 1.1.6 was released, and it turns out that changing the return type from Dict to torch.Tensor is actually my temporary workaround for now! It means I no longer need to manually modify the source code 😄

However, I have the impression that returning a Dict with a loss key was supported in 0.x.x (it's really convenient for returning other things as well).

I wonder if this feature has been deprecated?

carmocca commented 3 years ago

You are correct, it should work.

Do you mind opening a new issue about this? Since the original purpose of this one was to fix: RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0

Tag me and I'll take a look. Thanks!