Closed: kchuang625 closed this issue 3 years ago.
Thanks for the issue! Mind trying to reproduce it with the BoringModel and sharing the code?
Yes, I tried the BoringModel and it worked fine. However, with the same module but a different backbone, it crashed.
First, I set gpus=1 and returned {'loss': loss} in training_step, and the error RuntimeError: grad can be implicitly created only for scalar outputs occurred at the first training step (I also printed the returned item: {'loss': tensor(1755106.8750, device='cuda:0', grad_fn=<AddBackward0>)}). So I returned loss directly instead of the dictionary, and it worked fine.
After that, I simply changed gpus to 2, and the error RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0 occurred at the end of the epoch.
I think there might be something wrong with how items are collected and losses are backpropagated in the training steps.
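For context, this is the only change that made the single-GPU run work (a minimal sketch; _step is the helper from the module below):

def training_step(self, batch, batch_idx):
    out = self._step(batch, 'train_loss')
    # return out          # returning {'loss': loss} raised "grad can be implicitly created only for scalar outputs"
    return out['loss']    # returning the loss tensor directly works with gpus=1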
Here is my module:
import torch
import torch.nn.functional as F
from torch import nn
from torchvision import models
import pytorch_lightning as pl


class ResNetVAE(pl.LightningModule):
    def __init__(
        self,
        lr,
        weight_decay,
        fc_hidden1=1024,
        fc_hidden2=1024,
        drop_p=0.2,
        CNN_embed_dim=256
    ):
        super().__init__()
        self.lr = lr
        self.weight_decay = weight_decay
        self.fc_hidden1, self.fc_hidden2, self.CNN_embed_dim = fc_hidden1, fc_hidden2, CNN_embed_dim

        # CNN architectures
        self.ch1, self.ch2, self.ch3, self.ch4 = 16, 32, 64, 128
        self.k1, self.k2, self.k3, self.k4 = (5, 5), (3, 3), (3, 3), (3, 3)      # 2d kernel sizes
        self.s1, self.s2, self.s3, self.s4 = (2, 2), (2, 2), (2, 2), (2, 2)      # 2d strides
        self.pd1, self.pd2, self.pd3, self.pd4 = (0, 0), (0, 0), (0, 0), (0, 0)  # 2d padding

        # encoding components
        resnet = models.resnet152(pretrained=True)
        modules = list(resnet.children())[:-1]  # delete the last fc layer
        self.resnet = nn.Sequential(*modules)
        self.fc1 = nn.Linear(resnet.fc.in_features, self.fc_hidden1)
        self.bn1 = nn.BatchNorm1d(self.fc_hidden1, momentum=0.01)
        self.fc2 = nn.Linear(self.fc_hidden1, self.fc_hidden2)
        self.bn2 = nn.BatchNorm1d(self.fc_hidden2, momentum=0.01)

        # Latent vectors mu and sigma
        self.fc3_mu = nn.Linear(self.fc_hidden2, self.CNN_embed_dim)      # output = CNN embedding latent variables
        self.fc3_logvar = nn.Linear(self.fc_hidden2, self.CNN_embed_dim)  # output = CNN embedding latent variables

        # Sampling vector
        self.fc4 = nn.Linear(self.CNN_embed_dim, self.fc_hidden2)
        self.fc_bn4 = nn.BatchNorm1d(self.fc_hidden2)
        self.fc5 = nn.Linear(self.fc_hidden2, 64 * 4 * 4)
        self.fc_bn5 = nn.BatchNorm1d(64 * 4 * 4)
        self.relu = nn.ReLU(inplace=True)

        # Decoder
        self.convTrans6 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=64, out_channels=32, kernel_size=self.k4, stride=self.s4,
                               padding=self.pd4),
            nn.BatchNorm2d(32, momentum=0.01),
            nn.ReLU(inplace=True),
        )
        self.convTrans7 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=32, out_channels=8, kernel_size=self.k3, stride=self.s3,
                               padding=self.pd3),
            nn.BatchNorm2d(8, momentum=0.01),
            nn.ReLU(inplace=True),
        )
        self.convTrans8 = nn.Sequential(
            nn.ConvTranspose2d(in_channels=8, out_channels=3, kernel_size=self.k2, stride=self.s2,
                               padding=self.pd2),
            nn.BatchNorm2d(3, momentum=0.01),
            nn.Sigmoid()
        )

    def loss_function(self, recon_x, x, mu, logvar):
        MSE = F.binary_cross_entropy(recon_x, x, reduction='sum')
        KLD = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
        # NOTE: weight_kld is not defined in this snippet (see the discussion below)
        return MSE + KLD * weight_kld

    def training_step(self, batch, batch_idx):
        return self._step(batch, 'train_loss')  # ['loss']

    def validation_step(self, batch, batch_idx):
        return self._step(batch, 'valid_loss')

    def _step(self, batch, name):
        x_reconst, z, mu, logvar = self._forward(batch)
        loss = self.loss_function(x_reconst, batch, mu, logvar)
        self.log(
            name,
            loss,
            on_step=True,
            on_epoch=True,
            prog_bar=True,
            logger=True,
        )
        return {'loss': loss}

    def configure_optimizers(self):
        return torch.optim.Adam(
            self.parameters(),
            lr=self.lr,
            weight_decay=self.weight_decay
        )

    def _encode(self, x):
        # ResNet
        x = self.resnet(x)
        x = x.view(x.size(0), -1)
        # FC layers
        x = self.bn1(self.fc1(x))
        x = self.relu(x)
        x = self.bn2(self.fc2(x))
        x = self.relu(x)
        mu, logvar = self.fc3_mu(x), self.fc3_logvar(x)
        return mu, logvar

    def _reparameterize(self, mu, logvar):
        std = logvar.mul(0.5).exp_()
        eps = std.data.new(std.size()).normal_()
        return eps.mul(std).add_(mu)

    def _decode(self, z):
        x = self.relu(self.fc_bn4(self.fc4(z)))
        x = self.relu(self.fc_bn5(self.fc5(x))).view(-1, 64, 4, 4)
        x = self.convTrans6(x)
        x = self.convTrans7(x)
        x = self.convTrans8(x)
        x = F.interpolate(x, size=(224, 224), mode='bilinear')
        return x

    def _forward(self, x):
        mu, logvar = self._encode(x)
        z = self._reparameterize(mu, logvar)
        x_reconst = self._decode(z)
        return x_reconst, z, mu, logvar
Hi @kchuang625!
I am trying to replicate it in Colab (https://colab.research.google.com/drive/1nRhiaMFPdc8vh7hAX-u3bYGrpkGVowcb?usp=sharing) but I'm getting NameError: name 'weight_kld' is not defined.
Can you update the snippet?
Hi @carmocca, thanks for the reply!
Here is the updated notebook: https://colab.research.google.com/drive/1Ra1T6Jdqq8U0GNDrhZc37JgSzR8jYvFc?usp=sharing
Note: I'm getting the following error (not the reported error) when I try to run it on DP:
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 1024])
I'm assuming it's due to the model architecture. Can you take a look, @kchuang625?
@carmocca Oh! It's because there are BatchNorm layers in the model. Simply modify the dataset and dataloader with the following code and it can be tested as a 2-GPU run (a sketch of the full test follows the snippet):
class RandomDataset(torch.utils.data.Dataset):
    def __getitem__(self, index):
        return torch.rand(3, 224, 224)

    def __len__(self):
        return 32


train_loader = torch.utils.data.DataLoader(
    RandomDataset(),
    batch_size=4,
)
valid_loader = torch.utils.data.DataLoader(
    RandomDataset(),
    batch_size=4,
)
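Putting the 2-GPU test together would look something like this (a sketch rather than the exact notebook code; the lr/weight_decay values and the PL 1.1-style dp arguments are assumptions):

model = ResNetVAE(lr=1e-4, weight_decay=1e-5)
trainer = pl.Trainer(gpus=2, accelerator='dp', max_epochs=1)
trainer.fit(model, train_loader, valid_loader)

With 32 samples and batch_size=4, each of the two GPUs sees batches of size 2 under dp, so the BatchNorm1d layers get more than one value per channel.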
Looks like this has been fixed already since it works with current master.
Note that you will have to update your step function to:
def _step(self, batch, name):
    ...
    return loss  # was {'loss': loss}
Otherwise you'll get RuntimeError: grad can be implicitly created only for scalar outputs
Please try yourself and close this issue if it works for you 😄
@carmocca thanks for the reply!
I did test the script when PL 1.1.6 was released, and changing the return type from Dict to torch.Tensor is indeed my temporary solution for now! It means I don't need to manually modify the source code 😄
However, I have the impression that returning a Dict with a loss key was supported in 0.x.x (it's really convenient for returning other things along the way). I wonder if this feature has been deprecated?
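For reference, what I mean by the convenient dict return is something like this (a sketch based on the module above; the extra keys are only illustrative):

def _step(self, batch, name):
    x_reconst, z, mu, logvar = self._forward(batch)
    loss = self.loss_function(x_reconst, batch, mu, logvar)
    self.log(name, loss, on_step=True, on_epoch=True, prog_bar=True, logger=True)
    # 'loss' is the key Lightning uses for backward; the extra keys just ride
    # along so they can be inspected in the *_step_end / *_epoch_end hooks
    return {'loss': loss, 'mu': mu, 'logvar': logvar}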
You are correct, it should work.
Do you mind opening a new issue about this? Since the original purpose of this one was to fix:
RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0
Tag me and I'll take a look. Thanks!
🐛 Bug
When using multiple GPUs with 'dp', the error RuntimeError: All input tensors must be on the same device. Received cuda:1 and cuda:0 occurs. It means the tensors collected at epoch end come from different devices.

Expected behavior
Either they should be moved to the same device, or the aggregating function should be able to handle items from different devices.

Environment
How you installed PyTorch (conda, pip, source): pip

A quick but not safe solution
Modify the collate_tensors function in pytorch_lightning/core/step_result.py:
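The idea is roughly to move all collected tensors onto one device before they are stacked; a sketch of that kind of change (an illustration only, not the actual pytorch_lightning code):

import torch

def to_common_device(tensors):
    # Hypothetical helper: put every tensor on the device of the first one so
    # that torch.stack / torch.cat don't see a mix of cuda:0 and cuda:1 items.
    device = tensors[0].device
    return [t.to(device) for t in tensors]

# e.g. before aggregating the epoch-end outputs:
# outputs = to_common_device(outputs)
# stacked = torch.stack(outputs)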