[Open] shankartmv opened this issue 4 months ago
After some debugging I figured out a way to get around this problem. By resizing my images to a fixed 1024×720, I can see that the input and output shapes of my AutoencoderKL (obtained from the pytorchinfo model summary) are consistent. But I would still like to know the reason behind this error.
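A sketch of this workaround (not from the original comment; it assumes torchvision's `resize` and that 1024×720 means height × width). The relevant property is that both sides are divisible by 4, i.e. by 2 for each of the two downsampling levels of the model below, so the decoder output comes back at the same size:

```python
import torch
from torchvision.transforms.functional import resize

img = torch.randn(3, 1225, 966)         # original image (C, H, W)
img_resized = resize(img, [1024, 720])  # both 1024 and 720 are divisible by 4
print(img_resized.shape)                # torch.Size([3, 1024, 720])
```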
I believe this is caused by downsampling and upsampling on data with a non-power-of-2 dimension.
I think this happens because you have downsampling layers that divide the spatial dimensions by 2 and then upsample them again, so unless you play around with the paddings and strides to make sure things end up the same size, you can run into errors. I would recommend simply padding your inputs to a size that is divisible by 2 at every downsampling level.
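A minimal padding sketch (not part of the original thread), using plain `torch.nn.functional.pad`; with `num_channels=(128, 256, 384)` as in the post below there are two downsampling steps, so the spatial dimensions need to be divisible by `k = 4`. MONAI's `DivisiblePad` transform should achieve the same thing declaratively:

```python
import torch
import torch.nn.functional as F

def pad_to_divisible(images: torch.Tensor, k: int = 4) -> torch.Tensor:
    """images: (B, C, H, W). Pads the last two dims up to the next multiple of k."""
    h, w = images.shape[-2:]
    pad_h = (-h) % k
    pad_w = (-w) % k
    # F.pad takes (left, right, top, bottom) for the last two dimensions
    return F.pad(images, (0, pad_w, 0, pad_h))

x = torch.randn(1, 3, 1225, 966)
print(pad_to_divisible(x).shape)  # torch.Size([1, 3, 1228, 968])
```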
I am trying to train an AutoencoderKL model on RGB images with dimensions (3, 1225, 966). Here is the code that I use (similar to what's in tutorials/generative/2d_ldm/2d_ldm_tutorial.ipynb):

```python
autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)
autoencoderkl = autoencoderkl.to(device)
```
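A quick way to see the mismatch before the loss is ever computed (a sketch, not from the original post; it assumes the tutorial's convention that the model returns `reconstruction, z_mu, z_sigma`):

```python
with torch.no_grad():
    dummy = torch.randn(1, 3, 1225, 966, device=device)
    reconstruction, z_mu, z_sigma = autoencoderkl(dummy)
    # The reconstruction comes back with smaller spatial dims than the input;
    # the last dimension ends up at 964 instead of 966, as in the error below.
    print(dummy.shape, reconstruction.shape)
```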
The error is reported at line 27 of the "Train Model" cell (as in the tutorial notebook):
```
recons_loss = F.l1_loss(reconstruction.float(), images.float())
RuntimeError: The size of tensor a (964) must match the size of tensor b (966) at non-singleton dimension 3
```
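One way to see where 964 comes from (a rough trace, assuming each downsampling level floors the size to half and each upsampling level doubles it, with two such levels for `num_channels=(128, 256, 384)`):

```python
w = 966
for _ in range(2):
    w = w // 2   # 966 -> 483 -> 241 (the odd size loses a pixel here)
for _ in range(2):
    w = w * 2    # 241 -> 482 -> 964
print(w)         # 964, which no longer matches the original width of 966
```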
Using the pytorchinfo package, I was able to print the model summary and can see the discrepancy in the upsampling layer.
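A sketch of that check (an assumption: the comment appears to refer to the torchinfo package and its `summary` function):

```python
from torchinfo import summary

# Prints layer-by-layer output shapes; the decoder's upsampling blocks are
# where the spatial size stops matching the (1225, 966) input.
summary(autoencoderkl, input_size=(1, 3, 1225, 966), depth=4)
```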