Project-MONAI / GenerativeModels

MONAI Generative Models makes it easy to train, evaluate, and deploy generative models and related applications
Apache License 2.0

AutoEncoderKL output tensor dimension mismatch with Input #498

Open shankartmv opened 4 months ago

shankartmv commented 4 months ago

I am trying to train an AutoencoderKL model on RGB images with dimensions (3, 1225, 966). Here is the code that I use (similar to what is in tutorials/generative/2d_ldm/2d_ldm_tutorial.ipynb):

```python
autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)
autoencoderkl = autoencoderkl.to(device)
```

The error is reported at line 27 (Train Model, as in the tutorial notebook):

```python
recons_loss = F.l1_loss(reconstruction.float(), images.float())
```

```
RuntimeError: The size of tensor a (964) must match the size of tensor b (966) at non-singleton dimension 3
```
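For reference, a minimal sketch that reproduces the shape mismatch without the full training loop. This assumes the `generative.networks.nets.AutoencoderKL` import used in the tutorial, whose forward pass returns the reconstruction together with the latent mean and sigma:

```python
import torch
from generative.networks.nets import AutoencoderKL

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
).to(device)

# a single dummy RGB image with the problematic spatial size
images = torch.randn(1, 3, 1225, 966, device=device)

with torch.no_grad():
    reconstruction, z_mu, z_sigma = autoencoderkl(images)

print(images.shape)          # torch.Size([1, 3, 1225, 966])
print(reconstruction.shape)  # torch.Size([1, 3, 1224, 964]) -> mismatch vs. the input
```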

Using the torchinfo package, I was able to print the model summary, and the discrepancy shows up at the upsampling layers.

```
===================================================================================================
Layer (type:depth-idx)                   Input Shape            Output Shape           Param #
===================================================================================================
AutoencoderKL                            [1, 3, 1225, 966]      [1, 3, 1224, 964]      --
├─Encoder: 1-1                           [1, 3, 1225, 966]      [1, 8, 306, 241]       --
│    └─ModuleList: 2-1                   --                     --                     --
│    │    └─Convolution: 3-1             [1, 3, 1225, 966]      [1, 128, 1225, 966]    3,584
│    │    └─ResBlock: 3-2                [1, 128, 1225, 966]    [1, 128, 1225, 966]    295,680
│    │    └─Downsample: 3-3              [1, 128, 1225, 966]    [1, 128, 612, 483]     147,584
│    │    └─ResBlock: 3-4                [1, 128, 612, 483]     [1, 256, 612, 483]     919,040
│    │    └─Downsample: 3-5              [1, 256, 612, 483]     [1, 256, 306, 241]     590,080
│    │    └─ResBlock: 3-6                [1, 256, 306, 241]     [1, 384, 306, 241]     2,312,576
│    │    └─GroupNorm: 3-7               [1, 384, 306, 241]     [1, 384, 306, 241]     768
│    │    └─Convolution: 3-8             [1, 384, 306, 241]     [1, 8, 306, 241]       27,656
├─Convolution: 1-2                       [1, 8, 306, 241]       [1, 8, 306, 241]       --
│    └─Conv2d: 2-2                       [1, 8, 306, 241]       [1, 8, 306, 241]       72
├─Convolution: 1-3                       [1, 8, 306, 241]       [1, 8, 306, 241]       --
│    └─Conv2d: 2-3                       [1, 8, 306, 241]       [1, 8, 306, 241]       72
├─Convolution: 1-4                       [1, 8, 306, 241]       [1, 8, 306, 241]       --
│    └─Conv2d: 2-4                       [1, 8, 306, 241]       [1, 8, 306, 241]       72
├─Decoder: 1-5                           [1, 8, 306, 241]       [1, 3, 1224, 964]      --
│    └─ModuleList: 2-5                   --                     --                     --
│    │    └─Convolution: 3-9             [1, 8, 306, 241]       [1, 384, 306, 241]     28,032
│    │    └─ResBlock: 3-10               [1, 384, 306, 241]     [1, 384, 306, 241]     2,656,512
│    │    └─Upsample: 3-11               [1, 384, 306, 241]     [1, 384, 612, 482]     1,327,488
│    │    └─ResBlock: 3-12               [1, 384, 612, 482]     [1, 256, 612, 482]     1,574,912
│    │    └─Upsample: 3-13               [1, 256, 612, 482]     [1, 256, 1224, 964]    590,080
│    │    └─ResBlock: 3-14               [1, 256, 1224, 964]    [1, 128, 1224, 964]    476,288
│    │    └─GroupNorm: 3-15              [1, 128, 1224, 964]    [1, 128, 1224, 964]    256
│    │    └─Convolution: 3-16            [1, 128, 1224, 964]    [1, 3, 1224, 964]      3,459
===================================================================================================
Total params: 10,954,211
Trainable params: 10,954,211
Non-trainable params: 0
Total mult-adds (Units.TERABYTES): 3.20
===================================================================================================
Input size (MB): 14.20
Forward/backward pass size (MB): 26803.57
Params size (MB): 43.82
Estimated Total Size (MB): 26861.59
===================================================================================================
```
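The summary above is enough to see where the lost pixels come from: in this configuration each stride-2 downsampling halves the spatial size and floors odd values, while each upsampling exactly doubles it, so odd sizes cannot round-trip. A small arithmetic sketch that just reproduces the shapes printed in the summary (not the library code itself):

```python
def downsample(n: int) -> int:
    # encoder stride-2 downsampling: halve and floor (matches the summary shapes)
    return n // 2

def upsample(n: int) -> int:
    # decoder upsampling: exactly double
    return n * 2

for size in (1225, 966):
    latent = downsample(downsample(size))   # two downsampling levels
    restored = upsample(upsample(latent))   # two upsampling levels
    print(size, "->", latent, "->", restored)

# 1225 -> 306 -> 1224   (one pixel lost)
# 966  -> 241 -> 964    (two pixels lost: 966 / 2 = 483 is odd, so it is floored again)
```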

shankartmv commented 4 months ago

After some debugging I found a way to get around this problem. By resizing my images to 1024x720, the input and output shapes of my AutoencoderKL (obtained from torchinfo's summary) are consistent. Still, I would like to know the reason behind this error.

xmhGit commented 3 months ago

I believe this is caused by downsampling and upsampling data whose spatial dimensions are not a power of 2.

virginiafdez commented 4 weeks ago

I think this happens because you have downsampling stages that divide the spatial dimensions by 2 and matching upsampling stages that double them again, so unless you play around with the paddings and strides to make sure everything ends up the same size, you may run into errors. I would recommend simply padding your inputs to a size that is consistently divisible by 2 at every level.
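As an illustration of that suggestion (my own sketch, not part of the comment above): MONAI's DivisiblePad transform pads each spatial dimension up to the nearest multiple of a chosen factor. With two downsampling levels, a factor of 2**2 = 4 should be enough for the shapes to round-trip. Assuming images are loaded as channel-first tensors:

```python
import torch
from monai.transforms import DivisiblePad

# with two stride-2 downsamplings, spatial sizes should be divisible by 2**2 = 4
pad = DivisiblePad(k=4)

image = torch.randn(3, 1225, 966)   # channel-first RGB image with odd spatial sizes
padded = pad(image)

print(image.shape)   # (3, 1225, 966)
print(padded.shape)  # (3, 1228, 968) -> both spatial dims are now divisible by 4
```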