Project-MONAI / GenerativeModels

MONAI Generative Models makes it easy to train, evaluate, and deploy generative models and related applications
Apache License 2.0

AutoEncoderKL output tensor dimension mismatch with Input #498

Open shankartmv opened 4 months ago

shankartmv commented 4 months ago

I am trying to train an AutoencoderKL model on RGB images with dimensions (3, 1225, 966). Here is the code that I use (similar to what is in tutorials/generative/2d_ldm/2d_ldm_tutorial.ipynb):

```python
autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)
autoencoderkl = autoencoderkl.to(device)
```

The error is reported at line 27 (Train Model, as in the tutorial notebook):

```python
recons_loss = F.l1_loss(reconstruction.float(), images.float())
```

```
RuntimeError: The size of tensor a (964) must match the size of tensor b (966) at non-singleton dimension 3
```
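For reference, a minimal sketch that reproduces the shape mismatch without the full training loop. This assumes the `generative.networks.nets.AutoencoderKL` import used in the tutorial, whose forward pass returns the reconstruction together with the latent mean and sigma:

```python
import torch
from generative.networks.nets import AutoencoderKL

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
).to(device)

# a single dummy RGB image with the problematic spatial size
images = torch.randn(1, 3, 1225, 966, device=device)

with torch.no_grad():
    reconstruction, z_mu, z_sigma = autoencoderkl(images)

print(images.shape)          # torch.Size([1, 3, 1225, 966])
print(reconstruction.shape)  # torch.Size([1, 3, 1224, 964]) -> mismatch vs. the input
```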

Using the torchinfo package, I was able to print the model summary, and the discrepancy shows up at the upsampling layers.

```
===================================================================================================
Layer (type:depth-idx)                   Input Shape            Output Shape           Param #
===================================================================================================
AutoencoderKL                            [1, 3, 1225, 966]      [1, 3, 1224, 964]      --
├─Encoder: 1-1                           [1, 3, 1225, 966]      [1, 8, 306, 241]       --
│    └─ModuleList: 2-1                   --                     --                     --
│    │    └─Convolution: 3-1             [1, 3, 1225, 966]      [1, 128, 1225, 966]    3,584
│    │    └─ResBlock: 3-2                [1, 128, 1225, 966]    [1, 128, 1225, 966]    295,680
│    │    └─Downsample: 3-3              [1, 128, 1225, 966]    [1, 128, 612, 483]     147,584
│    │    └─ResBlock: 3-4                [1, 128, 612, 483]     [1, 256, 612, 483]     919,040
│    │    └─Downsample: 3-5              [1, 256, 612, 483]     [1, 256, 306, 241]     590,080
│    │    └─ResBlock: 3-6                [1, 256, 306, 241]     [1, 384, 306, 241]     2,312,576
│    │    └─GroupNorm: 3-7               [1, 384, 306, 241]     [1, 384, 306, 241]     768
│    │    └─Convolution: 3-8             [1, 384, 306, 241]     [1, 8, 306, 241]       27,656
├─Convolution: 1-2                       [1, 8, 306, 241]       [1, 8, 306, 241]       --
│    └─Conv2d: 2-2                       [1, 8, 306, 241]       [1, 8, 306, 241]       72
├─Convolution: 1-3                       [1, 8, 306, 241]       [1, 8, 306, 241]       --
│    └─Conv2d: 2-3                       [1, 8, 306, 241]       [1, 8, 306, 241]       72
├─Convolution: 1-4                       [1, 8, 306, 241]       [1, 8, 306, 241]       --
│    └─Conv2d: 2-4                       [1, 8, 306, 241]       [1, 8, 306, 241]       72
├─Decoder: 1-5                           [1, 8, 306, 241]       [1, 3, 1224, 964]      --
│    └─ModuleList: 2-5                   --                     --                     --
│    │    └─Convolution: 3-9             [1, 8, 306, 241]       [1, 384, 306, 241]     28,032
│    │    └─ResBlock: 3-10               [1, 384, 306, 241]     [1, 384, 306, 241]     2,656,512
│    │    └─Upsample: 3-11               [1, 384, 306, 241]     [1, 384, 612, 482]     1,327,488
│    │    └─ResBlock: 3-12               [1, 384, 612, 482]     [1, 256, 612, 482]     1,574,912
│    │    └─Upsample: 3-13               [1, 256, 612, 482]     [1, 256, 1224, 964]    590,080
│    │    └─ResBlock: 3-14               [1, 256, 1224, 964]    [1, 128, 1224, 964]    476,288
│    │    └─GroupNorm: 3-15              [1, 128, 1224, 964]    [1, 128, 1224, 964]    256
│    │    └─Convolution: 3-16            [1, 128, 1224, 964]    [1, 3, 1224, 964]      3,459
===================================================================================================
Total params: 10,954,211
Trainable params: 10,954,211
Non-trainable params: 0
Total mult-adds (Units.TERABYTES): 3.20
===================================================================================================
Input size (MB): 14.20
Forward/backward pass size (MB): 26803.57
Params size (MB): 43.82
Estimated Total Size (MB): 26861.59
===================================================================================================
```
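The summary above is enough to see where the lost pixels come from: in this configuration each stride-2 downsampling halves the spatial size and floors odd values, while each upsampling exactly doubles it, so odd sizes cannot round-trip. A small arithmetic sketch that just reproduces the shapes printed in the summary (not the library code itself):

```python
def downsample(n: int) -> int:
    # encoder stride-2 downsampling: halve and floor (matches the summary shapes)
    return n // 2

def upsample(n: int) -> int:
    # decoder upsampling: exactly double
    return n * 2

for size in (1225, 966):
    latent = downsample(downsample(size))   # two downsampling levels
    restored = upsample(upsample(latent))   # two upsampling levels
    print(size, "->", latent, "->", restored)

# 1225 -> 306 -> 1224   (one pixel lost)
# 966  -> 241 -> 964    (two pixels lost: 966 / 2 = 483 is odd, so it is floored again)
```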

shankartmv commented 4 months ago

After some debugging I found a way to get around this problem. By resizing my images to 1024x720, the input and output shapes of my AutoencoderKL (obtained from torchinfo's summary) are consistent. Still, I would like to know the reason behind this error.

xmhGit commented 3 months ago

I believe this is caused by downsampling and upsampling data whose spatial dimensions are not a power of 2.

virginiafdez commented 4 weeks ago

I think this happens because you have downsampling stages that divide the spatial dimensions by 2 and matching upsampling stages that double them again, so unless you play around with the paddings and strides to make sure everything ends up the same size, you may run into errors. I would recommend simply padding your inputs to a size that is consistently divisible by 2 at every level.
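As an illustration of that suggestion (my own sketch, not part of the comment above): MONAI's DivisiblePad transform pads each spatial dimension up to the nearest multiple of a chosen factor. With two downsampling levels, a factor of 2**2 = 4 should be enough for the shapes to round-trip. Assuming images are loaded as channel-first tensors:

```python
import torch
from monai.transforms import DivisiblePad

# with two stride-2 downsamplings, spatial sizes should be divisible by 2**2 = 4
pad = DivisiblePad(k=4)

image = torch.randn(3, 1225, 966)   # channel-first RGB image with odd spatial sizes
padded = pad(image)

print(image.shape)   # (3, 1225, 966)
print(padded.shape)  # (3, 1228, 968) -> both spatial dims are now divisible by 4
```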