Project-MONAI / GenerativeModels

MONAI Generative Models makes it easy to train, evaluate, and deploy generative models and related applications
Apache License 2.0

AutoEncoderKL output tensor dimension mismatch with Input #498

Open shankartmv opened 1 month ago

shankartmv commented 1 month ago

I am trying to train an AutoencoderKL model on RGB images with dimensions (3, 1225, 966). Here is the code that I use (similar to what's in tutorials/generative/2d_ldm/2d_ldm_tutorial.ipynb):

```python
autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)
autoencoderkl = autoencoderkl.to(device)
```

The error is reported at line 27 of the "Train Model" section (as in the tutorial notebook):

```
recons_loss = F.l1_loss(reconstruction.float(), images.float())
RuntimeError: The size of tensor a (964) must match the size of tensor b (966) at non-singleton dimension 3
```

Using the torchinfo package, I was able to print the model summary and spot the discrepancy in the upsampling layers.

```
===================================================================================================
Layer (type:depth-idx)               Input Shape           Output Shape          Param #
===================================================================================================
AutoencoderKL                        [1, 3, 1225, 966]     [1, 3, 1224, 964]     --
├─Encoder: 1-1                       [1, 3, 1225, 966]     [1, 8, 306, 241]      --
│ └─ModuleList: 2-1                  --                    --                    --
│ │ └─Convolution: 3-1               [1, 3, 1225, 966]     [1, 128, 1225, 966]   3,584
│ │ └─ResBlock: 3-2                  [1, 128, 1225, 966]   [1, 128, 1225, 966]   295,680
│ │ └─Downsample: 3-3                [1, 128, 1225, 966]   [1, 128, 612, 483]    147,584
│ │ └─ResBlock: 3-4                  [1, 128, 612, 483]    [1, 256, 612, 483]    919,040
│ │ └─Downsample: 3-5                [1, 256, 612, 483]    [1, 256, 306, 241]    590,080
│ │ └─ResBlock: 3-6                  [1, 256, 306, 241]    [1, 384, 306, 241]    2,312,576
│ │ └─GroupNorm: 3-7                 [1, 384, 306, 241]    [1, 384, 306, 241]    768
│ │ └─Convolution: 3-8               [1, 384, 306, 241]    [1, 8, 306, 241]      27,656
├─Convolution: 1-2                   [1, 8, 306, 241]      [1, 8, 306, 241]      --
│ └─Conv2d: 2-2                      [1, 8, 306, 241]      [1, 8, 306, 241]      72
├─Convolution: 1-3                   [1, 8, 306, 241]      [1, 8, 306, 241]      --
│ └─Conv2d: 2-3                      [1, 8, 306, 241]      [1, 8, 306, 241]      72
├─Convolution: 1-4                   [1, 8, 306, 241]      [1, 8, 306, 241]      --
│ └─Conv2d: 2-4                      [1, 8, 306, 241]      [1, 8, 306, 241]      72
├─Decoder: 1-5                       [1, 8, 306, 241]      [1, 3, 1224, 964]     --
│ └─ModuleList: 2-5                  --                    --                    --
│ │ └─Convolution: 3-9               [1, 8, 306, 241]      [1, 384, 306, 241]    28,032
│ │ └─ResBlock: 3-10                 [1, 384, 306, 241]    [1, 384, 306, 241]    2,656,512
│ │ └─Upsample: 3-11                 [1, 384, 306, 241]    [1, 384, 612, 482]    1,327,488
│ │ └─ResBlock: 3-12                 [1, 384, 612, 482]    [1, 256, 612, 482]    1,574,912
│ │ └─Upsample: 3-13                 [1, 256, 612, 482]    [1, 256, 1224, 964]   590,080
│ │ └─ResBlock: 3-14                 [1, 256, 1224, 964]   [1, 128, 1224, 964]   476,288
│ │ └─GroupNorm: 3-15                [1, 128, 1224, 964]   [1, 128, 1224, 964]   256
│ │ └─Convolution: 3-16              [1, 128, 1224, 964]   [1, 3, 1224, 964]     3,459
===================================================================================================
Total params: 10,954,211
Trainable params: 10,954,211
Non-trainable params: 0
Total mult-adds (Units.TERABYTES): 3.20
===================================================================================================
Input size (MB): 14.20
Forward/backward pass size (MB): 26803.57
Params size (MB): 43.82
Estimated Total Size (MB): 26861.59
===================================================================================================
```
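
Tracing the shapes above: the encoder maps 1225 → 612 → 306 and 966 → 483 → 241, while the decoder maps 306 → 612 → 1224 and 241 → 482 → 964. Each Downsample (a stride-2 convolution) floor-divides the spatial size, while each Upsample exactly doubles it, so the round trip only preserves a dimension that is divisible by 2**levels (here 2² = 4, since num_channels has three entries and hence two downsampling steps). A minimal sketch of this arithmetic (the roundtrip helper is illustrative, not part of the library):

```python
# Illustrative sketch: each encoder Downsample floor-divides H and W,
# each decoder Upsample exactly doubles them.
def roundtrip(size: int, levels: int = 2) -> int:
    for _ in range(levels):
        size //= 2          # stride-2 convolution: floor division
    for _ in range(levels):
        size *= 2           # upsampling: exact doubling
    return size

print(roundtrip(1225))      # 1224 -- one pixel lost per odd halving
print(roundtrip(966))       # 964
print(roundtrip(1024))      # 1024 -- divisible by 4, shape preserved
```

This reproduces the mismatch in the summary: an input of [1, 3, 1225, 966] comes back as [1, 3, 1224, 964].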

shankartmv commented 1 month ago

After some debugging I figured out a way to get around this problem. By resizing my images to 1024×720, I can see that the input and output shapes of my AutoencoderKL (obtained from the torchinfo summary) are consistent. But I would still like to know the reason behind this error.
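
If resizing is undesirable (e.g. to keep the original aspect ratio), another option could be to pad each image up to the next multiple of 4, for instance with MONAI's DivisiblePad transform. A sketch, assuming MONAI core is available (it is a dependency of GenerativeModels):

```python
import torch
from monai.transforms import DivisiblePad

# Pad H and W up to the next multiple of 4 (two downsampling levels -> 2**2).
pad = DivisiblePad(k=4)
image = torch.rand(3, 1225, 966)   # channel-first, as MONAI transforms expect
padded = pad(image)
print(padded.shape)                # (3, 1228, 968): survives the down/up round trip
```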

xmhGit commented 1 month ago

I believe this is caused by downsampling and then upsampling data whose spatial dimensions are not divisible by a power of 2 (here 4, one factor of 2 per downsampling level).
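
A quick way to check this explanation (a sketch using the configuration from the original report; the padded size 1228×968 is divisible by 4):

```python
import torch
from generative.networks.nets import AutoencoderKL

autoencoderkl = AutoencoderKL(
    spatial_dims=2,
    in_channels=3,
    out_channels=3,
    num_channels=(128, 256, 384),
    latent_channels=8,
    num_res_blocks=1,
    attention_levels=(False, False, False),
    with_encoder_nonlocal_attn=False,
    with_decoder_nonlocal_attn=False,
)

x = torch.rand(1, 3, 1228, 968)    # spatial dims divisible by 4
with torch.no_grad():
    reconstruction, z_mu, z_sigma = autoencoderkl(x)
print(reconstruction.shape)        # expected: torch.Size([1, 3, 1228, 968])
```

With both spatial dimensions divisible by 4, the reconstruction shape matches the input and F.l1_loss no longer raises.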