thwan11 opened this issue 1 month ago
@thwan11 Thank you for sharing this diagram! (You are right, the original diagram was wrong.)
Here is the revised diagram (note that the stride of the convolution in the upsampling layer has been changed to 1).
For resizing, we usually use bilinear upsampling (or, more rarely, nearest-neighbor upsampling). We then apply a standard convolution with stride == 1, which plays the same role as a ConvTranspose2d layer.
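As a minimal sketch (assuming PyTorch; the channel counts and input size are made up for illustration), the resize-convolution block and its ConvTranspose2d counterpart look like this:

```python
import torch
import torch.nn as nn

# Resize-convolution: double H and W with bilinear upsampling,
# then apply a standard convolution with stride == 1.
resize_conv = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),
    nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1),
)

# The transposed-convolution alternative that produces the same output shape.
deconv = nn.ConvTranspose2d(64, 32, kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 64, 16, 16)
print(resize_conv(x).shape)  # torch.Size([1, 32, 32, 32])
print(deconv(x).shape)       # torch.Size([1, 32, 32, 32])
```

The resize-convolution form is often preferred because it tends to avoid the checkerboard artifacts that a strided ConvTranspose2d can produce.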
Note: could we also add skip connections between the encoder and the decoder? A sketch of what I mean follows below.
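This is the kind of skip connection I mean, as a rough sketch (PyTorch assumed; the shapes are hypothetical): the encoder feature map is concatenated with the upsampled decoder feature map along the channel axis, U-Net style.

```python
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
# The convolution takes decoder channels plus skip channels as input.
conv = nn.Conv2d(64 + 32, 32, kernel_size=3, padding=1)

dec = torch.randn(1, 64, 16, 16)  # decoder feature map before upsampling
enc = torch.randn(1, 32, 32, 32)  # matching encoder feature map (skip source)

merged = torch.cat([up(dec), enc], dim=1)  # shape (1, 96, 32, 32)
print(conv(merged).shape)                  # torch.Size([1, 32, 32, 32])
```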
I have additional questions about the revised diagram.
Each layer in the decoder is written as Upsample(a, b). I understood a and b to be c_in and c_out, respectively. However, if that reading is correct, the channel dimensions of consecutive decoder layers in the diagram do not match. Did I misread this, or is it a typo?
Also, I infer that to restore z to x', we need to double the height and width at each resize step. Is this correct?
@defchltldn
Wow, thank you for checking! I have fixed the decoder channels in the issue above.
Yep, the resize operation (usually bilinear upsampling, occasionally nearest-neighbor) doubles the spatial size of the feature map at each step.
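To make that concrete, a quick sketch (PyTorch assumed; the latent shape here is made up):

```python
import torch
import torch.nn as nn

up = nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False)
z = torch.randn(1, 128, 8, 8)  # hypothetical latent feature map
print(up(z).shape)  # torch.Size([1, 128, 16, 16]): H and W doubled, channels unchanged
```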
I'm having difficulty understanding the feature map transformation during the upsampling process in the autoencoder architecture.
Specifically, it is unclear from the provided diagram how the feature map changes after each upsampling block. For instance, after the first upsampling step, it is not clear how the transformation proceeds from (64, 32) to the subsequent layers, with respect to both the channel count and the spatial dimensions.
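To make the question concrete, here is my best guess at the first block, as a sketch (PyTorch assumed; the 16x16 input size is hypothetical), reading (64, 32) as c_in = 64 and c_out = 32. Please correct me if this is not what the diagram intends:

```python
import torch
import torch.nn as nn

block = nn.Sequential(
    nn.Upsample(scale_factor=2, mode="bilinear", align_corners=False),  # (64, H, W) -> (64, 2H, 2W)
    nn.Conv2d(64, 32, kernel_size=3, stride=1, padding=1),              # (64, 2H, 2W) -> (32, 2H, 2W)
)

x = torch.randn(1, 64, 16, 16)
print(block(x).shape)  # torch.Size([1, 32, 32, 32])
```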
Could you please provide more details on the feature map size transformation and the specific upsampling method used?
Thank you.