[Lecture2-2][1004] Question about deconvolution

haeun0814 commented 5 days ago

On this lecture slide, z is used for "expanding" our image, and p' and s' apply to our "expanded" image. Transposed convolution performs upsampling and convolution at the same time, so I want to know why it is written this way.

[Image: screenshot 2024-10-04 220533]

What do the columns and rows of the matrix mean? Do a, b, and c mean the pixel values before upsampling? I am not sure how each column of the matrix corresponds to the dots in the figure...

Gamejoongsa commented 2 days ago

Unlike a normal convolution, a transposed convolution expands the spatial dimensions. As explained in the course materials, the goal is to re-expand an image whose size was reduced by a convolution layer. To do this, we set $z$ from the original stride $s$ and create $z$ empty positions between neighboring pixels. This step is referred to as “expanding”. After expanding the image by $z$, we perform a normal convolution. The padding and stride values, $p'$ and $s'$, apply to the image that was “expanded” by $z$ in the previous step. We do this so that the output of the transposed convolution layer has the same dimensions as the input of the corresponding normal convolution layer. Because the transposed convolution has to undo the downscaling caused by the stride of the normal convolution, these new parameters must be computed from the original ones so that the upsampled output has the right size.
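As a rough illustration of the expand-then-convolve steps described above, here is a minimal 1D NumPy sketch. It assumes the common convention $z = s - 1$, $p' = k - p - 1$ and $s' = 1$ for a kernel of size $k$, which may not match the lecture's notation exactly. The function names `conv1d` and `transposed_conv1d` are made up for this example.

```python
import numpy as np

def conv1d(x, w, stride=1, padding=0):
    """Plain 1D convolution (cross-correlation, no kernel flip) with zero padding."""
    x = np.pad(np.asarray(x, dtype=float), padding)
    w = np.asarray(w, dtype=float)
    k = len(w)
    n_out = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride : i * stride + k], w) for i in range(n_out)])

def transposed_conv1d(y, w, stride=1, padding=0):
    """Transposed convolution written as zero insertion plus a stride-1 convolution.

    Assumed convention (not taken from the lecture): z = s - 1, p' = k - p - 1, s' = 1.
    Requires padding <= k - 1 so that p' is not negative.
    """
    y = np.asarray(y, dtype=float)
    w = np.asarray(w, dtype=float)
    k = len(w)
    z = stride - 1              # number of zeros inserted between neighboring inputs
    p_prime = k - padding - 1   # padding applied to the expanded signal
    expanded = np.zeros(len(y) + (len(y) - 1) * z)
    expanded[:: z + 1] = y      # "expanding": original values with z zeros in between
    # Flipping the kernel makes this the exact transpose (adjoint) of conv1d.
    return conv1d(expanded, w[::-1], stride=1, padding=p_prime)

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
w = np.array([1.0, 0.0, -1.0])
y = conv1d(x, w, stride=2, padding=1)                  # length 5 -> length 3
x_up = transposed_conv1d(y, w, stride=2, padding=1)    # length 3 -> length 5
print(len(y), len(x_up))
```

With $k = 3$, $s = 2$ and $p = 1$, the normal convolution reduces a length-5 input to length 3, and the transposed convolution brings a length-3 signal back to length 5. Only the size is restored, not the original values.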
In this illustration, the variables $a$, $b$ and $c$ represent pixel values before upsampling. The matrix beneath each diagram gives the output values after upsampling and convolution, and its rows and columns correspond to the new pixel grid after the interpolation and convolution steps. I don't think the illustration and the matrix correspond directly to each other; I think they are different examples of how each process is carried out: the illustration shows visually how the interpolated input values are spread out, while the matrix shows it mathematically. For example, nearest neighbor interpolation copies the closest input value, while bilinear interpolation blends two adjacent pixels by taking half of each value. The illustration shows how these values fill the empty space and affect neighboring features. The matrix, on the other hand, I think partially shows what happens to the pixel values after the convolution is performed on the interpolated image.
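Since the slide itself is not reproduced here, the following is only one possible reading of such a matrix, written as a small NumPy sketch: each column belongs to one input value ($a$, $b$, $c$), each row to one output pixel, and each entry is the weight that input contributes to that output. The helper names `nearest_matrix` and `linear_matrix`, the 2x scale, and the grid alignment are assumptions made for this example.

```python
import numpy as np

def nearest_matrix(n_in, scale=2):
    """Matrix M such that M @ x repeats every input value `scale` times."""
    m = np.zeros((n_in * scale, n_in))
    for row in range(n_in * scale):
        m[row, row // scale] = 1.0
    return m

def linear_matrix(n_in):
    """Matrix for 2x linear interpolation in 1D: inputs are kept, midpoints are averaged."""
    m = np.zeros((2 * n_in - 1, n_in))
    for i in range(n_in):
        m[2 * i, i] = 1.0            # output sample that coincides with input i
    for i in range(n_in - 1):
        m[2 * i + 1, i] = 0.5        # midpoint: half of the left neighbor
        m[2 * i + 1, i + 1] = 0.5    # plus half of the right neighbor
    return m

x = np.array([1.0, 3.0, 5.0])        # stands in for (a, b, c)
print(nearest_matrix(3) @ x)         # pattern: a, a, b, b, c, c
print(linear_matrix(3) @ x)          # pattern: a, (a+b)/2, b, (b+c)/2, c
```

Under this reading, a learned transposed convolution is the same kind of matrix with trainable weights in place of the fixed 1 and 0.5 entries.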

haeun0814 commented 2 days ago

Thank you for answering!! I have one more question. Nearest neighbor interpolation copies the closest value. In a pixel grid, I think the closest value is likely to be one of the neighbors above, below, to the left, or to the right, so there may be more than one closest value. In such cases, how is the closest value chosen?

yjyoo3312 commented 1 day ago

@haeun0814 @Gamejoongsa Thank you for the nice questions and answers :)

Expanding from z: As mentioned earlier, we use the value of z, rather than a kernel, to expand the original image. This is done by inserting zeros into the image and then performing a standard convolution with p' and s' set to 1. The result is an expanded output feature map for the input.
Yes, thank you for the clarification: In this context, a, b, and c refer to the feature (or pixel) values before upsampling (in 1D). For more details, you can refer to the following link: https://distill.pub/2016/deconv-checkerboard/.
Regarding closest value selection: As you noted, this is a design choice. In the example below, the nearest value is selected from those equidistant from the reference point, defaulting to the leftmost value in the case of a tie.
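To make the tie-breaking rule concrete, here is a small 1D Python sketch. It is not the example referred to above, only a hypothetical illustration: the function names, the `tie` parameter, and the choice to align the first and last samples of the input and output grids are all assumptions.

```python
import math

def nearest_index(pos, n_in, tie="left"):
    """Index of the input sample closest to the real-valued position `pos`."""
    lo = math.floor(pos)
    frac = pos - lo
    if frac < 0.5 or (frac == 0.5 and tie == "left"):
        idx = lo          # left neighbor is closer, or it wins the tie
    else:
        idx = lo + 1      # right neighbor is closer, or it wins the tie
    return min(max(idx, 0), n_in - 1)   # clamp to the valid index range

def upsample_nearest(values, n_out, tie="left"):
    """Resample `values` onto n_out >= 2 points, with the endpoints of both grids aligned."""
    n_in = len(values)
    step = (n_in - 1) / (n_out - 1)
    return [values[nearest_index(i * step, n_in, tie)] for i in range(n_out)]

print(upsample_nearest(["a", "b", "c"], 5, tie="left"))
print(upsample_nearest(["a", "b", "c"], 5, tie="right"))
```

Here the output positions 0.5 and 1.5 lie exactly halfway between two inputs, so the two calls differ only at those positions. Libraries make their own fixed choice for such ties, which is why results can differ slightly between implementations.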

PiLab-CAU / ImageProcessing-2402
