PiLab-CAU / ImageProcessing-2402

Image processing repo
MIT License

[Lecture2-2][1008] Comparison between transposed convolution and deconvolution #9

Open jleem99 opened 1 day ago

jleem99 commented 1 day ago

While studying autoencoder architecture, I discovered that the similar terms "transposed convolution" and "deconvolution" have caused some confusion. I would like to clarify their differences and explore their operation in a real-world context. Some of my explanations are based on my understanding, so if there are any mistakes, please feel free to correct me!

In the traditional signal-processing sense, the term deconvolution refers to the mathematical process of reversing the effect of a convolution operation. However, computing the mathematical inverse of a convolution is challenging: you need to know the original kernel, and additional processing is required to mitigate noise.

However, in a machine learning context, what is often called "deconvolution" is actually transposed convolution, a practical way of performing a deconvolution-like operation. Unlike traditional deconvolution, you don't supply the original kernel. Instead, by the architectural design itself, the network learns a kernel that approximates an inverse of the convolution applied during the encoding phase. While this isn't deconvolution in the strict mathematical sense, transposed convolution is designed to achieve similar goals: reconstructing or upsampling data.

To understand the reasoning behind the name "transposed" convolution, first consider how convolution can be represented as matrix multiplication. In CNNs, a convolution can be unfolded into a matrix multiplication by converting the kernel into a sparse matrix; when multiplied by the (flattened) input, this matrix produces the convolved output. Transposed convolution instead multiplies by that sparse matrix's transpose, which reverses the spatial transformation of the convolution and maps the output back to input-like expanded dimensions. This is why the operation is termed "transposed" convolution.
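To make the matrix view concrete, here is a tiny 1-D sketch of my own (not from the lecture): a length-4 input, a length-3 kernel, stride 1, no padding. The convolution unfolds into a sparse 2x4 matrix C, and the transposed convolution multiplies by C.T, mapping the length-2 output back to length 4:

```python
import numpy as np

# 1-D toy example: input of length 4, kernel of length 3, stride 1, no padding.
x = np.array([1., 2., 3., 4.])
w = np.array([1., 0., -1.])

# Unfold the (cross-)correlation used in CNNs into a sparse 2x4 matrix C.
C = np.array([[w[0], w[1], w[2], 0.],
              [0.,   w[0], w[1], w[2]]])

y = C @ x        # ordinary convolution: length 4 -> length 2
x_up = C.T @ y   # transposed convolution: length 2 -> length 4

print(y)      # [-2. -2.]
print(x_up)   # [-2. -2.  2.  2.]  -- input-sized, but not the original x
```

Note that `x_up` has the input's shape but is not the original `x`, which is exactly why this is not a true deconvolution.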

*Figure from Islam, M. M. M., & Kim, J. (2019). Vision-based autonomous crack detection of concrete structures using a fully convolutional encoder–decoder network. Sensors, 19(19), 4251.*

I couldn't inspect the source code of transposed convolution because cuDNN is closed-source. However, I believe actual implementations of transposed convolution would resemble the lecture's explanation, where input data is spatially expanded by inserting zeros before going through the convolution operation.
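As a sanity check of that intuition (a sketch I wrote myself; the actual cuDNN kernels are certainly implemented differently and more efficiently), PyTorch's built-in transposed convolution can be reproduced by inserting stride − 1 zeros between the input elements, padding by kernel_size − 1, and then running an ordinary convolution:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(1, 1, 3)   # (batch, channels, length)
w = torch.randn(1, 1, 3)   # (in_channels, out_channels, kernel_size)

# Built-in transposed convolution with stride 2.
y_builtin = F.conv_transpose1d(x, w, stride=2)

# Manual version: insert stride-1 zeros between the input elements...
expanded = torch.zeros(1, 1, 2 * x.shape[-1] - 1)
expanded[..., ::2] = x
# ...pad both sides by kernel_size - 1, then run a regular convolution
# (PyTorch's conv1d is cross-correlation, hence the flipped kernel).
y_manual = F.conv1d(F.pad(expanded, (2, 2)), w.flip(-1))

print(torch.allclose(y_builtin, y_manual))  # True
```

The same zero-insertion trick generalizes to 2-D, which matches the spatial-expansion picture from the lecture.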

yjyoo3312 commented 4 hours ago

@jleem99 Thanks for the clarification on the deconvolution:)

In the Computer Vision field, the term "deconvolution" stems from the following paper:

Noh, Hyeonwoo, Seunghoon Hong, and Bohyung Han. "Learning deconvolution network for semantic segmentation." Proceedings of the IEEE international conference on computer vision. 2015.

As we discussed in class, they designed the convolutional operation for feature expansion by inserting zeros into the input feature maps. However, as you pointed out, their definition of deconvolution differs from the original concept.

In practice, computer vision researchers and popular libraries like PyTorch commonly refer to this operation as Conv2DTranspose. Nevertheless, there is still some controversy around the term. For instance, in an autoencoder, if a Conv2D C is used in the encoder and a Conv2DTranspose D is used in the decoder as its counterpart, the relationship C = Transpose(D) does not hold when the convolution operation is expressed as a 2D matrix, as you mentioned. (Rather, we can find a similar design in Restricted Boltzmann Machines and Deep Belief Networks with fully connected layers; that is outside the scope of this course, but I will explain it if you are interested.)
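A quick way to see this in PyTorch (a minimal sketch of my own; the layer sizes are arbitrary): the decoder's ConvTranspose2d owns a separate weight tensor that is learned independently, so nothing in training ties it to the transpose of the encoder's Conv2d.

```python
import torch
import torch.nn as nn

# Minimal encoder/decoder pair: the stride-2 Conv2d halves the spatial size,
# and the ConvTranspose2d restores it (output_padding resolves the ambiguity
# of which input size a stride-2 convolution came from).
enc = nn.Conv2d(1, 8, kernel_size=3, stride=2, padding=1)
dec = nn.ConvTranspose2d(8, 1, kernel_size=3, stride=2, padding=1, output_padding=1)

x = torch.randn(1, 1, 28, 28)
z = enc(x)        # -> (1, 8, 14, 14)
x_hat = dec(z)    # -> (1, 1, 28, 28), same spatial size as the input

# The two weight tensors are independent parameters; nothing enforces
# dec.weight to be the transpose of enc.weight.
print(enc.weight.shape, dec.weight.shape)
```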

Therefore, I believe that Conv2DTranspose is not the most accurate term for the operation, but I use it because it is widely accepted in the computer vision community. In our course (and generally in computer vision), Conv2DTranspose refers to the modified convolution introduced in class that expands the size of the input feature maps.