ArthurWalraven opened this issue 2 years ago
I believe that the "im2col" in NNlib.jl converts the input tensor into a large matrix, but we would like to make the changes in the kernel instead.

Understanding the im2col! function is already a good first objective, I think. The goal would be to have a function which takes as input a 3-dimensional array (corresponding to an "image") and a tensor (corresponding to the conv layer concerned), and which returns this so-called "large matrix". This requires in particular a good understanding of the ConvDims structure (from NNlib), which is the object that "stores" the dimensional information associated with a conv layer and its input (size, padding, channels, etc.).
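For intuition, here is a minimal sketch of the im2col idea in Python/NumPy (illustrative only, not NNlib's im2col!; it ignores padding, stride, and channels, and all names are mine):

```python
import numpy as np

def im2col(X, kh, kw):
    """Gather every (kh, kw) patch of X (valid positions, stride 1)
    as a row of a large matrix. Sketch, not NNlib's im2col!."""
    h, w = X.shape
    oh, ow = h - kh + 1, w - kw + 1
    cols = np.empty((oh * ow, kh * kw))
    for i in range(oh):
        for j in range(ow):
            cols[i * ow + j] = X[i:i + kh, j:j + kw].ravel()
    return cols

X = np.arange(16.0).reshape(4, 4)
K = np.ones((3, 3))
# Cross-correlation as a matrix product: each output pixel is the dot
# product between one patch row and the flattened kernel.
out = im2col(X, 3, 3) @ K.ravel()
```

The point of the "large matrix" is visible in the shapes: a 4×4 input becomes a 4×9 matrix, so the whole convolution collapses into one matrix-vector product.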
It's quite annoying actually. I'll open an issue about it soon if I keep on struggling.
Do not hesitate to ask questions in the ML channel of the Julia Slack.
> I believe that the "im2col" in NNlib.jl converts the input tensor into a large matrix, but we would like to make the changes in the kernel instead.
I was finally able to use the im2col! function as I wanted. By the way, it seems that for a convolutional layer with associated tensor W, the associated "large matrix" M can actually be obtained with the reshape function.

Now my current idea about the structure of the CNN -> DenseNN conversion we want is the following: each convolutional layer would be substituted by Chain(im2col, Dense(M), col2im).
> Each convolutional layer would be substituted by Chain(im2col, Dense(M), col2im).
I'm not entirely sure what this means, but it looks wrong. We want something similar, so let me be precise.

We're looking for a function conv2dense(l::Conv)::Dense such that Chain(quicktransform1, conv2dense(l), quicktransform2)(X) == l(X). The important difference is that the auxiliary functions (quicktransform1 and quicktransform2) should be simple and quick (probably reshapes), while conv2dense is more complicated and slow.
> I was finally able to use the im2col! function as I wanted. By the way, it seems that for a convolutional layer with associated tensor W, the associated "large matrix" M can actually be obtained with the reshape function.

Here you seem to describe the opposite, which is what Flux.Conv already implements.
It's easy to make this confusion. In fact, I was trying to prevent it with my previous message:

> I believe that the "im2col" in NNlib.jl converts the input tensor into a large matrix, but we would like to make the changes in the kernel instead.
> We're looking for a function conv2dense(l::Conv)::Dense such that Chain(quicktransform1, conv2dense(l), quicktransform2)(X) == l(X). The important difference is that the auxiliary functions (quicktransform1 and quicktransform2) should be simple and quick (probably reshapes), while conv2dense is more complicated and slow.
Yes I know what you mean, that's the direction we went last time. But actually reconsidering the problem I didn't see why im2col would be a slower operation than reshape. Yet as it is designed, it seems to be. I'm confused.
> Yes I know what you mean, that's the direction we went last time. But actually reconsidering the problem I didn't see why im2col would be a slower operation than reshape. Yet as it is designed, it seems to be. I'm confused.
We discussed it on a call and the confusion was addressed. Long story short, im2col increases the size of the input dramatically, adding many 0s and copies of the data.

After consideration, I think that the im2col function will not help us. Indeed, we had the impression that im2col was the bridge between the convolution operation and a FC layer, but that is not really the case. With im2col we get a matrix col and the convolution is computed as col * M (where M is a reshaped version of the tensor).
Now we would like something of the form M'*v, where v is a vector (associated to the input) and not a matrix. I thought that these two approaches were very similar, but on closer inspection I don't see how to go from one to the other. Also, after thinking about it, I think that converting im2col to a matrix operation is not particularly practical. In conclusion, I think that following the usual convolution pipeline is not a good direction, especially taking into account what @ArthurWalraven summarized in his last message.
The direct approach finally seems more reasonable to me. As a reminder, by direct approach I mean considering the convolution as a linear map $L : \mathbb{R}^{h_1\times w_1 \times c_1}\rightarrow \mathbb{R}^{h_2\times w_2 \times c_2}$ and obtaining the associated matrix by applying $L$ to the vectors of the canonical basis of $\mathbb{R}^{h_1\times w_1 \times c_1}$.
This method has a "brute force" side but has the merit of being simple to implement. We could then think about reducing the complexity of this operation by taking into account the different symmetries of the convolution (here again, parameters such as padding complicate things).
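A minimal sketch of this direct approach, written in Python/NumPy rather than Julia (scipy.signal.correlate2d stands in for the conv layer; the function name conv2dense and the small sizes are illustrative):

```python
import numpy as np
from scipy.signal import correlate2d

def conv2dense(K, h, w):
    """Build the (output_size, input_size) matrix of the linear map
    X -> correlate2d(X, K, 'valid') by applying it to every canonical
    basis element of R^(h x w). Brute force, but run only once."""
    oh, ow = h - K.shape[0] + 1, w - K.shape[1] + 1
    M = np.zeros((oh * ow, h * w))
    for idx in range(h * w):
        e = np.zeros(h * w)
        e[idx] = 1.0
        M[:, idx] = correlate2d(e.reshape(h, w), K, mode="valid").ravel()
    return M

K = np.array([[1., 2.], [3., 4.]])
X = np.arange(9.0).reshape(3, 3)
M = conv2dense(K, 3, 3)
# The dense matrix reproduces the convolution up to reshapes
# (the "quicktransforms" are just vec and its inverse):
assert np.allclose(M @ X.ravel(), correlate2d(X, K, mode="valid").ravel())
```

The two auxiliary transforms here are indeed only reshapes, which matches the conv2dense specification from earlier in the thread.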
Finally, there is also an alternative. I don't know if you have already seen this paper (An Equivalence of Fully Connected Layer and Convolutional Layer): https://arxiv.org/pdf/1712.01252.pdf. In it there is an explicit conversion algorithm. I haven't looked at it in detail yet, but it doesn't look that complicated.
> With im2col we get a matrix col and the convolution is computed as col * M (where M is a reshaped version of the tensor). Now we would like something of the form M'*v, where v is a vector (associated to the input) and not a matrix.
"the tensor" is ambiguous. So, if I got it right,
M
is, up to reshapings, the tensor associated to the convolution's kernel;M' = conv2dense(M)
(a matrix) like I defined above;col = im2col(x)
, where x
is an input tensor to the convolutional layer.Is that correct?
> The direct approach finally seems more reasonable to me. As a reminder, by direct approach I mean considering the convolution as a linear map ...
We discussed this approach months ago. Any (correct) direction will produce the same output, since a linear transformation has a unique matrix representation (in fixed bases). We don't care much about the efficiency of finding this matrix, because it only needs to be done once, at the end of the pruning.
> This method has a "brute force" side but has the merit of being simple to implement. We could then think about reducing the complexity of this operation by taking into account the different symmetries of the convolution (here again, parameters such as padding complicate things).
"Simple to implement" is the only way to go. The point of the issue is that we cannot write conv2dense
explicitly. So, we could obtain it from clever use of standard function (im2col
and col2im
), or, if this proves to be too hard, we go by transforming the canonical basis, as you said.
> Finally, there is also an alternative. I don't know if you have already seen this paper (An Equivalence of Fully Connected Layer and Convolutional Layer): https://arxiv.org/pdf/1712.01252.pdf

After a more careful reading I realized that this paper is misleading: what they propose in the end seems to be very close to the im2col approach... We can forget it.
> Finally, there is also an alternative. I don't know if you have already seen this paper (An Equivalence of Fully Connected Layer and Convolutional Layer): https://arxiv.org/pdf/1712.01252.pdf
>
> After a more careful reading I realized that this paper is misleading: what they propose in the end seems to be very close to the im2col approach... We can forget it.

It is. My previous message was wrong. I've deleted it.
There seems to be a major problem with this conv2dense approach. First, we notice the following: given a conv layer, the number of associated parameters is (very) approximately of the same order of magnitude as the size of the input. However, this leads to a matrix of size (input size) * (output size).

To give some examples:

- The first layer of LeNet5 would yield a matrix with 28×28×1 * 28×28×6 = 3,687,936 entries, while LeNet5 itself has about 60,000 parameters.
- Classical networks like ResNet and VGG would lead to matrices on the order of 10^12 entries.
It turns out that this argument is not quite sufficient to show that there is an impossibility, because since we are using SparseArrays what counts in the end is the number of non-zero entries in the matrix. Here is a rough and quick estimate. Let's go back to the case of a conv layer $L : \mathbb{R}^{h_1\times w_1 \times c_1}\rightarrow \mathbb{R}^{h_2\times w_2 \times c_2}$, with tensor $T$.

Consider a vector of the canonical basis $e_{i,j,k} \in \mathbb{R}^{h_1\times w_1 \times c_1}$. Given a kernel $K$, each non-zero entry of $K$ should, in the convolution process, meet $e_{i,j,k}$ once and only once, resulting in a non-zero entry at some position in the final matrix.
This gives in total (number of non-zero entries in T) * (size of the input) non-zero entries.

Obviously, it is likely that some of these positions will sometimes coincide. This is for example clearly the case when the size of the output is small compared to the size of the input. However, in general I have the impression that there will not be much overlap. A precise estimate would require more explicit formulas, but my guess is that (when the output has approximately the same size as the input) it should be of this order.

I will continue to try to make this reasoning more rigorous, to confirm whether there is a problem here, but in any case I wanted to bring this observation to your attention, because I am starting to get the impression that the Conv and FC layers are in fact incompatible.
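This count can be sanity-checked numerically. A Python/NumPy sketch (illustrative: a tiny single-channel 2D case with a hypothetical sparse 2×2 kernel, built with the same basis-vector construction as the direct approach above):

```python
import numpy as np
from scipy.signal import correlate2d

# Rough check of the estimate: nnz(conv2dense) vs nnz(K) * input_size.
h, w = 8, 8
K = np.array([[1., 0.], [0., 2.]])           # sparse 2x2 kernel, nnz = 2
oh, ow = h - 1, w - 1
M = np.zeros((oh * ow, h * w))
for idx in range(h * w):
    e = np.zeros(h * w)
    e[idx] = 1.0
    M[:, idx] = correlate2d(e.reshape(h, w), K, mode="valid").ravel()

nnz = np.count_nonzero(M)
upper = np.count_nonzero(K) * h * w          # the estimate from the thread
# Boundary effects (positions where the kernel falls off the input) make
# the true count somewhat smaller than the estimate, never larger.
```

Here each output row contains exactly nnz(K) non-zeros, so the true count is nnz(K) * output_size, which stays below the nnz(K) * input_size bound.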
I'll re-write your message while parsing it. This makes it easier for me to make sure I got everything.
> There seems to be a major problem with this conv2dense approach. Since conv2dense(ℓ::Conv) is a linear map between the same input and output spaces as ℓ, its associated matrix has size length(input) × length(output). This can be huge. For example:
>
> - The first layer of LeNet5 would yield a matrix with 28×28×1 * 28×28×6 = 3,687,936 entries. For comparison, LeNet5 itself has 60,000 parameters.
> - Classical networks like ResNet and VGG would lead to matrices on the order of 10^12 parameters.
This is not necessarily an issue since most of those parameters are 0s and would be ignored by
SparseArrays.jl
. However, some of the parameter overhead comes from extra copies of the weights in the original kernel.
Then you proceed to try to estimate the number of non-zeroes in conv2dense(ℓ) as a function of size(ℓ.weights).

You can assume ℓ.weights to be dense (all non-zeroes) to get an upper bound. In this case, it seems quite believable that the number of non-zero parameters in conv2dense(ℓ) is of the order of length(input) * length(ℓ.weights) (but you should check).
Isn't this the size of im2col(input)? If so, we don't really have a problem.
I'm trying to refresh my mind on this matter, so I'll mention why I thought leveraging im2col could be the way to go. Then you can tell me where the reasoning breaks.

Let $*$ denote the convolution operation and $\cdot$ the matrix product. Consider a single input tensor $X$ and a single convolution kernel $K$. We have that $$K * X = \mathrm{col2im}(\mathrm{im2col}(X) \cdot \mathrm{vec}(K)).$$ Using that $K * X = X * K$, it holds that $$K * X = \mathrm{col2im}(\mathrm{im2col}(K) \cdot \mathrm{vec}(X)).$$

Moreover, $\mathrm{im2col}$ is complicated to implement and slow to compute, while $\mathrm{col2im}$ is reasonably simple and fast.
My guess is that trouble arrives when considering the actual implementation of the convolution operation.
I don't see where the theory fails, so, until now, I believe that the only actual issue is that $\mathrm{im2col}$ may yield an impractically large matrix. Thus, we would need to build such a matrix already as a sparse one. Do you confirm, @damrvt?
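As a sanity check of the kernel-side identity, here is a 1D analogue in Python/SciPy (an assumption of mine: full convolution, where $K * X = X * K$ holds exactly; the 2D case with padding and stride is what makes this delicate in practice):

```python
import numpy as np
from scipy.linalg import convolution_matrix

# 1D analogue of the identity K * X = (matrix built from K) @ vec(X).
# Full convolution commutes, so the kernel-side matrix applied to the
# input vector gives the same result as convolving directly.
K = np.array([1., -1., 2.])
X = np.array([3., 0., 1., 4.])

A = convolution_matrix(K, len(X), mode="full")  # shape (len(K)+len(X)-1, len(X))
assert np.allclose(A @ X, np.convolve(K, X, mode="full"))
# A is a banded Toeplitz matrix: mostly zeros, so building it directly
# as a sparse matrix would avoid the impractically large dense form.
```

The banded structure of A is exactly why the "build it as a sparse matrix from the start" suggestion is plausible.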
Also, are you able to perform a convolution using NNlib.im2col and NNlib.col2im? I'd like to try all of this in small experiments.
After looking at all this closely it seems to me that everything you say is true. I tried to estimate the memory footprint of both sides and it is clear that it is smaller for the second operation than for the first.
Taking K of dimension (n,m) and X of dimension (h,w), we have:

- size(im2col(X)) ~ (nm, hw), and im2col(X) is dense;
- im2col(K) has nm * hw * (sparsity rate of K) non-zero entries.

So here we have a factor of (sparsity rate of K) between both, clearly advantageous when K is sparse.
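For concreteness, plugging hypothetical numbers into these two estimates (the 3×3 kernel, 32×32 single-channel input, and 10% kernel density are my assumptions, not values from the thread):

```python
# Comparing the two memory estimates from above.
# Kernel K: (n, m); input X: (h, w); kernel density s.
n, m, h, w = 3, 3, 32, 32
s = 0.1                                  # 10% of kernel entries non-zero

dense_im2col_X = n * m * h * w           # size(im2col(X)): fully dense
sparse_im2col_K = n * m * h * w * s      # non-zeros of im2col(K)

# The ratio between the two is exactly the kernel's density,
# so the kernel-side matrix wins whenever the kernel is sparse.
ratio = sparse_im2col_K / dense_im2col_X
```

This is only the single-filter case; as the next message points out, the comparison changes once multiple filters are taken into account.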
However, I write this answer so that we keep in mind that the symmetry you presented is broken when considering multiple filters. I was curious to estimate the memory footprint in this case and I present here for information the calculations I did (I do not draw any conclusion on what is the good direction in the end).
We consider a kernel tensor T of dimensions (n,m,p,q) and an input X of dimensions (h,w,c). We have:

- size(im2col(X)) ~ (nm, hwc);
- im2col(T) has nm * hw * pq * (sparsity rate of T) non-zero entries.

Here we see that both differ by a factor of pq * (sparsity rate of T). It seems that in general this factor is greater than 1. But one now has to take into account that, on the one hand, im2col(T) is computed before the use of the CNN and is only "read", whereas im2col(X) is written in memory at each inference of the CNN (and I don't master the technical details enough to see which of the two operations presents the most interesting compromise).
> After looking at all this closely it seems to me that everything you say is true. I tried to estimate the memory footprint of both sides and it is clear that it is smaller for the second operation than for the first.
I think it was supposed to be some sort of $\mathrm{reshape}$ where I put $\mathrm{col2im}$.
> We consider a kernel tensor T of dimensions (n,m,p,q) and an input X of dimensions (h,w,c).
Reasoning in terms of 2D convolutions, we should have p == c.
This hypothesis simplifies the subsequent analysis to "we will win if the density of T is less than 1/q", where q is the number of filters in T. We also need to take into account that, when sparsifying the kernel, at some point you should end up with empty filters, which will make the big block of columns of im2col(T) associated to that filter disappear (nm * hw * p non-zero entries, according to your estimates).
Finally, im2col(T) is highly redundant. Exploiting that would drop the number of non-zeros by a factor of hw! It seems feasible to do so, and that would be very powerful.
Even with those remarks, however, this direction seems too demanding/risky, so I'd recommend that we drop it. The lessons from this discussion suggest a simpler alternative: a new array type that behaves like Array except for reshapes into Vector and Matrix, which should (in addition to the reshape) convert the result into a SparseVector or SparseMatrix, respectively.

The idea is that Conv layers that have their weights in this new type of array should work normally with Flux.jl, since the convolution operation will apply a reshape to the special array (and mul(::Matrix, ::SparseMatrix) is already defined).

I'll further explain this in the future.
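As a rough analogue of this idea in Python/SciPy (the class and method names here are hypothetical, not the actual Julia design):

```python
import numpy as np
from scipy import sparse

class SparseOnReshape:
    """Hypothetical analogue of the proposed array type: behaves like a
    plain array, except that reshaping to 2D yields a sparse matrix, so
    the downstream dense-times-sparse product stays cheap."""
    def __init__(self, data):
        self.data = np.asarray(data)

    def reshape2d(self, rows, cols):
        # The reshape itself is cheap; converting to CSR keeps only
        # the non-zero weights of the (pruned) kernel.
        return sparse.csr_matrix(self.data.reshape(rows, cols))

W = SparseOnReshape(np.array([[0., 1.], [2., 0.], [0., 0.]]))
M = W.reshape2d(2, 3)        # sparse 2x3 with only 2 stored non-zeros
```

The design point is that callers never see the sparsity explicitly: any code path that reshapes the weights to a matrix automatically gets the sparse representation.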
> At this moment, we only support Dense layers. This is because SparseArrays.jl currently supports only 1D vectors and 2D matrices, so we could try transforming Conv layers into dense matrix multiplications. One example is to use the "im2col" conversion.