AInnervate / TinyNets.jl

A Julia library for neural network pruning.
MIT License

Add support for Conv layers #5

Open ArthurWalraven opened 2 years ago

ArthurWalraven commented 2 years ago

At the moment, we only support Dense layers. This is because SparseArrays.jl currently supports only vectors (1-D) and matrices (2-D), so we could try transforming Conv layers into plain matrix multiplications. One option is the "im2col" conversion.
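For reference, a minimal illustration of the SparseArrays.jl limitation mentioned above (my own example, not code from this package):

```julia
using SparseArrays

sparse(rand(3, 3))       # fine: a SparseMatrixCSC
sparse(rand(3))          # fine: a SparseVector
sparse(rand(3, 3, 3))    # fails: there is no sparse type for 3-D (or higher) arrays
```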

ArthurWalraven commented 1 year ago

I believe that the "im2col" in NNlib.jl converts the input tensor into a large matrix, but we would like to make the changes in the kernel instead.

damrvt commented 1 year ago

Understanding the im2col! function is already a first objective I think.

The objective would be to have a function that takes as input a 3-dimensional array (corresponding to an "image") and a tensor (corresponding to the Conv layer concerned) and returns this so-called "large matrix". This requires in particular a good understanding of the ConvDims structure (from NNlib), which is the object that stores the dimensional information associated with a Conv layer and its input (size, padding, channels, etc.).

It's quite annoying actually. I'll open an issue about it soon if I keep on struggling.

natema commented 1 year ago

Do not hesitate to ask questions in the ML channel of the Julia Slack.

damrvt commented 1 year ago

I believe that the "im2col" in NNlib.jl converts the input tensor into a large matrix, but we would like to make the changes in the kernel instead.

I was finally able to use the im2col! function as I wanted. By the way, it seems that for a convolutional layer with associated tensor W, the associated "large matrix" M can actually be obtained with the reshape function.

Now my current idea for the structure of the CNN -> DenseNN conversion we want is the following: each convolutional layer would be substituted by Chain(im2col, Dense(M), col2im).

ArthurWalraven commented 1 year ago

Each convolutional layer would be substituted by Chain(im2col, Dense(M), col2im).

I'm not entirely sure what this means, but it looks wrong. We want something similar, so I'll be precise.

We're looking for a function conv2dense(l::Conv)::Dense such that Chain(quicktransform1, conv2dense(l), quicktransform2)(X) == l(X). The important difference is that the auxiliary functions (quicktransform1 and quicktransform2) should be simple and quick (probably reshapes), while conv2dense is more complicated and slow.
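A hedged reading of this spec, assuming the quick transforms are plain reshapes (the layer and sizes below are only an example, and conv2dense does not exist yet):

```julia
using Flux

l = Conv((3, 3), 1 => 2; bias = false)          # example layer
X = rand(Float32, 8, 8, 1, 4)                   # (h, w, channels, batch)
outsize = size(l(X))[1:3]                       # spatial/channel shape of the output

quicktransform1(X) = reshape(X, :, size(X, 4))           # flatten each image in the batch
quicktransform2(Y) = reshape(Y, outsize..., size(Y, 2))  # back to (h2, w2, c2, batch)

# Goal, once conv2dense exists:
# Chain(quicktransform1, conv2dense(l), quicktransform2)(X) ≈ l(X)
```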

I was finally able to use the im2col! function as I wanted. By the way, it seems that for a convolutional layer with associated tensor W, the associated "large matrix" M can actually be obtained with the reshape function.

Here you seem to describe the opposite, which is what Flux.Conv already implements. It's an easy confusion to make. In fact, I was trying to prevent it with my previous message:

I believe that the "im2col" in NNlib.jl converts the input tensor into a large matrix, but we would like to make the changes in the kernel instead.

damrvt commented 1 year ago

We're looking for a function conv2dense(l::Conv)::Dense such that Chain(quicktransform1, conv2dense(l), quicktransform2)(X) == l(X). The important difference is that the auxiliary functions (quicktransform1 and quicktransform2) should be simple and quick (probably reshapes), while conv2dense is more complicated and slow.

Yes, I know what you mean; that's the direction we took last time. But on reconsidering the problem, I didn't see why im2col would be a slower operation than reshape. Yet, as it is designed, it seems to be. I'm confused.

ArthurWalraven commented 1 year ago

Yes, I know what you mean; that's the direction we took last time. But on reconsidering the problem, I didn't see why im2col would be a slower operation than reshape. Yet, as it is designed, it seems to be. I'm confused.

We discussed it on a call and the confusion was addressed. Long story short,

  1. In general, im2col increases the size of the input dramatically, adding many 0s and copies of the data (see the rough illustration after this list);
  2. Doing a similar operation to the kernel instead can speed up inference, but would break training. This is not an issue for us, though, since we only intend to do the true sparsification once all training is done.
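For concreteness, a back-of-the-envelope illustration of point 1 (my own example; the exact layout NNlib uses for its im2col matrix is an assumption here, but the entry count is the standard one):

```julia
using NNlib

# A 3×3 kernel over a 32×32 single-channel image, stride 1, no padding.
# The im2col matrix holds one copy of each 3×3 patch, one per output position.
x = rand(Float32, 32, 32, 1, 1)                       # (w, h, channels, batch)
w = rand(Float32, 3, 3, 1, 1)                         # (w, h, in_channels, out_channels)
y = conv(x, w)                                        # 30×30×1×1 output
n_input  = length(x)                                  # 1024 entries
n_im2col = prod(size(y)[1:2]) * prod(size(w)[1:3])    # 900 patches × 9 entries = 8100
@show n_input n_im2col                                # roughly an 8× blow-up
```
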
damrvt commented 1 year ago

After consideration, I think that the im2col function will not help us. Indeed, we had the impression that im2col was the bridge between the convolution operation and an FC layer, but that is not really the case. With im2col we get a matrix col and the convolution is computed as col * M (where M is a reshaped version of the tensor). Now we would like something of the form M'*v, where v is a vector (associated with the input) and not a matrix. I thought these two approaches were very similar, but on closer inspection I don't see how to go from one to the other. Also, after thinking about it, I believe that converting im2col into a matrix operation is not particularly practical. In conclusion, I think that following the usual convolution pipeline is not a good direction, especially taking into account what you summarized in your last message, @ArthurWalraven.

In the end, the direct approach seems more reasonable to me. As a reminder, by direct approach I mean to consider the convolution as a linear map $L : \mathbb{R}^{h_1\times w_1 \times c_1}\rightarrow \mathbb{R}^{h_2\times w_2 \times c_2}$ and to obtain the associated matrix by applying $L$ to the vectors of the canonical basis of $\mathbb{R}^{h_1\times w_1 \times c_1}$.

This method has a "brute force" side but has the merit of being simple to implement. We could then think about reducing the complexity of this operation by taking into account the different symmetries of the convolution (here again, parameters such as padding complicate things).
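For reference, a minimal sketch of this direct construction (hypothetical code, not in the package; it assumes the Conv layer was built with bias = false and σ = identity, and it takes the input size as an extra argument, since the layer alone does not determine it):

```julia
using Flux, SparseArrays

function conv2dense(l::Conv, insize::NTuple{3,Int})
    n = prod(insize)
    e = zeros(Float32, insize..., 1)          # a one-image batch
    m = length(l(e))                          # probe the output length
    I, J, V = Int[], Int[], Float32[]
    for j in 1:n
        e[j] = 1f0                            # j-th canonical basis tensor
        y = vec(l(e))                         # j-th column of the linear map
        for i in findall(!iszero, y)
            push!(I, i); push!(J, j); push!(V, y[i])
        end
        e[j] = 0f0
    end
    Dense(sparse(I, J, V, m, n), false, identity)
end
```

With the reshapes sketched earlier in the thread, the check would then be Chain(quicktransform1, conv2dense(l, insize), quicktransform2)(X) ≈ l(X).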

Finally, there is also an alternative. I don't know if you have already seen this paper (An Equivalence of Fully Connected Layer and Convolutional Layer) https://arxiv.org/pdf/1712.01252.pdf. In it there is an explicit conversion algorithm. I haven't looked at it in detail yet, but it doesn't look that complicated.

ArthurWalraven commented 1 year ago

With im2col we get a matrix col and the convolution is computed as col * M (where M is a reshaped version of the tensor). Now we would like something of the form M'*v, where v is a vector (associated with the input) and not a matrix.

"the tensor" is ambiguous. So, if I got it right,

  1. M is, up to reshapings, the tensor associated with the convolution's kernel;
  2. M' = conv2dense(M) (a matrix), as I defined above;
  3. col = im2col(x), where x is an input tensor to the convolutional layer.

Is that correct?

ArthurWalraven commented 1 year ago

In the end, the direct approach seems more reasonable to me. As a reminder, by direct approach I mean to consider the convolution as a linear map ...

We discussed this approach months ago. Any (correct) direction will produce the same matrix, since the matrix of a linear transformation is unique. We don't care much about the efficiency of finding this matrix, because it only needs to be done once, at the end of the pruning.

ArthurWalraven commented 1 year ago

This method has a "brute force" side but has the merit of being simple to implement. We could then think about reducing the complexity of this operation by taking into account the different symmetries of the convolution (here again, parameters such as padding complicate things).

"Simple to implement" is the only way to go. The point of the issue is that we cannot write conv2dense explicitly. So, we could obtain it from clever use of standard function (im2col and col2im), or, if this proves to be too hard, we go by transforming the canonical basis, as you said.

damrvt commented 1 year ago

Finally, there is also an alternative. I don't know if you have already seen this paper (An Equivalence of Fully Connected Layer and Convolutional Layer) https://arxiv.org/pdf/1712.01252.pdf

After a more careful reading, I realized that this paper is misleading: what they propose in the end seems to be very close to the im2col approach... We can forget it.

ArthurWalraven commented 1 year ago

Finally, there is also an alternative. I don't know if you have already seen this paper (An Equivalence of Fully Connected Layer and Convolutional Layer) https://arxiv.org/pdf/1712.01252.pdf

After a more careful reading, I realized that this paper is misleading: what they propose in the end seems to be very close to the im2col approach... We can forget it.

It is. My previous message was wrong. I've deleted it.

damrvt commented 1 year ago

There seems to be a major problem with this conv2dense approach. First, we notice the following: given a conv layer, the number of associated parameters is (very) roughly of the same order of magnitude as the size of the input. However, this approach leads to a matrix with (input size) × (output size) entries.

To give some examples:

It turns out that this argument is not quite sufficient to show that there is an impossibility, because, since we are using SparseArrays, what counts in the end is the number of non-zero entries in the matrix. Here is a rough and quick estimate.

Let's go back to the case of a conv layer $L : \mathbb{R}^{h_1\times w_1 \times c_1}\rightarrow \mathbb{R}^{h_2\times w_2 \times c_2}$ with tensor $T$. Consider a vector of the canonical basis $e_{i,j,k} \in \mathbb{R}^{h_1\times w_1 \times c_1}$. Given a kernel $K$, each non-zero entry of $K$ should, in the convolution process, meet $e_{i,j,k}$ once and only once, resulting in a non-zero entry at some position in the final matrix. This gives in total (number of non-zero entries in $T$) × (size of the input) non-zero entries.

Obviously, it is likely that some of these positions will sometimes coincide. This is, for example, obviously the case when the size of the output is small compared to the size of the input. However, in general I have the impression that there will not be much overlap. A precise estimate would require more explicit formulas, but my guess is that (when the output has approximately the same size as the input) the count should be of this order. I will keep trying to make this reasoning more rigorous to confirm whether there is a problem here, but in any case I wanted to bring this observation to your attention, because I am starting to get the impression that Conv and FC layers are in fact incompatible.
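Written as a bound, the estimate above would read (my formalization, not from the original comment; near-equality is expected when few contributions coincide):

$$\operatorname{nnz}\big(\text{matrix of } L\big) \;\le\; \operatorname{nnz}(T)\cdot h_1 w_1 c_1.$$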

ArthurWalraven commented 1 year ago

I'll re-write your message while parsing it. This makes it easier for me to make sure I got everything.

There seems to be a major problem with this conv2dense approach. Since conv2dense(ℓ::Conv) is a linear map between the same input and output spaces as ℓ, its associated matrix has length(input) × length(output) entries. This can be huge. For example:

  1. The first layer of LeNet5 would yield a matrix with 28×28×1 * 28×28×6 = 3,687,936 entries. For comparison, LeNet5 itself has 60,000 parameters.
  2. Classical networks like ResNet and VGG would lead to matrices of the order of 10^12 parameters.


This is not necessarily an issue since most of those parameters are 0s and would be ignored by SparseArrays.jl. However, some of the parameter overhead comes from extra copies of the weights in the original kernel.

Then you proceed to try to estimate the number of non-zeroes in conv2dense(ℓ) as a function of size(ℓ.weights).

You can assume ℓ.weights to be dense (all non-zeroes) to get an upper-bound. In this case, it seems quite believable that the number of non-zero parameters in conv2dense(ℓ) is of the order of length(input) * length(ℓ.weights) (but you should check). Isn't this the size of im2col(input)? If so, we don't really have a problem.
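A quick empirical way to check this, reusing the conv2dense sketch from earlier in the thread (same assumptions as there; the concrete layer below is just my example):

```julia
using Flux, SparseArrays

l = Conv((5, 5), 1 => 6; bias = false)     # LeNet5-like first layer, dense kernel
insize = (28, 28, 1)
D = conv2dense(l, insize)                  # sketched earlier in this thread
@show nnz(D.weight)                        # actual number of non-zeros
@show length(l.weight) * prod(insize)      # the estimate length(input) * length(ℓ.weights)
```

On this example the actual count should come out somewhat below the estimate (boundary effects), but of the same order of magnitude.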

ArthurWalraven commented 1 year ago

I'm trying to refresh my mind on this matter, so I'll mention why I thought leveraging im2col could be the way to go. Then you can tell me where the reasoning breaks.

Let $*$ denote the convolution operation and $\cdot$ the matrix product. Consider a single input tensor $X$ and a single convolution kernel $K$. We have that $$K * X = \mathrm{col2im}(\mathrm{im2col}(X) \cdot \mathrm{vec}(K)).$$ Using that $K * X = X * K$, it holds that $$K * X = \mathrm{col2im}(\mathrm{im2col}(K) \cdot \mathrm{vec}(X)).$$

Moreover, $\mathrm{im2col}$ is complicated to implement and slow to compute, while $\mathrm{col2im}$ is reasonably simple and fast.

My guess is that trouble arises when considering the actual implementation of the convolution operation.

  1. It may not even be a convolution. For example, it's quite common for those libraries not to flip the kernel before sliding it over the input. The resulting operation (cross-correlation) is not commutative (see the small check after this list);
  2. Commutativity comes from assuming infinite padding (convolution of functions/sequences). Obtaining it in practice would require padding the kernel (by a lot when it's much smaller than the input, which is the case of interest). This appears to be the "theoretical" source of the huge number of zeros we see in the other analyses.
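Regarding point 1, here is a tiny check one can run (my example; I only assume that NNlib's public flipped keyword toggles whether the kernel is flipped, without claiming which setting corresponds to which convention):

```julia
using NNlib

x = rand(Float32, 8, 8, 1, 1)   # (w, h, channels, batch)
k = rand(Float32, 3, 3, 1, 1)   # (w, h, in_channels, out_channels)

# The two settings of `flipped` should differ exactly by reversing the kernel
# along its spatial dimensions:
conv(x, k; flipped = true) ≈ conv(x, reverse(k; dims = (1, 2)); flipped = false)
```
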

I don't see where the theory fails, so, for now, I believe that the only actual issue is that $\mathrm{im2col}$ may yield an impractically large matrix. Thus, we would need to build such a matrix directly as a sparse one. Do you confirm, @damrvt?

Also, are you able to perform a convolution using NNlib.im2col and NNlib.col2im? I'd like to try all of this in small experiments.

damrvt commented 1 year ago

After looking at all this closely, it seems to me that everything you say is true. I tried to estimate the memory footprint of both sides, and it is clearly smaller for the second operation than for the first. Taking K of dimension (n, m) and X of dimension (h, w), we have

However, I am writing this answer so that we keep in mind that the symmetry you presented breaks down when considering multiple filters. I was curious to estimate the memory footprint in this case, and I present the calculations I did here for information (I draw no conclusion about which direction is the right one in the end).

We consider a tensor T of dimensions (n,m,p,q) and an input X of dimensions (h,w,c). We have

It seems that in general this factor is greater than 1. But one also has to take into account that im2col(T) is computed once, before the CNN is used, and is only "read", whereas im2col(X) is written to memory at every inference of the CNN (and I don't know the technical details well enough to see which of the two operations offers the better trade-off).

ArthurWalraven commented 1 year ago

After looking at all this closely, it seems to me that everything you say is true. I tried to estimate the memory footprint of both sides, and it is clearly smaller for the second operation than for the first.

I think it was supposed to be some sort of $\mathrm{reshape}$ where I put $\mathrm{col2im}$.

We consider a tensor T of dimensions (n,m,p,q) and an input X of dimensions (h,w,c).

Reasoning in terms of 2D convolutions, we should have p == c. This hypothesis simplifies the subsequent analysis to "we will win if the density of K is less than 1/q", where q is the number of filters in K. We also need to take into account that, when sparsifying the kernel, at some point you should end up with empty filters, which will make the big block of columns of im2col(K) associated with that filter disappear (nm * hw * p non-zero entries, according to your estimates). Finally, im2col(K) is highly redundant. Exploiting that would drop the number of non-zeros by a factor of hw! It seems feasible to do so, and that would be very powerful.

Even with those remarks, however, this direction seems too demanding/risky, so I'd recommend we drop it. The lessons from this discussion point to a simpler alternative:

Define a custom multidimensional array type that behaves like Julia's Array, except that reshapes into Vector and Matrix should (in addition to the reshape) convert the result into a SparseVector or SparseMatrix, respectively.

The idea is that Conv layers that have their weights in this new type of array should work normally with Flux.jl, since the convolution operation will apply a reshape to the special array (and mul(::Matrix, ::SparseMatrix) is already defined).

I'll further explain this in the future.
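For what it's worth, here is a minimal sketch of what such a type could look like (hypothetical name and details, not the actual implementation):

```julia
using SparseArrays

# An Array wrapper whose reshapes to 1-D/2-D come out sparse.
struct SparsifyingArray{T,N} <: AbstractArray{T,N}
    data::Array{T,N}
end

Base.size(A::SparsifyingArray) = size(A.data)
Base.IndexStyle(::Type{<:SparsifyingArray}) = IndexLinear()
Base.getindex(A::SparsifyingArray, i::Int) = A.data[i]
Base.setindex!(A::SparsifyingArray, v, i::Int) = (A.data[i] = v)

# Reshapes to a vector or a matrix return sparse results, so any subsequent
# matrix multiplication dispatches to the sparse methods.
Base.reshape(A::SparsifyingArray, dims::Tuple{Int})     = sparse(reshape(A.data, dims))
Base.reshape(A::SparsifyingArray, dims::Tuple{Int,Int}) = sparse(reshape(A.data, dims))

# e.g. reshape(SparsifyingArray(rand(2, 3, 4)), 6, 4) isa SparseMatrixCSC
```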