BVLC / caffe

Caffe: a fast open framework for deep learning.
http://caffe.berkeleyvision.org/

Group parameter #778

Closed · ardila closed this issue 10 years ago

ardila commented 10 years ago

I can't find anywhere what the meaning of the "group" parameter is. It seems to halve the number of locations a filter gets applied when set to 2?

layers {
  name: "conv2"
  type: CONVOLUTION
  bottom: "norm1"
  top: "conv2"
  blobs_lr: 1
  blobs_lr: 2
  weight_decay: 1
  weight_decay: 0
  convolution_param {
    num_output: 256
    pad: 2
    kernel_size: 5
    group: 2
  }
}
Yangqing commented 10 years ago

It was there to implement the grouped convolution in Alex Krizhevsky's paper: when group=2, the first half of the filters are only connected to the first half of the input channels, and the second half is only connected to the second half.
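To make the splitting concrete, here is a rough numpy sketch (not from the thread) of what a grouped convolution computes for a batch of 1 with no padding or stride; the sizes are arbitrary toy values. It follows Caffe's weight layout of (num_output, channels / group, kH, kW).

import numpy as np

# Toy grouped convolution: with group = 2, the first half of the filters
# sees only the first half of the input channels, and the second half of
# the filters sees only the second half, as described above.
num_output, channels, group, k = 8, 6, 2, 3
H = W = 10
x = np.random.randn(1, channels, H, W)
weights = np.random.randn(num_output, channels // group, k, k)

out_h, out_w = H - k + 1, W - k + 1
out = np.zeros((1, num_output, out_h, out_w))
out_per_group = num_output // group
in_per_group = channels // group

for g in range(group):
    xg = x[:, g * in_per_group:(g + 1) * in_per_group]        # this group's input channels
    wg = weights[g * out_per_group:(g + 1) * out_per_group]   # this group's filters
    for o in range(out_per_group):
        for i in range(out_h):
            for j in range(out_w):
                out[0, g * out_per_group + o, i, j] = np.sum(
                    xg[0, :, i:i + k, j:j + k] * wg[o])

print(weights.shape, out.shape)   # (8, 3, 3, 3) (1, 8, 8, 8)

For the conv2 layer quoted above (num_output: 256 with group: 2 over the 96-channel norm1 input in the reference AlexNet/CaffeNet), this layout gives a weight blob of shape (256, 48, 5, 5), which is also why the channel counts reported later in this thread look halved.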

karpathy commented 10 years ago

I just noticed this for the first time because my filter dimensions were all wrong. I originally assumed it was some kind of speed optimization to batch up the matrix multiply. Is there any reason to use group != 1? I thought it was originally done for practical considerations because Alex couldn't fit the entire network on a single GPU. Is there any evidence that indicates that this also works better? Or is it much faster? I'm surprised to see it in the default ImageNet models, given that the grouping is non-standard and not generally seen in more recent papers.

shelhamer commented 10 years ago

Grouping seems like a historical accident of the practical constraints of the time, but I don't know of a side-by-side comparison of group and no-group models. It would be nice to settle the question on the axes of task performance, speed + memory, and optimization time + hassle.

We're working on a service to host more models -- a no-group default ImageNet contender could show up there.

Grouping isn't common in models after AlexNet, as you noted:

[1] http://arxiv.org/abs/1311.2901
[2] http://arxiv.org/abs/1312.6229
[3] http://arxiv.org/abs/1404.5997
[4] http://arxiv.org/abs/1405.3531

diPDew commented 9 years ago

Any follow-ups regarding the "group" parameter? As of now, the bvlc_reference_caffenet/train_val.prototxt still has group set, e.g. here.

Based on the above discussion, will grouping eventually be removed from Caffe?

dangweili commented 8 years ago

I printed the bvlc_alexnet model's parameters and found that the weights of conv2 (48), conv4 (128) and conv5 (192) have only half the expected number of input channels. I think these numbers should be twice as large. Is there a problem here?

kli-casia commented 8 years ago

Oh God, I only noticed the group parameter in AlexNet today. Do other papers that use AlexNet also use the group parameter?

wk910930 commented 8 years ago

Group is helpful in my case, where I need to do channel-wise convolution.

myyan92 commented 8 years ago

Group is helpful in channel-wise deconvolution too. I think it just adds flexibility to the layer setup and should do no harm. Just delete it from the prototxt if you don't need it.
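For reference, a minimal numpy sketch (not from the thread) of the channel-wise case, i.e. group equal to the number of input and output channels, so that each output channel is computed from its own input channel with its own kernel; the sizes are arbitrary.

import numpy as np

# Channel-wise (depthwise) convolution: group == C == num_output, so output
# channel c is produced from input channel c alone, with its own k x k kernel.
N, C, H, W, k = 1, 4, 8, 8, 3
x = np.random.randn(N, C, H, W)
w = np.random.randn(C, 1, k, k)   # Caffe shape: (num_output, C / group, kH, kW) = (C, 1, k, k)

out = np.zeros((N, C, H - k + 1, W - k + 1))
for c in range(C):
    for i in range(out.shape[2]):
        for j in range(out.shape[3]):
            out[0, c, i, j] = np.sum(x[0, c, i:i + k, j:j + k] * w[c, 0])

print(out.shape)   # (1, 4, 6, 6)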

Jongchan commented 8 years ago

@wk910930 @myyan92

I am trying to use the 'group' option to perform channel-wise convolution. The intermediate layers perform 3x3x64x64 convolutions, for which I want to use the 'group: 64' option.

However, the gradient seems to explode, the memory requirement goes up, and training becomes slower. Are there any extra tips for using this option? I can't find any good examples that use the group option. Thanks in advance to anyone who can give me an answer :]

layer {
  name: "conv4"
  type: "Convolution"
  bottom: "conv3_1"
  top: "conv4"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 0.1
  }
  convolution_param {
    num_output: 64
    kernel_size: 3
    stride: 1
    pad: 1
    group: 64
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
    }
  }
}

layer {
  name: "relu4"
  type: "ReLU"
  bottom: "conv4"
  top: "conv4"
}
layer {
  name: "conv4_1"
  type: "Convolution"
  bottom: "conv4"
  top: "conv4_1"
  param {
    lr_mult: 1
  }
  param {
    lr_mult: 0.1
  }
  convolution_param {
    num_output: 64
    kernel_size: 1
    stride: 1
    pad: 0
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
    }
  }
}

layer {
  name: "relu4_1"
  type: "ReLU"
  bottom: "conv4_1"
  top: "conv4_1"
}
layer {
  name: "sum4"
  type: "Eltwise"
  bottom: "conv4_1"
  bottom: "conv3_1"
  top: "conv4_1"
  eltwise_param {
    operation: SUM
  }
}
wk910930 commented 8 years ago

In my case, I need to do the convolution on a single feature map (or channel) rather than on the whole input tensor, so each kernel doesn't see the other feature maps (or channels). And yes, it will increase memory usage and slow down training. I don't think the exploding gradient has anything to do with the use of 'group'; it probably comes from other parts of the network. I hope someone can share their experience.

Jongchan commented 8 years ago

@wk910930 Thank you for your answer.

Yes, after further trials, and with your kind comment, I am convinced that the exploding gradient has nothing to do with the 'group' parameter. In my case, I am implementing a network for super-resolution, which can easily have exploding gradients. I suspect that the weight initialization went wrong, but that's only a guess.

With proper gradient clipping and fewer layers, the network can converge, but the gradients' L2 norm is much higher than in the original network.

If I find out that this network-instability issue is related to the 'group' parameter, I will leave a comment here.

henzler commented 7 years ago

@wk910930 @Jongchan I also experience network instability when not using the group parameter in the Deconvolution layer. I have a network with convolution layers (downsampling) and Deconvolution layers (upsampling); for the Deconvolution layers I use group = num_output. If I do not use that, my net does not converge. Any ideas why?

ethanhe42 commented 7 years ago

As @Jongchan said, memory explodes when using the cuDNN convolution. I switched to the Caffe convolution engine, but it is too slow. I'm implementing Xception; the prototxt is available at https://github.com/yihui-he/Xception-caffe/blob/master/trainval.prototxt. Can anyone help?

mprat commented 7 years ago

@wk910930 If you use group=num_output_channels, then what you are really doing is learning a single shared per-channel filter to apply to all the channels, rather than learning num_output_channels separate channel-wise kernels.

see: https://github.com/BVLC/caffe/blob/master/src/caffe/layers/base_conv_layer.cpp#L167

myyan92 commented 7 years ago

@mprat I don't think this is correct. If you were right, there would be only one filter when num_input = num_output = group. In the code, space is allocated for num_output filters.
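One quick way to check is to load a net containing the group: 64 conv4 layer from earlier in pycaffe and print the weight shape; the deploy file name below is hypothetical.

import caffe

# 'conv4_deploy.prototxt' is a hypothetical deploy file containing the conv4
# layer shown above (num_output: 64, kernel_size: 3, group: 64).
net = caffe.Net('conv4_deploy.prototxt', caffe.TEST)

# Caffe stores conv weights as (num_output, channels / group, kH, kW), so this
# prints (64, 1, 3, 3): 64 separate 3x3 filters, one per input channel,
# rather than a single shared filter.
print(net.params['conv4'][0].data.shape)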

mprat commented 7 years ago

Hmm, you're right - good call.

prashnani commented 7 years ago

@dangweili : please see https://groups.google.com/forum/#!topic/caffe-users/ZkP84NcRv8I. Does that help?

QQQYang commented 7 years ago

@myyan92 @mprat If I want to train a single shared per-channel filter over all the channels, how should I configure the prototxt file? Or should I create a new layer to achieve this operation?

mprat commented 7 years ago

@QQQYang do you mean (1) a single per-channel filter that is applied to all channels, or (2) one 2D filter per channel? For (1) you need to create a new layer, as far as I know. For (2), you need to do something like this (the training parameters are omitted for simplicity):

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "input"
  top: "conv1"
  convolution_param {
    num_output: 64
    kernel_size: 3
  }
}

layer {
  name: "conv2"
  type: "Convolution"
  bottom: "conv1"
  top: "conv2"
  convolution_param {
    num_output: 64
    kernel_size: 3
    group: 64
  }
}

conv1 is a "standard" 3D convolution with 64 output channels. This means the input blob to conv2 has 64 input channels. In conv2 you set group = num_output (in this case 64, which is also the number of input channels), which means that one 2D kernel is applied to each of the 64 input channels to produce a single 64-channel output blob.

Hope that helps.

QQQYang commented 7 years ago

@mprat Thanks for your help. I want to implement the first one. I have finished the necessary forward and backward functions, but I do not know how to add the weights of a single shared per-channel filter to the learnable parameter list so that they get updated during training. Could you give me some advice?

myyan92 commented 7 years ago

@QQQYang I think you can achieve the first by reshaping the input blob to (N*C, 1, H, W), doing a convolution with output dim = 1, and reshaping back.
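A shape-level numpy sketch of that suggestion (no Caffe layers, just the reshape bookkeeping; sizes are arbitrary):

import numpy as np

# To apply ONE shared 2D kernel to every channel: fold the channel axis into
# the batch axis, run an ordinary 1-input / 1-output convolution, then unfold.
N, C, H, W, k = 2, 64, 16, 16, 3
x = np.random.randn(N, C, H, W)
shared_kernel = np.random.randn(k, k)    # the single learnable filter

# (N, C, H, W) -> (N*C, 1, H, W)
x_folded = x.reshape(N * C, 1, H, W)

# Plain valid-mode convolution with num_output = 1.
out_h, out_w = H - k + 1, W - k + 1
out = np.zeros((N * C, 1, out_h, out_w))
for n in range(N * C):
    for i in range(out_h):
        for j in range(out_w):
            out[n, 0, i, j] = np.sum(x_folded[n, 0, i:i + k, j:j + k] * shared_kernel)

# (N*C, 1, H', W') -> (N, C, H', W'): every channel went through the same kernel.
out = out.reshape(N, C, out_h, out_w)
print(out.shape)   # (2, 64, 14, 14)

In Caffe itself this should be expressible with Reshape layers around a standard Convolution layer with num_output: 1, as long as the channel count is fixed in the prototxt.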

QQQYang commented 7 years ago

@myyan92 That seems like a good choice. I will try it. Thanks.

QQQYang commented 6 years ago

@wk910930 @yihui-he If I set group=num_output in layer conv5_3 of VGG16, the memory keeps increasing and finally explodes when I fine-tune the model. Otherwise, everything is normal, so I am sure it is the convolution_param "group" that causes the memory overload. Have you ever encountered this issue? How can it be solved?

ethanhe42 commented 6 years ago

@QQQYang You're right. I've rewritten the group=num_output case for the conv layer; you can see my Caffe fork: https://github.com/yihui-he/caffe-pro

slothkong commented 6 years ago

I just found out the "group" parameter is the only thing stopping me from pruning a model. I tried the naive thing and changed the value of group to 1, then loaded the network in pycaffe... it crashed. Does anybody know whether there is a workaround for this, or do I need to retrain the model from scratch?

mprat commented 6 years ago

Depends on how you're doing the pruning. The group parameter changes the size of your weight vector, so you can't just change it to 1 and expect it to work. Your options are:

slothkong commented 6 years ago

After a week well spent fighting with Caffe, I opted for the fourth option: find another caffemodel, haha. But thanks @mprat, the idea of using concat layers didn't even cross my mind.
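For anyone who needs to keep the original weights rather than retrain, here is a rough pycaffe sketch of one possible workaround: expand each grouped weight blob into the mathematically equivalent group-free blob by zero-filling the connections a grouped convolution never makes, then copy the result into a copy of the prototxt with the group fields removed. The file names are hypothetical, and the layer names assume an AlexNet-style net where conv2, conv4 and conv5 use group: 2.

import numpy as np
import caffe

def ungroup_weights(w, group):
    """Expand a (num_output, C/group, kH, kW) grouped weight blob into the
    equivalent (num_output, C, kH, kW) blob for group = 1, with zeros in the
    positions a grouped convolution never connects."""
    num_output, c_per_group, kh, kw = w.shape
    out_per_group = num_output // group
    full = np.zeros((num_output, c_per_group * group, kh, kw), dtype=w.dtype)
    for g in range(group):
        rows = slice(g * out_per_group, (g + 1) * out_per_group)
        cols = slice(g * c_per_group, (g + 1) * c_per_group)
        full[rows, cols] = w[rows]
    return full

# Hypothetical files: the original grouped net and a copy of its prototxt
# with the group fields deleted.
grouped = caffe.Net('grouped_deploy.prototxt', 'grouped.caffemodel', caffe.TEST)
ungrouped = caffe.Net('ungrouped_deploy.prototxt', caffe.TEST)

for name in grouped.params:                    # copy every layer's blobs across
    blobs_g, blobs_u = grouped.params[name], ungrouped.params[name]
    if name in ('conv2', 'conv4', 'conv5'):    # the group: 2 layers in this net
        blobs_u[0].data[...] = ungroup_weights(blobs_g[0].data, group=2)
    else:
        blobs_u[0].data[...] = blobs_g[0].data
    for i in range(1, len(blobs_g)):           # biases (and any extra blobs) are unchanged
        blobs_u[i].data[...] = blobs_g[i].data
ungrouped.save('ungrouped.caffemodel')

The expanded model computes exactly the same function, just with a larger (mostly zero) weight blob, so it can then be pruned or fine-tuned without the group constraint.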

ZhuweiQin commented 6 years ago

Hi @slothkong, could you please explain a little more about how you dealt with the "group" parameter for pruning? I'm stuck on the same problem. Thank you~