IntelLabs / distiller

Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Apache License 2.0

Pruner without thinning can not detect successor BN or Conv layers? #174

Closed · bezorro closed this issue 5 years ago

bezorro commented 5 years ago

Hi, thanks for providing this great DNN compression framework. I am pruning MobileNetV1, whose basic block is:

import torch.nn as nn
import torch.nn.functional as F

class Block(nn.Module):
    """MobileNetV1 depthwise-separable block: depthwise 3x3 conv + pointwise 1x1 conv."""
    def __init__(self, in_planes, out_planes, stride=1):
        super(Block, self).__init__()
        # Depthwise 3x3 convolution
        self.conv1 = nn.Conv2d(in_planes, in_planes, kernel_size=3, stride=stride, padding=1, groups=in_planes, bias=False)
        self.bn1 = nn.BatchNorm2d(in_planes)
        # Pointwise 1x1 convolution
        self.conv2 = nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=1, padding=0, bias=False)
        self.bn2 = nn.BatchNorm2d(out_planes)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = F.relu(self.bn2(self.conv2(out)))
        return out

With the YAML file mobilenet.schedule.yaml

version: 1
pruners:
  pruner_base_5:
    class: 'L1RankedStructureParameterPruner'
    group_type: Filters
    desired_sparsity: 0.8
    weights: [layers.3.conv2.weight]

policies:
  - pruner:
      instance_name: pruner_base_5
    epochs: [1]

I load a baseline model, train it for several epochs with this schedule, and save a checkpoint. Then I load the saved checkpoint to see what is in it. I find that layers.3.conv2.weight is pruned as expected: only 0.2 of its filters are nonzero. But the successor BN and Conv layers (layers.3.bn2.weight, layers.4.conv1.weight) are not pruned; all of their channels are nonzero. I've read the pruning code, and it seems that pruners cannot detect successor BN or Conv layers and adjust their parameters accordingly before thinning. Does that mean that if I do not perform thinning, filter pruning cannot be performed correctly?
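
For reference, a check along these lines shows the per-filter sparsity in the checkpoint (a minimal sketch: the checkpoint filename, the 'state_dict' key, and the absence of a 'module.' prefix in the parameter names are assumptions about my setup):

import torch

# Count how many filters/channels of each tensor are still nonzero.
ckpt = torch.load('checkpoint.pth.tar', map_location='cpu')
state_dict = ckpt.get('state_dict', ckpt)

def nonzero_fraction(w):
    if w.dim() == 4:                                  # Conv weight: (out, in, kH, kW)
        alive = w.flatten(1).abs().sum(dim=1) != 0    # per-filter
    else:                                             # BN weight/bias: (channels,)
        alive = w != 0
    return alive.float().mean().item()

for name in ['layers.3.conv2.weight',    # the pruned layer
             'layers.3.bn2.weight',      # successor BN
             'layers.4.conv1.weight']:   # successor (depthwise) Conv
    print(name, nonzero_fraction(state_dict[name]))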

nzmora commented 5 years ago

Hi @bezorro,

You described the pruning behavior correctly: pruners create "direct" sparsity, but they don't actually change the network structure. Thinning is a process that can follow pruning, to perform "neural garbage collection" and physically remove structures and parameters from the network, based on the sparsity it sees in the network and the data-dependencies. I describe this a bit here and also in issue #73. This is a design choice, not a bug, but your question makes me wonder if zeroing dependent data (e.g. successor BN and Conv layers), without removing them, would help in pruning. Maybe it's worth trying.
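
For illustration, zeroing the dependent parameters by hand could look roughly like this (a plain PyTorch sketch, not Distiller code; it only propagates zeros and does not change any tensor shapes, which is what thinning would do):

import torch

def zero_dependents(conv, bn, next_conv):
    # Treat a filter as pruned if all of its weights are zero.
    with torch.no_grad():
        pruned = conv.weight.flatten(1).abs().sum(dim=1) == 0   # (out_channels,)

        # With gamma=0 and beta=0 the BN output of a pruned channel is exactly zero.
        bn.weight[pruned] = 0
        bn.bias[pruned] = 0

        # Zero the next conv's weights that read the pruned channels.
        if next_conv.groups == 1:
            next_conv.weight[:, pruned] = 0
        else:
            # Depthwise conv (as in the MobileNet Block above): filter i reads only channel i.
            next_conv.weight[pruned] = 0

# Hypothetical usage for the model above, assuming `layers` is an indexable container:
# zero_dependents(model.layers[3].conv2, model.layers[3].bn2, model.layers[4].conv1)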

Here's an example of adding an explicit "thinning" step (source):

extensions:
  net_thinner:
      class: 'FilterRemover'
      thinning_func_str: remove_filters
      arch: 'mobilenet'
      dataset: 'imagenet'

policies:
    # After completing the pruning, we perform network thinning and continue fine-tuning.
  - extension:
      instance_name: net_thinner
    epochs: [2]

Cheers, Neta

bezorro commented 5 years ago

Thanks for answering. It helps me a lot. But since pruners cannot mask successor BN and Conv layers, two problems occur:

  1. The sensitivity analysis in this repo only applies the mask to the current layer, ignoring the successor BN and Conv layers. Thus, the sensitivity analysis results may be wrong.
  2. In your example of the "thinning" step (source), thinning is applied at epoch [212], but the 'fc pruner' and 'fine_pruner' end at epoch [230]. So after epoch [212], are those pruners meaningless?

nzmora commented 5 years ago

Hi @bezorro,

Sorry for the late reply - I didn't see your comment earlier.

You are correct about (1). I don't like sensitivity analysis that much because it treats the weights/filters as i.i.d. (i.e. SA ignores the inter-dependencies between layers), so after I wrote the "thinning" feature I didn't go back to update the SA code. But the concern you raise is valid, especially in networks that have non-serial data-dependencies, where certain layers have inputs that depend on more than one layer (e.g. ResNet, DenseNet, etc.). In such cases, if you remove a filter from a layer, you may need to change more than one dependent BN and Conv layer (e.g. in ResNet there are some long dependency chains that include 7-8 dependent convolutions). If this is not clear, I can try to send you a diagram.
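
To make the data-dependency concrete for the simplest serial case (Conv -> BN -> 1x1 Conv), here is a rough sketch of which dimensions physical filter removal has to touch. It is not Distiller's thinning code, and it deliberately ignores the longer dependency chains mentioned above (it also assumes groups=1 and bias-free convolutions):

import torch
import torch.nn as nn

def remove_filters_serial(conv, bn, next_conv, keep):
    """Return new (conv, bn, next_conv) modules with only the filters in `keep` retained."""
    keep = torch.as_tensor(keep)
    n = len(keep)

    new_conv = nn.Conv2d(conv.in_channels, n, conv.kernel_size,
                         stride=conv.stride, padding=conv.padding, bias=False)
    new_conv.weight.data = conv.weight.data[keep].clone()           # out-channel dim shrinks

    new_bn = nn.BatchNorm2d(n)                                      # every per-channel BN tensor shrinks
    new_bn.weight.data = bn.weight.data[keep].clone()
    new_bn.bias.data = bn.bias.data[keep].clone()
    new_bn.running_mean = bn.running_mean[keep].clone()
    new_bn.running_var = bn.running_var[keep].clone()

    new_next = nn.Conv2d(n, next_conv.out_channels, next_conv.kernel_size,
                         stride=next_conv.stride, padding=next_conv.padding, bias=False)
    new_next.weight.data = next_conv.weight.data[:, keep].clone()   # only the *input*-channel dim changes
    return new_conv, new_bn, new_next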

Regarding (2): this indeed looks like a bug (but it is not :-)). Good catch nonetheless!
It's nuanced, so I will explain in-depth:

We do, however, perform implicit thinning of FC layers. What do I mean by that? Look at the difference between this and this:

I hope this helped, Neta

bezorro commented 5 years ago

Hi @nzmora, I read and tested your thinning code and I fully understand what you said. Thanks for your reply! It helps me a lot.