IntelLabs / distiller

Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Apache License 2.0

Thinning FC layers #73

Closed vinutah closed 4 years ago

vinutah commented 6 years ago

The thinning methods support only removing channels or filters of a CONV layer:

        # We are only interested in 4D weights (of Convolution layers)
        if param.dim() != 4:
            continue

[1] How about thinning FC layers? Even if you are not going to support it, could you describe what one should take care of when implementing, say, remove_rows( ) or remove_columns( ) corresponding to neuron pruning?

[2] It seems hard to simply extend the thinning_recipe approach, as it appears too tied to removing CONV structures. Any suggestions?

[3] Also, if we are thinning pruned PyTorch models, what could be the reason for an accuracy drop? Because we are strictly removing only zero structures, the math should be about the same and produce the same classifications. You seem to be anticipating a possible performance drop by preparing to thin even the gradient tensors.

nzmora commented 6 years ago

Hi Vinu,

I was thinking how to answer you, so it took me some time to write back to you.
A proper answer would require some time. Basically, "thinning" is a process that transforms a network structure according to some "script" which I called the "thinning recipe".
Leclerc et al. call this Neural Garbage Collection and describe it as:

Smallify decides on-the-fly which neurons to deactivate. Since Smallify deactivate a large fraction of neurons, we must dynamically remove these neurons at runtime to not unnecessarily impact network training time. We implemented a neural garbage collection method as part of our library which takes care of updating the necessary network layers as well as updating optimizer state to reflect the neuron removal.

Thinning in PyTorch is not trivial, mainly because the structure (layers, connectivity, and layer sizes) is described in code (compare this to Caffe's protobuf format, which is straightforward to manipulate programmatically).
To deal with this I came up with the concept of a "thinning recipe", which is a set of instructions on how to manipulate a PyTorch module "graph" once it is instantiated. We start by creating a model from scratch. The modules of a network (model) are created in the constructor. For example, this is TorchVision's ResNet constructor:

https://github.com/pytorch/vision/blob/fb63374cf76a54cc4a5dde361f1966afacd36cad/torchvision/models/resnet.py#L157-L166

So in the example above, when an object of type ResNet is instantiated, all of the modules in the network (sometimes referred to as the model) are instantiated, and we can go and change their configuration. For example, we can change the number of input channels in a Conv2d object. We can also remove channels from the weights tensors by making the tensors physically smaller. We do this by following instructions in a thinning recipe that we also stored in the checkpoint file. Now the model is ready to load the physically pruned weights (no zero filters or channels - they were physically removed and their smaller version was stored in the pickle checkpoint file). And then we are ready to go.
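
To make that concrete, here is a minimal sketch of what "executing" one such instruction looks like on an instantiated Conv2d. This is not the actual Distiller code; the surviving channel indices are made up:

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=3)

    # Suppose the thinning recipe says to keep only these output channels
    # (illustrative indices, not from a real recipe).
    keep = torch.tensor([0, 2, 5, 7])

    # Update the module's configuration...
    conv.out_channels = len(keep)
    # ...and make the parameter tensors physically smaller.
    conv.weight.data = torch.index_select(conv.weight.data, 0, keep)
    conv.bias.data = torch.index_select(conv.bias.data, 0, keep)

    print(conv.weight.shape)  # torch.Size([4, 3, 3, 3])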

This was the second part of "thinning" - the part that "executes" thinning instructions. The first part is more challenging, and it is creating the thinning recipe. The challenge here is to understand the dependencies between the layers. For example, imagine a Batch Normalization layer that follows a Conv2d layer. If we reduce the number of output channels (and remove weight filters), we also need to understand that there's a BN layer following and that it also requires changes in its configuration (number of inputs) and in its parameters (mu and sigma tensors). And these dependencies can get quite complex (we also need to transform tensors that the Optimizer may have; and the gradients - see [3] below). But this is made even more challenging, because PyTorch doesn't create a representation of the DNN graph in memory until you run a forward pass (this gives PyTorch the flexibility to run a different path on each forward pass). Without an in-memory graph of the network data-dependency (connectivity), we can't analyze layer dependencies. To overcome this challenge I convert the PyTorch model to an ONNX model (which does a forward pass) and then I "traverse" this ONNX object graph to construct a distiller.SummaryGraph object, which holds an in-memory representation of the ONNX graph. A distiller.SummaryGraph is used in other situations besides thinning - for example, to print the compute cost of each layer in the model.
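
As a toy illustration of the Conv2d/BN dependency (again, just a sketch with made-up layer sizes and indices, not Distiller's code - in Distiller the dependency itself is discovered from the graph, and the changes are recorded in the recipe first):

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(16, 32, kernel_size=3)
    bn = nn.BatchNorm2d(32)
    keep = torch.tensor([0, 1, 4, 9])   # surviving output channels (made up)

    # Shrink the Conv2d outputs.
    conv.out_channels = len(keep)
    conv.weight.data = torch.index_select(conv.weight.data, 0, keep)
    conv.bias.data = torch.index_select(conv.bias.data, 0, keep)

    # The dependent BatchNorm2d must follow: its configuration, its learnable
    # per-channel parameters, and its running statistics (mu and sigma).
    bn.num_features = len(keep)
    bn.weight.data = torch.index_select(bn.weight.data, 0, keep)
    bn.bias.data = torch.index_select(bn.bias.data, 0, keep)
    bn.running_mean = torch.index_select(bn.running_mean, 0, keep)
    bn.running_var = torch.index_select(bn.running_var, 0, keep)

    x = torch.randn(1, 16, 8, 8)
    print(bn(conv(x)).shape)            # torch.Size([1, 4, 6, 6])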

Phew!

Let's catch our breath ;-). That's a lot of stuff going on and this was, believe it or not, a short and perhaps unsatisfying explanation of everything involved in "thinning". I've been planning on properly documenting this because the code is hard to follow.

Now to your questions ;-):

[1] Thinning FC layers is important (e.g. in RNNs) and is in my queue.

[2] I believe the "thinning" method extends naturally to FC (and other) layers. You need to dive into the details (perhaps once I document them; perhaps by reverse-engineering the code) to see this.

[3] I'm not sure why we see a drop after thinning that requires further fine-tuning; this assumption might need to be re-examined. We transform the gradients for a different reason: say we removed a filter from the weights tensor of a Conv2d layer. We also need to remove that filter from the tensor storing the gradients of this weights tensor, otherwise we can't perform any backward passes (the weights and gradients tensors have the same shape, of course).
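
Here is a tiny sketch of that last point (not Distiller code; the indices are made up):

    import torch
    import torch.nn as nn

    conv = nn.Conv2d(3, 8, kernel_size=3)
    conv(torch.randn(1, 3, 8, 8)).sum().backward()     # populate conv.weight.grad

    keep = torch.tensor([0, 2, 5, 7])                  # surviving filters (made up)
    conv.out_channels = len(keep)
    # The weights and their gradients must be shrunk in lockstep, otherwise the
    # next backward pass fails on a shape mismatch.
    conv.weight.data = torch.index_select(conv.weight.data, 0, keep)
    conv.weight.grad = torch.index_select(conv.weight.grad, 0, keep)
    conv.bias.data = torch.index_select(conv.bias.data, 0, keep)
    conv.bias.grad = torch.index_select(conv.bias.grad, 0, keep)

    assert conv.weight.shape == conv.weight.grad.shape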

This was long but short.
Cheers Neta

vinutah commented 6 years ago

Thank you, Neta for your above description and answers.

I am getting used to your style of designing data structures to solve model compression problems, i.e. "prepare-first, execute-next". There have been days when I felt some data structures were over-designed, but after thinking more about all the dependencies, it looks like we need such rich structures to hold the information necessary to solve the problem at hand.

Thanks to your in-code documentation and the description above, I understand your approach to thinning. I like to think of your design as a graph algorithm: thinning requires graph traversal, so you built a proper, reliable, and scalable graph data structure to operate on.

I have a few follow-up questions on certain low-level aspects of your design. I roughly see why you might have made these decisions, but clarification from you will, I hope, benefit our community.

[1] Necessity of sub-recipes.

ThinningRecipe = namedtuple('ThinningRecipe', ['modules', 'parameters'])

Here are my understanding/notes on your choice of "modules", and why you might have considered having two sub-recipes for a ThinningRecipe.
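
To make my reading concrete, here is roughly how I picture a recipe instance for a Conv2d followed by a BatchNorm2d. The layer names, shapes and indices are made up; this is just my mental model, not code from the repo:

    from collections import namedtuple
    import torch

    ThinningRecipe = namedtuple('ThinningRecipe', ['modules', 'parameters'])

    keep = torch.tensor([0, 2, 5])
    recipe = ThinningRecipe(
        # 'modules' holds configuration changes to apply to module attributes
        # (the "reduce" part), keyed by module name.
        modules={
            'features.conv1': {'out_channels': 3},
            'features.bn1':   {'num_features': 3},
        },
        # 'parameters' holds per-parameter directives saying which slices of
        # each tensor to keep (the "remove" part): (dim_to_select_along, indices).
        parameters={
            'features.conv1.weight': [(0, keep)],
            'features.conv1.bias':   [(0, keep)],
            'features.bn1.weight':   [(0, keep)],
            'features.bn1.bias':     [(0, keep)],
        },
    )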

[2] The case where Linear follows a Conv2d

This is a slightly simplified view of what you do when executing param_directives for the case of a Linear following a Conv2d. Could you please explain the logic behind if len(directive) == 4:? (My tentative reading is sketched after the snippet below.)

    for param_name, param_directives in recipe.parameters.items():
        param = model_find_param(model, param_name)
        assert param is not None
        for directive in param_directives:
            dim         = directive[0]
            indices     = directive[1]
            len_indices = indices.nelement()

            if len(directive) == 4:

                selection_view = param.view(*directive[2])

                if param.data.size(dim) != len_indices:
                    param.data = torch.index_select(selection_view, dim, indices)

                param.data = param.view(*directive[3])
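
If I read it correctly, the 4-element directive seems to cover the case where the 2D FC weights first have to be viewed as a 4D tensor (so that conv-channel indices can be selected) and then viewed back to 2D. A standalone toy version of that step, with made-up shapes (this is my own reading, not the library's code):

    import torch

    # Toy shapes: a Conv2d producing 8 channels of 2x2 feature maps, flattened
    # into the 32 inputs of a Linear layer with 10 outputs.
    fc_weight = torch.randn(10, 8 * 2 * 2)

    keep = torch.tensor([0, 3, 5])              # surviving conv channels (made up)
    view_4d = (10, 8, 2, 2)                     # directive[2]: 4D selection view
    view_2d = (10, len(keep) * 2 * 2)           # directive[3]: 2D shape to restore

    selection_view = fc_weight.view(*view_4d)
    fc_weight = torch.index_select(selection_view, 1, keep)   # dim 1 = channels
    fc_weight = fc_weight.contiguous().view(*view_2d)
    print(fc_weight.shape)                      # torch.Size([10, 12])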

[3] Handling of BN layer separately.

[4] Handling gradients

nzmora commented 6 years ago

Hi Vinu, I'm currently traveling so it will take me a few days to answer. Cheers Neta

vinutah commented 5 years ago

Sure Neta, No Problem.

[1] I was able to add support for thinning FC layers fairly easily, using the modules {} sub-recipe to "Reduce" dimensions and the parameters {} sub-recipe to "Remove" neurons, via the append APIs that access these recipes. A simplified sketch of the core idea is below.

[2] My changes seem to work fine for small networks; I test them with SummaryGraph and a forward pass.

[3] I notice that the accuracy remains the same too; I still have not seen a misclassification.
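
For reference, the core of what the FC thinning amounts to is roughly the following (a self-contained simplification with made-up layer sizes and indices, not my actual patch):

    import torch
    import torch.nn as nn

    fc1 = nn.Linear(20, 16)
    fc2 = nn.Linear(16, 10)

    # Neurons of fc1 that survive pruning (made-up indices).
    keep = torch.tensor([0, 1, 4, 7, 9, 12, 15])

    # "Remove": drop rows of fc1 (its output neurons) and the matching columns
    # of fc2 (its inputs).
    fc1.weight.data = torch.index_select(fc1.weight.data, 0, keep)
    fc1.bias.data = torch.index_select(fc1.bias.data, 0, keep)
    fc2.weight.data = torch.index_select(fc2.weight.data, 1, keep)

    # "Reduce": update the module configuration to match.
    fc1.out_features = len(keep)
    fc2.in_features = len(keep)

    x = torch.randn(4, 20)
    print(fc2(fc1(x)).shape)   # torch.Size([4, 10])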

nzmora commented 5 years ago

Hi Vinu,

Sorry for taking such a long time - I was away at a conference and on vacation. Do you want to share your new code for thinning FC layers?

Cheers, Neta

vinutah commented 5 years ago


Sure Neta, I will do it this week.

levzlotnik commented 4 years ago

Hi,

Closing this due to staleness.