Hi Vinu,
I was thinking about how to answer you, so it took me some time to write back.
A proper answer would require some time. Basically, "thinning" is a process that transforms a network structure according to some "script" which I called the "thinning recipe".
Leclerc et al. call this Neural Garbage Collection and describe it as:
Smallify decides on-the-fly which neurons to deactivate. Since Smallify deactivate a large fraction of neurons, we must dynamically remove these neurons at runtime to not unnecessarily impact network training time. We implemented a neural garbage collection method as part of our library which takes care of updating the necessary network layers as well as updating optimizer state to reflect the neuron removal.
Thinning in PyTorch is not trivial, mainly because the structure (layers, connectivity and layer sizes) is all described in code (compare this to Caffe's protobuf format, which is straightforward to manipulate programmatically).
To deal with this I came up with the concept of a "thinning recipe", which is a set of instructions on how to manipulate a PyTorch module "graph" once it is instantiated. We start by creating a model from scratch. The modules of a network (model) are created in the constructor. For example, this is TorchVision's ResNet constructor:
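(An abridged sketch of that constructor; the exact code and arguments differ a bit between TorchVision versions.)
import torch.nn as nn

class ResNet(nn.Module):
    # Abridged; see torchvision.models.resnet for the real thing.
    # _make_layer (defined elsewhere in the class) builds the residual stages.
    def __init__(self, block, layers, num_classes=1000):
        super(ResNet, self).__init__()
        self.inplanes = 64
        self.conv1 = nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.bn1 = nn.BatchNorm2d(64)
        self.relu = nn.ReLU(inplace=True)
        self.maxpool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)
        self.layer1 = self._make_layer(block, 64, layers[0])
        self.layer2 = self._make_layer(block, 128, layers[1], stride=2)
        self.layer3 = self._make_layer(block, 256, layers[2], stride=2)
        self.layer4 = self._make_layer(block, 512, layers[3], stride=2)
        self.avgpool = nn.AvgPool2d(7, stride=1)
        self.fc = nn.Linear(512 * block.expansion, num_classes)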
So in the example above, when an object of type ResNet is instantiated, all of the modules in the network (sometimes referred to as the model) are instantiated, and we can go and change their configuration. For example, we can change the number of input channels in a Conv2d object. We can also remove channels from the weights tensors by making the tensors physically smaller. We do this by following the instructions in a thinning recipe that we also stored in the checkpoint file.
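A minimal sketch of that kind of surgery (illustrative only, not Distiller's actual code):
import torch
import torch.nn as nn

# Illustrative only: physically shrink a Conv2d from 64 to 32 input channels,
# keeping only the listed channel indices.
conv = nn.Conv2d(in_channels=64, out_channels=128, kernel_size=3)
keep = torch.arange(32)            # indices of the input channels to keep
conv.in_channels = 32              # update the module's configuration
# Conv2d weights have shape (out_channels, in_channels, kH, kW), so input
# channels live along dim 1:
conv.weight.data = torch.index_select(conv.weight.data, 1, keep)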
Now the model is ready to load the physically pruned weights (no zero filters or channels - they were physically removed and their smaller versions were stored in the pickle checkpoint file).
And then we are ready to go.
This was the second part of "thinning" - the part that "executes" thinning instructions. The first part, creating the thinning recipe, is more challenging. The challenge here is to understand the dependencies between the layers. For example, imagine a Batch Normalization layer that follows a Conv2d layer. If we reduce the number of output channels (and remove weight filters), we also need to understand that there's a BN layer following, and that it also requires changes in its configuration (number of inputs) and in its parameters (the mu and sigma tensors). And these dependencies can get quite complex (we also need to transform tensors that the Optimizer may hold, and the gradients - see [3] below).
But this is made even more challenging, because PyTorch doesn't create a representation of the DNN graph in memory until you run a forward pass (this gives PyTorch the flexibility to run a different path on each forward pass). Without an in-memory graph of the network's data-dependencies (connectivity), we can't analyze layer dependencies. To overcome this challenge I convert the PyTorch model to an ONNX model (which performs a forward pass) and then "traverse" this ONNX object graph to construct a distiller.SummaryGraph object, which holds an in-memory representation of the ONNX graph. A distiller.SummaryGraph is used in other situations besides thinning - for example, to print the compute cost of each layer in the model.
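Roughly, the usage looks like this (a hedged sketch; the exact API and module names may differ between Distiller versions):
import torch
import torchvision
import distiller

model = torchvision.models.resnet18()

# Building the SummaryGraph requires a dummy input, because the forward
# pass / ONNX trace is what exposes the network's connectivity:
dummy_input = torch.randn(1, 3, 224, 224)
sgraph = distiller.SummaryGraph(model, dummy_input)

# The graph can then be queried for data-dependencies, e.g. the Conv/Gemm
# successors of a given layer ('layer1.0.conv1' is just an example name):
successors = sgraph.successors_f('layer1.0.conv1', ['Conv', 'Gemm'])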
Phew!
Let's catch our breath ;-). That's a lot of stuff going on and this was, believe it or not, a short and perhaps unsatisfying explanation of everything involved in "thinning". I've been planning on properly documenting this because the code is hard to follow.
Now to your questions ;-):
[1] Thinning FC layers is important (e.g. in RNNs) and is in my queue.
[2] I believe the "thinning" method is naturally extensible to FC (and other) layers. You need to dive into the details (perhaps once I document them; perhaps by reverse-engineering the code) to see this.
[3] I'm not sure why we see a drop after thinning, which requires further fine-tuning. This assumption might need to be re-examined. We transform the gradients for a different reason. Say we removed a filter from a weights tensor of a Conv2d layer. Well, we also need to remove a filter from the tensor storing the gradients of this weights tensor, otherwise we can't perform any backward passes (the weights and the gradients tensors have the same shape, of course). See the sketch below.
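A minimal sketch of that shape bookkeeping (illustrative only, not Distiller's code):
import torch
import torch.nn as nn

# Illustrative only: when filters are removed from a Conv2d weight, an existing
# gradient tensor must be shrunk the same way, or the shapes no longer match.
conv = nn.Conv2d(16, 64, kernel_size=3)
conv.weight.grad = torch.zeros_like(conv.weight)   # pretend a backward pass already ran
keep = torch.arange(34)                            # indices of the filters we keep

new_weight = torch.index_select(conv.weight.data, 0, keep)  # filters live along dim 0
new_grad = torch.index_select(conv.weight.grad, 0, keep)
conv.weight.data = new_weight
conv.weight.grad = new_grad
conv.out_channels = 34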
This was long but short.
Cheers
Neta
Thank you, Neta, for the description and answers above.
I am getting used to your style of designing data structures to solve model compression problems, a.k.a. "prepare-first-execute-next". There have been days when I felt some data structures were over-designed, but thinking more about all the dependencies, it looks like we need such rich structures to hold the information necessary to solve the problem at hand.
Thanks to your in-code documentation and the description above, I understand your approach to thinning. I like to think of your design as a graph algorithm: thinning requires graph traversal, so you obtained a proper, reliable and scalable graph data structure to operate on.
I have a few follow-up questions on certain low-level aspects of your design. I roughly see why you might have made these decisions, but clarifying them with you will, I hope, benefit our community.
[1] Necessity of sub-recipes.
ThinningRecipe = namedtuple('ThinningRecipe', ['modules', 'parameters'])
Here are my notes on your choice of "modules", and why you might have considered having two sub-recipes in a ThinningRecipe.
Regarding the choice of the word "module":
This is probably my confusion between PyTorch modules, parameters and layers - are they really just different names for the model's per-layer parameters?
So a PyTorch model is simply a set of model parameters, and each parameter has a name. You have kept "PyTorch module" changes in the "modules" sub-recipe and "PyTorch parameter" changes in the "parameters" sub-recipe.
Every time param_directives are updated, modules will also get updated to reflect the changes made to the parameters.
modules help you handle special layers like BN.
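To make the two sub-recipes concrete, here is an illustrative sketch of a recipe (my own hypothetical example, not Distiller's helper API):
import collections
import torch

ThinningRecipe = collections.namedtuple('ThinningRecipe', ['modules', 'parameters'])

# Hypothetical example: keep 34 of the 64 output channels of features.0
# (a Conv2d) and fix up the BatchNorm2d that follows it (features.1).
recipe = ThinningRecipe(modules={}, parameters={})
keep = torch.arange(34)

# "modules" sub-recipe: configuration changes to apply to module attributes.
recipe.modules['features.0'] = {'out_channels': 34}
recipe.modules['features.1'] = {'num_features': 34}

# "parameters" sub-recipe: per-parameter directives, here (dim, indices-to-keep).
recipe.parameters['features.0.weight'] = [(0, keep)]
recipe.parameters['features.0.bias'] = [(0, keep)]
recipe.parameters['features.1.weight'] = [(0, keep)]
recipe.parameters['features.1.bias'] = [(0, keep)]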
[2] The case where Linear follows a Conv2d
This is a slightly simplified view of what you do when executing param_directives for the case of a Linear that follows a Conv2d. Could you please explain the logic under if len(directive) == 4:
for param_name, param_directives in recipe.parameters.items():
    param = model_find_param(model, param_name)
    assert param is not None
    for directive in param_directives:
        dim = directive[0]          # dimension along which to select
        indices = directive[1]      # indices of the elements to keep
        len_indices = indices.nelement()
        if len(directive) == 4:
            # 4-element directives carry two extra shapes: view the parameter
            # as directive[2], select along dim, then view back as directive[3]
            selection_view = param.view(*directive[2])
            if param.data.size(dim) != len_indices:
                param.data = torch.index_select(selection_view, dim, indices)
            param.data = param.view(*directive[3])
[3] Handling of BN layer separately.
You have chosen to handle the BN layer separately - was there an inherent restriction posed by the graph traversal or the data structure that required this?
For example, you could have added BatchNormalization to the following successor list, in addition to Conv and Gemm:
successors = sgraph.successors_f(normalize_module_name(layer_name), ['Conv', 'Gemm'])
and a condition to check isinstance(layers[successor], torch.nn.BatchNorm2d).
Let us say we have a Conv2d followed by a BN:
VGG(
(features): Sequential(
(0): Conv2d(3, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
(1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
If 64 became 34 due to pruning, we need to adjust the configuration (number of input features) of the next BN from 64 to 34. My understanding is that this is the only change required, and its parameters (the mu and sigma tensors) will remain the same, since the values removed are zeros, so the mean and the variance will remain unchanged.
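For concreteness, this is roughly the kind of adjustment I have in mind for the BN that follows (a hypothetical sketch, not Distiller's code):
import torch
import torch.nn as nn

# Hypothetical sketch: after pruning the Conv2d above from 64 to 34 output
# channels, the following BatchNorm2d must shrink the same way.
bn = nn.BatchNorm2d(64)
keep = torch.arange(34)                     # indices of the surviving channels

bn.num_features = 34                        # configuration change
bn.weight.data = torch.index_select(bn.weight.data, 0, keep)     # gamma
bn.bias.data = torch.index_select(bn.bias.data, 0, keep)         # beta
bn.running_mean = torch.index_select(bn.running_mean, 0, keep)   # mu
bn.running_var = torch.index_select(bn.running_var, 0, keep)     # sigma^2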
[4] Handling gradients
Hi Vinu, I'm currently traveling so it will take me a few days to answer. Cheers Neta
Sure Neta, No Problem.
[1] I was able to easily add support for thinning FC layers: using the modules sub-recipe to "reduce" dimensions and the parameters sub-recipe to "remove" neurons, via the append APIs used to access these recipes (roughly the kind of surgery sketched after [3] below).
[2] My changes seem to work fine for small networks; I test them with a SummaryGraph and a forward pass.
[3] I notice that the accuracy remains the same too - I still have not seen a misclassification.
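For [1], a hypothetical illustration of what removing neurons from a Linear layer amounts to (not my actual patch):
import torch
import torch.nn as nn

# Hypothetical illustration: remove output neurons from a Linear layer.
# Linear weights have shape (out_features, in_features), so output neurons
# are rows and input features are columns.
fc = nn.Linear(in_features=512, out_features=256)
keep = torch.arange(128)                     # indices of the neurons to keep

fc.out_features = 128                        # "reduce" the module's dimensions
fc.weight.data = torch.index_select(fc.weight.data, 0, keep)   # remove rows
fc.bias.data = torch.index_select(fc.bias.data, 0, keep)
# A Linear layer that consumes this output would then need its in_features
# and weight columns (dim 1) reduced the same way.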
Hi Vinu,
Sorry for taking such a long time - I was away in a conference and on vacation. Do you want to share your new code for thinning FC layers?
Cheers, Neta
Sure Neta, I will do it this week.
Hi,
Closing this due to staleness.
The thinning methods support only removing channels or filters of a CONV layer
[1] How about thinning FC layers? Even if you are not going to support it, can you describe what one should take care of if one wants to implement, say, remove_rows( ) or remove_columns( ) corresponding to neuron pruning?
[2] It seems hard to simply extend the thinning_recipe approach, as it seems too tied to removing CONV structures. Any suggestions?
[3] Also, if we are thinning pruned PyTorch models, what could be the reason for an accuracy drop? Because we are strictly removing only zero structures, the math should be about the same and produce the same classification? You seem to be taking into consideration a possible performance drop by preparing to thin even the gradient tensors.