aoji0606 opened this issue 1 year ago (status: Open)
Hi @aoji-tjut, please follow tests/test_torchvision_models.py for ViT pruning. It is slightly different from CNN pruning because ViT inference relies on some internal variables such as num_heads and out_channels. These variables have to be adjusted manually after pruning, since they cannot be detected automatically.
Here is my result from running tests/test_torchvision_models.py:
vit_b_16
torch.Size([1, 1, 384]) torch.Size([1, 197, 384])
VisionTransformer(
  (conv_proj): Conv2d(3, 384, kernel_size=(16, 16), stride=(16, 16))
  (encoder): Encoder(
    (dropout): Dropout(p=0.0, inplace=False)
    (layers): Sequential(
      (encoder_layer_0): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_1): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_2): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_3): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_4): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_5): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_6): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_7): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_8): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_9): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_10): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
      (encoder_layer_11): EncoderBlock(
        (ln_1): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (self_attention): MultiheadAttention(
          (out_proj): NonDynamicallyQuantizableLinear(in_features=384, out_features=384, bias=True)
        )
        (dropout): Dropout(p=0.0, inplace=False)
        (ln_2): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
        (mlp): MLPBlock(
          (0): Linear(in_features=384, out_features=1536, bias=True)
          (1): GELU(approximate=none)
          (2): Dropout(p=0.0, inplace=False)
          (3): Linear(in_features=1536, out_features=384, bias=True)
          (4): Dropout(p=0.0, inplace=False)
        )
      )
    )
    (ln): LayerNorm((384,), eps=1e-06, elementwise_affine=True)
  )
  (heads): Sequential(
    (head): Linear(in_features=384, out_features=1000, bias=True)
  )
)
vit_b_16
Params: 86567656 => 22050664
Output: torch.Size([1, 1000])
------------------------------------------------------
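The "manually adjusted after pruning" step mentioned above can be sketched in plain Python. The AttentionStub class and fix_attention_bookkeeping helper below are hypothetical stand-ins (not part of the torch_pruning API); they only illustrate the bookkeeping arithmetic for the 768 -> 384 case shown in the log, where the per-head width is kept and whole heads are removed:

```python
# Sketch of the manual attribute fix: after structured pruning shrinks an
# attention layer's channels, derived attributes such as embed_dim,
# num_heads, and head_dim are NOT updated by the pruner (it only rewrites
# weight tensors) and must be recomputed by hand.
# AttentionStub is a minimal stand-in for nn.MultiheadAttention.

class AttentionStub:
    """Minimal stand-in for an attention module's bookkeeping attributes."""
    def __init__(self, embed_dim, num_heads):
        self.embed_dim = embed_dim
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads  # per-head width

def fix_attention_bookkeeping(attn, pruned_embed_dim):
    """Keep the per-head width fixed and shrink the head count instead."""
    assert pruned_embed_dim % attn.head_dim == 0, "prune whole heads only"
    attn.num_heads = pruned_embed_dim // attn.head_dim
    attn.embed_dim = pruned_embed_dim
    return attn

# vit_b_16: 12 heads of width 64 (768 dims) pruned to 384 dims -> 6 heads
attn = fix_attention_bookkeeping(AttentionStub(768, 12), 384)
print(attn.num_heads, attn.head_dim)  # -> 6 64
```

On a real torchvision model the same assignments would be applied to each encoder block's self_attention module after the pruner runs, which is what the test script does by hand.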
OK, thank you!
Hello, I am running into the same issue, but I cannot open the link you provided. Could you please share the code again? Thanks!
I am trying to prune the Chinese-CLIP model. In its ViT, the final layer is "(ln_post): LayerNorm((768,), eps=1e-05, elementwise_affine=True)". If I do not add it to ignored_layers, pruning keeps reporting "index 384 is out of bounds for dimension 0 with size 384". However, if I do add it to ignored_layers, I get a model in which only the linear layers are pruned; all other layers keep their dimensions unchanged. Can you give me any suggestions?
Hello, I want to prune the vit_b_16 model from torchvision, but I got this error:
File "test.py", line 39, in prune
    ignored_layers=ignored_layers,
File ".../torch_pruning/pruner/algorithms/metapruner.py", line 67, in __init__
    customized_pruners=customized_pruners,
File ".../torch_pruning/dependency.py", line 262, in build_dependency
    self.update_index_mapping()
File ".../torch_pruning/dependency.py", line 630, in update_index_mapping
    self._update_concat_index_mapping(node)
File ".../torch_pruning/dependency.py", line 670, in _update_concat_index_mapping
    offsets.append(offsets[-1] + ch)
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
where "ignored_layers" contains only the final classifier layer.