I encountered the same problem and temporarily fixed the bug by modifying scheduler.py lines 103~105 to:
self.sched_metadata[policy] = {'starting_epoch': epochs[0],
'ending_epoch': epochs[-1],
'frequency': frequency}
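For reference, a minimal hypothetical sketch (not the actual Distiller code) of why this works: the underlying TypeError occurs when the YAML supplies 'epochs' as a bare int, which is not subscriptable. Normalizing it to a list handles both forms:

# Hypothetical helper, not Distiller code: normalize 'epochs' so a bare int
# (epochs: 180) and a list (epochs: [0, 10, 20]) both yield valid bounds.
def epoch_bounds(epochs):
    if isinstance(epochs, int):
        epochs = [epochs]
    return epochs[0], epochs[-1]

print(epoch_bounds(180))          # (180, 180)
print(epoch_bounds([0, 10, 20]))  # (0, 20)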
Thanks, Yi-syuan, it worked for me.
@chenys1995 using your suggested modification leads to a different error ...
[phung@archlinux classifier_compression]$ time python compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --epochs=70 --lr=0.1 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml --resume=../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar -j=1 --deterministic
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.17-121949/2018.11.17-121949.log
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Logging to TensorBoard - remember to execute the server:
tensorboard --logdir='./logs'
=> loading checkpoint ../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar
Checkpoint keys: compression_sched best_top1 optimizer state_dict epoch arch
best top@1: 92.920
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml
FATAL Parsing error!
{
  "pruner": {
    "instance_name": "filter_pruner"
  },
  "epochs": [
    180
  ]
}
Exception: <class 'TypeError'> 'int' object is not subscriptable
Traceback (most recent call last):
File "compress_classifier.py", line 781, in <module>
main()
File "compress_classifier.py", line 346, in main
compression_scheduler = distiller.file_config(model, optimizer, args.compress)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/config.py", line 146, in file_config
return dict_config(model, optimizer, sched_dict)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/config.py", line 109, in dict_config
add_policy_to_scheduler(policy, policy_def, schedule)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/config.py", line 133, in add_policy_to_scheduler
schedule.add_policy(policy, epochs=policy_def['epochs'])
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/scheduler.py", line 103, in add_policy
self.sched_metadata[policy] = {'starting_epoch': epoch[0],
TypeError: 'int' object is not subscriptable
real 0m7.114s
user 0m4.175s
sys 0m0.995s
[phung@archlinux classifier_compression]$
@promach you should correct 'epoch' to 'epochs'
@chenys1995
I have corrected my typo, but it leads to a completely different error. Please correct me if I have made any other typos.
[phung@archlinux classifier_compression]$ time python compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --epochs=70 --lr=0.1 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml --resume=../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar -j=1 --deterministic
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.19-224347/2018.11.19-224347.log
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Logging to TensorBoard - remember to execute the server:
tensorboard --logdir='./logs'
=> loading checkpoint ../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar
Checkpoint keys: compression_sched best_top1 optimizer state_dict epoch arch
best top@1: 92.920
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml
L1RankedStructureParameterPruner - param: module.layer1.0.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.1.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.2.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.3.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.4.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.5.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.6.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.7.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.8.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer2.1.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.2.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.3.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.4.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.6.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.7.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer3.1.conv1.weight pruned=0.188 goal=0.200 (12/64)
L1RankedStructureParameterPruner - param: module.layer3.2.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.3.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.5.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.6.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.7.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.8.conv1.weight pruned=0.391 goal=0.400 (25/64)
Training epoch: 45000 samples (256 per mini-batch)
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Traceback (most recent call last):
File "compress_classifier.py", line 781, in <module>
main()
File "compress_classifier.py", line 378, in main
loggers=[tflogger, pylogger], args=args)
File "compress_classifier.py", line 457, in train
compression_scheduler.on_minibatch_begin(epoch, train_step, steps_per_epoch, optimizer)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/scheduler.py", line 120, in on_minibatch_begin
self.zeros_mask_dict, meta, optimizer)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 395, in on_minibatch_begin
self.__apply(model, zeros_mask_dict, optimizer)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 386, in __apply
self.thinning_func(model, zeros_mask_dict, self.arch, self.dataset, optimizer=optimizer)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 200, in remove_filters
sgraph = create_graph(dataset, arch)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 75, in create_graph
return SummaryGraph(model, dummy_input.cuda())
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/apputils/model_summaries.py", line 102, in __init__
torch.onnx._optimize_trace(trace, False)
File "/usr/lib/python3.7/site-packages/torch/onnx/__init__.py", line 42, in _optimize_trace
trace.set_graph(utils._optimize_graph(trace.graph(), operator_export_type))
File "/usr/lib/python3.7/site-packages/torch/onnx/utils.py", line 153, in _optimize_graph
if operator_export_type != OperatorExportTypes.RAW:
TypeError: ne(): incompatible function arguments. The following argument types are supported:
- (self: torch._C._onnx.OperatorExportTypes, arg0: torch._C._onnx.OperatorExportTypes) -> bool
Invoked with: OperatorExportTypes.RAW, False
real 0m6.110s
user 0m4.491s
sys 0m1.179s
[phung@archlinux classifier_compression]$
Hi Neta, thanks for fixing the issue. However, when I try to do filter and channel pruning, some errors occur.
L1RankedStructureParameterPruner - param: module.layer3.1.output.2.0.weight pruned=0.500 goal=0.500 (12/24)
L1RankedStructureParameterPruner - param: module.layer4.2.output.2.0.weight pruned=0.500 goal=0.500 (16/32)
L1RankedStructureParameterPruner - param: module.layer5.3.output.2.0.weight pruned=0.500 goal=0.500 (32/64)
L1RankedStructureParameterPruner - param: module.layer6.2.output.2.0.weight pruned=0.500 goal=0.500 (48/96)
L1RankedStructureParameterPruner - param: module.layer7.2.output.2.0.weight pruned=0.500 goal=0.500 (80/160)
L1RankedStructureParameterPruner - param: module.layer8.0.output.2.0.weight pruned=0.500 goal=0.500 (160/320)
L1RankedStructureParameterPruner - param: module.crp4.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp4.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp3.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp3.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp2.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp2.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp1.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp1.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
Invoking create_thinning_recipe_filters
In tensor module.layer3.1.output.2.0.weight found 12/24 zero filters
In tensor module.layer4.2.output.2.0.weight found 16/32 zero filters
In tensor module.layer5.3.output.2.0.weight found 32/64 zero filters
In tensor module.layer6.2.output.2.0.weight found 48/96 zero filters
In tensor module.layer7.2.output.2.0.weight found 80/160 zero filters
In tensor module.layer8.0.output.2.0.weight found 160/320 zero filters
In tensor module.crp4.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp4.0.3_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp3.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp3.0.3_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp2.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp2.0.3_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp1.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp1.0.3_outvar_dimred.weight found 128/256 zero filters
Created, applied and saved a thinning recipe
But I finally got an error, 'RuntimeError: running_mean should contain 12 elements not 24'. It seems the BN layer has not been pruned after the Conv layer.
Can you kindly suggest a solution? Thanks.
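For what it's worth, the error message points at exactly this: the Conv kept 12 of 24 filters, but the following BatchNorm still carries 24-element buffers. A minimal sketch (hypothetical, not Distiller's actual thinning code) of what thinning a BatchNorm to match a pruned Conv entails:

import torch
import torch.nn as nn

def thin_batchnorm(bn, keep):
    # Shrink every per-channel parameter and buffer of a BatchNorm2d to the
    # surviving channel indices in 'keep'.
    new_bn = nn.BatchNorm2d(len(keep))
    new_bn.weight.data = bn.weight.data[keep].clone()
    new_bn.bias.data = bn.bias.data[keep].clone()
    new_bn.running_mean = bn.running_mean[keep].clone()
    new_bn.running_var = bn.running_var[keep].clone()
    return new_bn

bn = nn.BatchNorm2d(24)
keep = torch.arange(12, dtype=torch.long)  # indices of the 12 surviving filters
print(thin_batchnorm(bn, keep))            # BatchNorm2d(12, ...)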
However, it gave me another error:
Invoking create_thinning_recipe_channels
In tensor module.layer1.0.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.0.conv1 normal=layer1.0.conv1 module.layer1.0.conv1
In tensor module.layer1.1.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.1.conv1 normal=layer1.1.conv1 module.layer1.1.conv1
In tensor module.layer1.2.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.2.conv1 normal=layer1.2.conv1 module.layer1.2.conv1
In tensor module.layer1.3.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.3.conv1 normal=layer1.3.conv1 module.layer1.3.conv1
In tensor module.layer1.4.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.4.conv1 normal=layer1.4.conv1 module.layer1.4.conv1
In tensor module.layer1.5.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.5.conv1 normal=layer1.5.conv1 module.layer1.5.conv1
In tensor module.layer1.6.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.6.conv1 normal=layer1.6.conv1 module.layer1.6.conv1
In tensor module.layer1.7.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.7.conv1 normal=layer1.7.conv1 module.layer1.7.conv1
In tensor module.layer1.8.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.8.conv1 normal=layer1.8.conv1 module.layer1.8.conv1
In tensor module.layer2.1.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.1.conv1 normal=layer2.1.conv1 module.layer2.1.conv1
In tensor module.layer2.2.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.2.conv1 normal=layer2.2.conv1 module.layer2.2.conv1
In tensor module.layer2.3.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.3.conv1 normal=layer2.3.conv1 module.layer2.3.conv1
In tensor module.layer2.4.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.4.conv1 normal=layer2.4.conv1 module.layer2.4.conv1
In tensor module.layer2.6.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.6.conv1 normal=layer2.6.conv1 module.layer2.6.conv1
In tensor module.layer2.7.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.7.conv1 normal=layer2.7.conv1 module.layer2.7.conv1
In tensor module.layer3.1.conv1.weight found 6/64 zero channels
Could not find predecessors for name=module.layer3.1.conv1 normal=layer3.1.conv1 module.layer3.1.conv1
In tensor module.layer3.2.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.2.conv1 normal=layer3.2.conv1 module.layer3.2.conv1
In tensor module.layer3.3.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.3.conv1 normal=layer3.3.conv1 module.layer3.3.conv1
In tensor module.layer3.5.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.5.conv1 normal=layer3.5.conv1 module.layer3.5.conv1
In tensor module.layer3.6.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.6.conv1 normal=layer3.6.conv1 module.layer3.6.conv1
In tensor module.layer3.7.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.7.conv1 normal=layer3.7.conv1 module.layer3.7.conv1
In tensor module.layer3.8.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.8.conv1 normal=layer3.8.conv1 module.layer3.8.conv1
Created, applied and saved a thinning recipe
==> Best Top1: 93.640 on Epoch: 180
Saving checkpoint to: logs/2018.11.21-162715/checkpoint.pth.tar
Training epoch: 45000 samples (256 per mini-batch)
Traceback (most recent call last):
File "/home/chongyu/application/distiller/examples/classifier_compression/compress_classifier.py", line 787, in
Can you kindly help for this error? Thanks.
I think I may have found a clue for my Q1. In 'thinning.py' L328 and L357, the code looks for layers whose 'type' is 'Conv', 'Gemm' or 'BatchNormalization', but in 'model_summaries.py' L307 I found the 'type' is 'aten::_convolution' or 'aten::batch_norm', which means the 'type' values do not match.
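If that is the cause, one hypothetical way to bridge the mismatch (a sketch, not a Distiller patch; the mapping entries beyond the two names observed above are assumptions) is to normalize the raw ATen names from the trace back to the ONNX-style names that thinning.py expects:

# Hypothetical mapping; only the first two entries were actually observed here.
ATEN_TO_ONNX = {
    'aten::_convolution': 'Conv',
    'aten::batch_norm': 'BatchNormalization',
    'aten::addmm': 'Gemm',
}

def normalize_op_type(op_type):
    # Return the ONNX-style name for a raw traced op type, else unchanged.
    return ATEN_TO_ONNX.get(op_type, op_type)

print(normalize_op_type('aten::_convolution'))  # Conv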
Hi @cattpku ,
The PR I just merged only contains a fix to the original problem that you reported.
I suspect that I know the source of the other 2 issues you have - (1) and (2) above, but I can't be sure.
Can you share the definition of your network? It will help me recreate the issue and understand it.
If you can't share, then in a few days I will push a fix to the problem I see (which occurs in networks that have complex data-dependencies, such as in ResNet) and hopefully it will also fix your issue.
Cheers, Neta
Hi Neta,
Thanks for your reply. For Q1, the network definition is here https://github.com/DrSleep/light-weight-refinenet/blob/master/models/mobilenet.py, and for Q2, I was using the 'resnet56_cifar', which is from the example.
Hi @cattpku,
I couldn't use this network because it has some issues when I export it to ONNX. I think this has to do with pytorch version compatibilities, but I didn't fight it.
I tried recreating (2). I don't see the problem that you describe, but I see a different issue - this is the issue I am working on and it is related to my discussion with @vinutah in issue #73. To make sure we are aligned, I'm attaching the YAML schedule for (2), which I created according to the instructions you provided above (I changed the extension from .yaml to .txt so that I could attach it).
resnet56_cifar_filter_rank_cattpku.txt
The error I get is this:
--- validate (epoch=180)-----------
5000 samples (256 per mini-batch)
==> Top1: 86.380 Top5: 99.360 Loss: 0.468
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Invoking create_thinning_recipe_channels
In tensor module.layer1.0.conv1.weight found 9/16 zero channels
In tensor module.layer1.1.conv1.weight found 9/16 zero channels
Traceback (most recent call last):
File "compress_classifier.py", line 788, in <module>
main()
File "compress_classifier.py", line 407, in main
compression_scheduler.on_epoch_end(epoch, optimizer)
File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/scheduler.py", line 161, in on_epoch_end
policy.on_epoch_end(self.model, self.zeros_mask_dict, meta)
File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/thinning.py", line 374, in on_epoch_end
self.thinning_func(model, zeros_mask_dict, self.arch, self.dataset, meta.get('optimizer', None))
File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/thinning.py", line 147, in remove_channels
thinning_recipe = create_thinning_recipe_channels(sgraph, model, zeros_mask_dict)
File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/thinning.py", line 266, in create_thinning_recipe_channels
assert len(bn_layers) == 1
AssertionError
This is caused by the complex data-dependencies that exist in ResNet, and I'm addressing this issue. I modified this notebook to show the dependencies and I'm attaching it here (again, I changed the file type from .ipynb to .txt).
pruning_channels_and_filters-cattpku.txt
The output looks like this:
The green oval is the module whose channels we reduce; the red ovals are the modules (layers) that Distiller determined have a data-dependency and need to be changed. The "thinning" code doesn't currently handle these dependencies well.
Cheers, Neta
Hi Neta,
I also tried to export it to ONNX, and I failed too...
I just compared the YAML schedule you sent with the one I used; they are exactly the same, but I reproduced the same error as in my Q2. After debugging, I found the cause is the same as what I found for my Q1, that is:
predecessors = sgraph.predecessors_f(normalize_module_name(layer_name), ['Conv'])
I need to change it to
predecessors = sgraph.predecessors_f(normalize_module_name(layer_name), ['aten::_convolution'])
and the same for bn_layers:
bn_layers = sgraph.predecessors_f(normalize_module_name(layer_name), ['aten::batch_norm'])
With this modification, I got the same error as you showed above.
This is strange - I don't understand at the moment why you are seeing this.
Can you send me your environment configuration? You will find this information in any of the log files - in the top ~15 lines. Thanks, Neta
This is weird: I used 'nvcc --version' to check the CUDA version and it reports 'release 8.0, V8.0.61', so how come the log shows 'CUDA version: 9.0.176'?
2018-11-23 00:51:58,493 - Log file for this run: /home/chongyu/application/distiller/examples/classifier_compression/logs/2018.11.23-005158/2018.11.23-005158.log
2018-11-23 00:51:58,493 - Number of CPUs: 32
2018-11-23 00:51:58,516 - Number of GPUs: 1
2018-11-23 00:51:58,516 - CUDA version: 9.0.176
2018-11-23 00:51:58,516 - CUDNN version: 7102
2018-11-23 00:51:58,516 - Kernel: 4.15.0-36-generic
2018-11-23 00:51:58,516 - Python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
2018-11-23 00:51:58,516 - PyTorch: 0.4.1
2018-11-23 00:51:58,516 - Numpy: 1.14.3
2018-11-23 00:51:58,531 - Git is dirty
2018-11-23 00:51:58,531 - Active Git branch: master
2018-11-23 00:51:58,538 - Git commit: ff6985adfc0e8b1b2a15a3b58d8675997f5e79d2
2018-11-23 00:51:58,538 - App args: ['/home/chongyu/application/distiller/examples/classifier_compression/compress_classifier.py']
2018-11-23 00:51:58,539 - ==> using cifar10 dataset
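This discrepancy is expected: PyTorch binary packages bundle their own CUDA runtime, so the version PyTorch reports can legitimately differ from the system-wide nvcc. A quick sanity check:

import torch

print(torch.version.cuda)              # CUDA version PyTorch was built with
print(torch.backends.cudnn.version())  # bundled cuDNN version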
Can you switch for a second to pytorch 0.4.0 to see if you still see 'aten::_convolution' and 'aten::batch_norm'? I have a hunch Facebook touched this ONNX code. Thanks
You are right: once I switched to pytorch 0.4.0, the 'type' changed to 'Conv' and 'BatchNormalization'. But I got a new error like: Unexpected key(s) in state_dict: "module.layer1.1.num_batches_tracked", "module.layer2.0.output.0.1.num_batches_tracked", ....... I have no idea where 'num_batches_tracked' comes from, but once I switch back to 0.4.1, no such error occurs.
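The 'num_batches_tracked' buffers were added to BatchNorm in PyTorch 0.4.1, so a checkpoint saved under 0.4.1 carries keys that a 0.4.0 model does not recognize. A hypothetical workaround sketch (the path and usage below are placeholders):

import torch

def strip_bn_counters(state_dict):
    # Drop the BatchNorm 'num_batches_tracked' buffers introduced in 0.4.1.
    return {k: v for k, v in state_dict.items()
            if not k.endswith('num_batches_tracked')}

# Usage (placeholder names):
# checkpoint = torch.load('checkpoint.pth.tar', map_location='cpu')
# model.load_state_dict(strip_bn_counters(checkpoint['state_dict']))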
Yes, thanks for your reminder. By the way, does filter-wise or channel-wise pruning support group convolution layers? I am currently testing both kinds of pruning on MobileNetV2 and ShuffleNetV2, which both contain group convolutions. After fixing the errors I reported above, the provided pruning example runs smoothly, but my own pruning always fails with this error: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorCopy.cpp:70. After debugging I found the error occurs in 'thinning.py' at param.data = torch.index_select(param.data, dim, indices): the values in 'indices' are out of bounds for the size of 'param.data'. I am not sure why this happens; I hope you can suggest something. Thanks.
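One hypothetical way to make that failure easier to localize (a debugging sketch, not a fix): check the indices on the host before the index_select, so it raises a clear Python error instead of CUDA error 59:

import torch

def checked_index_select(tensor, dim, indices):
    # index_select with an explicit bounds check on the thinning indices.
    max_idx = int(indices.max())
    if max_idx >= tensor.size(dim):
        raise IndexError("index %d out of range for dim %d of size %d"
                         % (max_idx, dim, tensor.size(dim)))
    return torch.index_select(tensor, dim, indices)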
@cattpku Did you encounter https://github.com/NervanaSystems/distiller/issues/79#issuecomment-439948112 even after doing a git pull for the latest commit?
@promach I also met this problem with pytorch 0.4.1. I compared 'utils.py' between 0.4.0 and 0.4.1, and then I just commented out the 'if operator_export_type != OperatorExportTypes.RAW' check.
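Instead of editing torch's utils.py, a hypothetical alternative is to change the call in Distiller's model_summaries.py to pass an enum instead of the bool False that triggers the ne() type error. The error message itself confirms the enum type torch._C._onnx.OperatorExportTypes exists, though the ONNX member name here is an assumption:

# In apputils/model_summaries.py, 'trace' is already defined at this point;
# replace torch.onnx._optimize_trace(trace, False) with something like:
import torch
export_type = torch._C._onnx.OperatorExportTypes.ONNX  # assumed member name
torch.onnx._optimize_trace(trace, export_type)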
@cattpku I am not really sure whether your quick hack in utils.py actually solves everything, because it leads me to another error in functional.py, which I am not familiar with at all.
[phung@archlinux classifier_compression]$ time python compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --epochs=70 --lr=0.1 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml --resume=../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar -j=1 --deterministic
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.26-231318/2018.11.26-231318.log
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Logging to TensorBoard - remember to execute the server:
tensorboard --logdir='./logs'
=> loading checkpoint ../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar
Checkpoint keys: compression_sched best_top1 optimizer state_dict epoch arch
best top@1: 92.920
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml
L1RankedStructureParameterPruner - param: module.layer1.0.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.1.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.2.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.3.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.4.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.5.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.6.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.7.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.8.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer2.1.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.2.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.3.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.4.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.6.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.7.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer3.1.conv1.weight pruned=0.188 goal=0.200 (12/64)
L1RankedStructureParameterPruner - param: module.layer3.2.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.3.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.5.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.6.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.7.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.8.conv1.weight pruned=0.391 goal=0.400 (25/64)
Training epoch: 45000 samples (256 per mini-batch)
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Invoking create_thinning_recipe_filters
In tensor module.layer1.0.conv1.weight found 11/16 zero filters
In tensor module.layer1.1.conv1.weight found 11/16 zero filters
In tensor module.layer1.2.conv1.weight found 11/16 zero filters
In tensor module.layer1.3.conv1.weight found 11/16 zero filters
In tensor module.layer1.4.conv1.weight found 11/16 zero filters
In tensor module.layer1.5.conv1.weight found 11/16 zero filters
In tensor module.layer1.6.conv1.weight found 11/16 zero filters
In tensor module.layer1.7.conv1.weight found 11/16 zero filters
In tensor module.layer1.8.conv1.weight found 11/16 zero filters
In tensor module.layer2.1.conv1.weight found 19/32 zero filters
In tensor module.layer2.2.conv1.weight found 19/32 zero filters
In tensor module.layer2.3.conv1.weight found 19/32 zero filters
In tensor module.layer2.4.conv1.weight found 19/32 zero filters
In tensor module.layer2.6.conv1.weight found 19/32 zero filters
In tensor module.layer2.7.conv1.weight found 19/32 zero filters
In tensor module.layer3.1.conv1.weight found 12/64 zero filters
In tensor module.layer3.2.conv1.weight found 25/64 zero filters
In tensor module.layer3.3.conv1.weight found 25/64 zero filters
In tensor module.layer3.5.conv1.weight found 25/64 zero filters
In tensor module.layer3.6.conv1.weight found 25/64 zero filters
In tensor module.layer3.7.conv1.weight found 25/64 zero filters
In tensor module.layer3.8.conv1.weight found 25/64 zero filters
Created, applied and saved a thinning recipe
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.26-231318/2018.11.26-231318.log
Traceback (most recent call last):
File "compress_classifier.py", line 789, in <module>
main()
File "compress_classifier.py", line 386, in main
loggers=[tflogger, pylogger], args=args)
File "compress_classifier.py", line 467, in train
output = model(inputs)
File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
return self.module(*inputs[0], **kwargs[0])
File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/models/cifar10/resnet_cifar.py", line 140, in forward
x = self.layer1(x)
File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
input = module(input)
File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/models/cifar10/resnet_cifar.py", line 71, in forward
out = self.bn1(out)
File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
result = self.forward(*input, **kwargs)
File "/usr/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 67, in forward
exponential_average_factor, self.eps)
File "/usr/lib/python3.7/site-packages/torch/nn/functional.py", line 1349, in batch_norm
training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: running_mean should contain 5 elements not 16
real 0m7.277s
user 0m6.356s
sys 0m1.016s
[phung@archlinux classifier_compression]$
@promach it is also caused by 0.4.1, please check the posts above
Hi Neta,
I tried to run the filter sensitivity analysis with the following command 'python3 compress_classifier.py -a resnet20_cifar --data ../../../data.cifar10/ -j 12 --resume=../ssl/checkpoints/checkpoint_trained_dense.pth.tar --sense=filter', but got an error. Detailed log:
Logging to TensorBoard - remember to execute the server:
=> loading checkpoint ../ssl/checkpoints/checkpoint_trained_dense.pth.tar
Checkpoint keys: arch optimizer compression_sched state_dict best_top1 epoch
best top@1: 92.540
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../ssl/checkpoints/checkpoint_trained_dense.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Running sensitivity tests
Testing sensitivity of module.conv1.weight [0.0% sparsity]
Traceback (most recent call last):
File "compress_classifier.py", line 782, in <module>
main()
File "compress_classifier.py", line 339, in main
return sensitivity_analysis(model, criterion, test_loader, pylogger, args)
File "compress_classifier.py", line 750, in sensitivity_analysis
group=args.sensitivity)
File "/home/chongyu/application/distiller/distiller/sensitivity.py", line 108, in perform_sensitivity_analysis
scheduler.on_epoch_begin(0)
File "/home/chongyu/application/distiller/distiller/scheduler.py", line 112, in on_epoch_begin
policy.on_epoch_begin(self.model, self.zeros_mask_dict, meta)
File "/home/chongyu/application/distiller/distiller/policy.py", line 123, in on_epoch_begin
self.is_last_epoch = meta['current_epoch'] == (meta['ending_epoch'] - 1)
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'
It looks like there is no valid value for meta['ending_epoch']. Can you kindly suggest how to solve it? Thanks.
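Indeed, the traceback shows the sensitivity-analysis path calls scheduler.on_epoch_begin(0) without scheduling metadata, so meta['ending_epoch'] is None when policy.on_epoch_begin subtracts from it. A hypothetical tolerant check (a sketch, not the upstream fix):

def is_last_epoch(meta):
    # False when 'ending_epoch' is missing or None, instead of crashing.
    ending_epoch = meta.get('ending_epoch')
    return ending_epoch is not None and meta['current_epoch'] == ending_epoch - 1

print(is_last_epoch({'current_epoch': 0, 'ending_epoch': None}))   # False
print(is_last_epoch({'current_epoch': 179, 'ending_epoch': 180}))  # True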