IntelLabs / distiller

Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Apache License 2.0

sensitivity analysis fail #79

Closed cattpku closed 5 years ago

cattpku commented 5 years ago

Hi Neta,

I tried to run filter sensitivity analysis with the following command: 'python3 compress_classifier.py -a resnet20_cifar --data ../../../data.cifar10/ -j 12 --resume=../ssl/checkpoints/checkpoint_trained_dense.pth.tar --sense=filter', but got an error. Detailed log:

Logging to TensorBoard - remember to execute the server:

tensorboard --logdir='./logs'

=> loading checkpoint ../ssl/checkpoints/checkpoint_trained_dense.pth.tar
Checkpoint keys: arch optimizer compression_sched state_dict best_top1 epoch
best top@1: 92.540
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../ssl/checkpoints/checkpoint_trained_dense.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Running sensitivity tests
Testing sensitivity of module.conv1.weight [0.0% sparsity]
Traceback (most recent call last):
  File "compress_classifier.py", line 782, in <module>
    main()
  File "compress_classifier.py", line 339, in main
    return sensitivity_analysis(model, criterion, test_loader, pylogger, args)
  File "compress_classifier.py", line 750, in sensitivity_analysis
    group=args.sensitivity)
  File "/home/chongyu/application/distiller/distiller/sensitivity.py", line 108, in perform_sensitivity_analysis
    scheduler.on_epoch_begin(0)
  File "/home/chongyu/application/distiller/distiller/scheduler.py", line 112, in on_epoch_begin
    policy.on_epoch_begin(self.model, self.zeros_mask_dict, meta)
  File "/home/chongyu/application/distiller/distiller/policy.py", line 123, in on_epoch_begin
    self.is_last_epoch = meta['current_epoch'] == (meta['ending_epoch'] - 1)
TypeError: unsupported operand type(s) for -: 'NoneType' and 'int'

It looks like there is no valid value for meta['ending_epoch']. Can you kindly suggest how to solve it? Thanks.

chenys1995 commented 5 years ago

I encountered the same problem and temporarily fixed the bug by modifying scheduler.py lines 103-105 to:

    self.sched_metadata[policy] = {'starting_epoch': epochs[0],
                                   'ending_epoch': epochs[-1],
                                   'frequency': frequency}
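
In other words, the per-policy metadata has to end up with concrete epoch values whether or not an epochs list was given. As a minimal standalone sketch (a hypothetical helper, not Distiller's exact API or signature):

    # Hypothetical sketch: build per-policy metadata so 'ending_epoch' is always an
    # int, both for the YAML-schedule path (explicit epochs list) and for the
    # sensitivity-analysis path (no epochs given at all).
    def build_policy_metadata(epochs=None, starting_epoch=0, ending_epoch=1, frequency=1):
        if epochs:  # e.g. epochs: [180] in a YAML schedule
            starting_epoch, ending_epoch = epochs[0], epochs[-1]
        return {'starting_epoch': starting_epoch,
                'ending_epoch': ending_epoch,
                'frequency': frequency}

    assert build_policy_metadata()['ending_epoch'] is not None   # no schedule: safe defaults
    assert build_policy_metadata(epochs=[180])['ending_epoch'] == 180

With metadata like this, policy.on_epoch_begin() can safely compute meta['ending_epoch'] - 1 instead of hitting the 'NoneType' TypeError above.
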
cattpku commented 5 years ago

Thanks, Yi-syuan, it worked for me.

buttercutter commented 5 years ago

@chenys1995 using your suggested modification leads to a different error ...

[phung@archlinux classifier_compression]$ time python compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --epochs=70 --lr=0.1 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml --resume=../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar -j=1 --deterministic
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.17-121949/2018.11.17-121949.log
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10


Logging to TensorBoard - remember to execute the server:

tensorboard --logdir='./logs'

=> loading checkpoint ../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar
Checkpoint keys: compression_sched best_top1 optimizer state_dict epoch arch
best top@1: 92.920
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml

FATAL Parsing error!
{ "pruner": { "instance_name": "filter_pruner" }, "epochs": [ 180 ] }
Exception: <class 'TypeError'> 'int' object is not subscriptable
Traceback (most recent call last):
  File "compress_classifier.py", line 781, in <module>
    main()
  File "compress_classifier.py", line 346, in main
    compression_scheduler = distiller.file_config(model, optimizer, args.compress)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/config.py", line 146, in file_config
    return dict_config(model, optimizer, sched_dict)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/config.py", line 109, in dict_config
    add_policy_to_scheduler(policy, policy_def, schedule)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/config.py", line 133, in add_policy_to_scheduler
    schedule.add_policy(policy, epochs=policy_def['epochs'])
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/scheduler.py", line 103, in add_policy
    self.sched_metadata[policy] = {'starting_epoch': epoch[0],
TypeError: 'int' object is not subscriptable

Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.17-121949/2018.11.17-121949.log

real 0m7.114s
user 0m4.175s
sys 0m0.995s
[phung@archlinux classifier_compression]$

chenys1995 commented 5 years ago

@promach you should correct 'epoch' to 'epochs'

buttercutter commented 5 years ago

@chenys1995

I have corrected my typo, but it leads me to a totally different error. Please correct me if I have made some other typo.

[phung@archlinux classifier_compression]$ time python compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --epochs=70 --lr=0.1 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml --resume=../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar -j=1 --deterministic
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.19-224347/2018.11.19-224347.log
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10


Logging to TensorBoard - remember to execute the server:

tensorboard --logdir='./logs'

=> loading checkpoint ../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar
Checkpoint keys: compression_sched best_top1 optimizer state_dict epoch arch
best top@1: 92.920
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml

L1RankedStructureParameterPruner - param: module.layer1.0.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.1.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.2.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.3.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.4.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.5.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.6.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.7.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.8.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer2.1.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.2.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.3.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.4.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.6.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.7.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer3.1.conv1.weight pruned=0.188 goal=0.200 (12/64)
L1RankedStructureParameterPruner - param: module.layer3.2.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.3.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.5.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.6.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.7.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.8.conv1.weight pruned=0.391 goal=0.400 (25/64)
Training epoch: 45000 samples (256 per mini-batch)
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Traceback (most recent call last):
  File "compress_classifier.py", line 781, in <module>
    main()
  File "compress_classifier.py", line 378, in main
    loggers=[tflogger, pylogger], args=args)
  File "compress_classifier.py", line 457, in train
    compression_scheduler.on_minibatch_begin(epoch, train_step, steps_per_epoch, optimizer)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/scheduler.py", line 120, in on_minibatch_begin
    self.zeros_mask_dict, meta, optimizer)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 395, in on_minibatch_begin
    self.__apply(model, zeros_mask_dict, optimizer)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 386, in __apply
    self.thinning_func(model, zeros_mask_dict, self.arch, self.dataset, optimizer=optimizer)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 200, in remove_filters
    sgraph = create_graph(dataset, arch)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/distiller/thinning.py", line 75, in create_graph
    return SummaryGraph(model, dummy_input.cuda())
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/apputils/model_summaries.py", line 102, in __init__
    torch.onnx._optimize_trace(trace, False)
  File "/usr/lib/python3.7/site-packages/torch/onnx/__init__.py", line 42, in _optimize_trace
    trace.set_graph(utils._optimize_graph(trace.graph(), operator_export_type))
  File "/usr/lib/python3.7/site-packages/torch/onnx/utils.py", line 153, in _optimize_graph
    if operator_export_type != OperatorExportTypes.RAW:
TypeError: ne(): incompatible function arguments. The following argument types are supported:

  1. (self: torch._C._onnx.OperatorExportTypes, arg0: torch._C._onnx.OperatorExportTypes) -> bool

Invoked with: OperatorExportTypes.RAW, False

Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.19-224347/2018.11.19-224347.log

real 0m6.110s
user 0m4.491s
sys 0m1.179s
[phung@archlinux classifier_compression]$

cattpku commented 5 years ago

Hi Neta, thanks for fixing the issue. However, when I tried filter and channel pruning, some errors occurred.

  1. First, I tried filter-wise pruning with my own network, following the example 'resnet56_cifar_filter_rank.yaml'. It looks like the recipe was successfully applied to the Conv layers; details:

L1RankedStructureParameterPruner - param: module.layer3.1.output.2.0.weight pruned=0.500 goal=0.500 (12/24)
L1RankedStructureParameterPruner - param: module.layer4.2.output.2.0.weight pruned=0.500 goal=0.500 (16/32)
L1RankedStructureParameterPruner - param: module.layer5.3.output.2.0.weight pruned=0.500 goal=0.500 (32/64)
L1RankedStructureParameterPruner - param: module.layer6.2.output.2.0.weight pruned=0.500 goal=0.500 (48/96)
L1RankedStructureParameterPruner - param: module.layer7.2.output.2.0.weight pruned=0.500 goal=0.500 (80/160)
L1RankedStructureParameterPruner - param: module.layer8.0.output.2.0.weight pruned=0.500 goal=0.500 (160/320)
L1RankedStructureParameterPruner - param: module.crp4.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp4.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp3.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp3.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp2.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp2.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp1.0.1_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)
L1RankedStructureParameterPruner - param: module.crp1.0.3_outvar_dimred.weight pruned=0.500 goal=0.500 (128/256)

Invoking create_thinning_recipe_filters
In tensor module.layer3.1.output.2.0.weight found 12/24 zero filters
In tensor module.layer4.2.output.2.0.weight found 16/32 zero filters
In tensor module.layer5.3.output.2.0.weight found 32/64 zero filters
In tensor module.layer6.2.output.2.0.weight found 48/96 zero filters
In tensor module.layer7.2.output.2.0.weight found 80/160 zero filters
In tensor module.layer8.0.output.2.0.weight found 160/320 zero filters
In tensor module.crp4.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp4.0.3_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp3.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp3.0.3_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp2.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp2.0.3_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp1.0.1_outvar_dimred.weight found 128/256 zero filters
In tensor module.crp1.0.3_outvar_dimred.weight found 128/256 zero filters
Created, applied and saved a thinning recipe

But finally I got an error: 'RuntimeError: running_mean should contain 12 elements not 24'. It seems the BN layer following the Conv layer has not been pruned.
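
My understanding (just a rough standalone sketch, not Distiller's actual thinning code) is that when filters are removed from a Conv layer, the BatchNorm that follows it has to be thinned along the same indices, including its running statistics, otherwise the forward pass fails with exactly this kind of running_mean size mismatch:

    # Rough illustration (not Distiller code): keep 12 of 24 Conv filters and thin
    # the following BatchNorm with the same indices, including running_mean/var.
    import torch
    import torch.nn as nn

    conv = nn.Conv2d(16, 24, kernel_size=3, padding=1)
    bn = nn.BatchNorm2d(24)
    keep = torch.tensor([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20, 22])  # surviving filter indices

    conv.weight.data = torch.index_select(conv.weight.data, 0, keep)
    conv.bias.data = torch.index_select(conv.bias.data, 0, keep)
    conv.out_channels = len(keep)

    bn.weight.data = torch.index_select(bn.weight.data, 0, keep)
    bn.bias.data = torch.index_select(bn.bias.data, 0, keep)
    bn.running_mean = torch.index_select(bn.running_mean, 0, keep)
    bn.running_var = torch.index_select(bn.running_var, 0, keep)
    bn.num_features = len(keep)

    out = bn(conv(torch.randn(1, 16, 32, 32)))
    print(out.shape)  # torch.Size([1, 12, 32, 32]) - consistent again
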

Can you kindly suggest the solution for it? Thanks.

cattpku commented 5 years ago
  2. Following the previous run, I tried another test for channel-wise pruning by updating the configuration file 'resnet56_cifar_filter_rank.yaml'. I changed '3D' to 'Channels' for all layers, and then updated the 'net_thinner' as follows:

    extensions:
      net_thinner:
        class: 'ChannelRemover'
        thinning_func_str: remove_channels
        arch: 'resnet56_cifar'
        dataset: 'cifar10'

However, it gave me another error:

Invoking create_thinning_recipe_channels
In tensor module.layer1.0.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.0.conv1 normal=layer1.0.conv1 module.layer1.0.conv1
In tensor module.layer1.1.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.1.conv1 normal=layer1.1.conv1 module.layer1.1.conv1
In tensor module.layer1.2.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.2.conv1 normal=layer1.2.conv1 module.layer1.2.conv1
In tensor module.layer1.3.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.3.conv1 normal=layer1.3.conv1 module.layer1.3.conv1
In tensor module.layer1.4.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.4.conv1 normal=layer1.4.conv1 module.layer1.4.conv1
In tensor module.layer1.5.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.5.conv1 normal=layer1.5.conv1 module.layer1.5.conv1
In tensor module.layer1.6.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.6.conv1 normal=layer1.6.conv1 module.layer1.6.conv1
In tensor module.layer1.7.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.7.conv1 normal=layer1.7.conv1 module.layer1.7.conv1
In tensor module.layer1.8.conv1.weight found 9/16 zero channels
Could not find predecessors for name=module.layer1.8.conv1 normal=layer1.8.conv1 module.layer1.8.conv1
In tensor module.layer2.1.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.1.conv1 normal=layer2.1.conv1 module.layer2.1.conv1
In tensor module.layer2.2.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.2.conv1 normal=layer2.2.conv1 module.layer2.2.conv1
In tensor module.layer2.3.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.3.conv1 normal=layer2.3.conv1 module.layer2.3.conv1
In tensor module.layer2.4.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.4.conv1 normal=layer2.4.conv1 module.layer2.4.conv1
In tensor module.layer2.6.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.6.conv1 normal=layer2.6.conv1 module.layer2.6.conv1
In tensor module.layer2.7.conv1.weight found 16/32 zero channels
Could not find predecessors for name=module.layer2.7.conv1 normal=layer2.7.conv1 module.layer2.7.conv1
In tensor module.layer3.1.conv1.weight found 6/64 zero channels
Could not find predecessors for name=module.layer3.1.conv1 normal=layer3.1.conv1 module.layer3.1.conv1
In tensor module.layer3.2.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.2.conv1 normal=layer3.2.conv1 module.layer3.2.conv1
In tensor module.layer3.3.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.3.conv1 normal=layer3.3.conv1 module.layer3.3.conv1
In tensor module.layer3.5.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.5.conv1 normal=layer3.5.conv1 module.layer3.5.conv1
In tensor module.layer3.6.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.6.conv1 normal=layer3.6.conv1 module.layer3.6.conv1
In tensor module.layer3.7.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.7.conv1 normal=layer3.7.conv1 module.layer3.7.conv1
In tensor module.layer3.8.conv1.weight found 19/64 zero channels
Could not find predecessors for name=module.layer3.8.conv1 normal=layer3.8.conv1 module.layer3.8.conv1
Created, applied and saved a thinning recipe
==> Best Top1: 93.640 on Epoch: 180
Saving checkpoint to: logs/2018.11.21-162715/checkpoint.pth.tar

Training epoch: 45000 samples (256 per mini-batch)
Traceback (most recent call last):
  File "/home/chongyu/application/distiller/examples/classifier_compression/compress_classifier.py", line 787, in <module>
    main()
  File "/home/chongyu/application/distiller/examples/classifier_compression/compress_classifier.py", line 384, in main
    loggers=[tflogger, pylogger], args=args)
  File "/home/chongyu/application/distiller/examples/classifier_compression/compress_classifier.py", line 466, in train
    output = model(inputs)
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 121, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chongyu/application/distiller/models/cifar10/resnet_cifar.py", line 140, in forward
    x = self.layer1(x)
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/modules/container.py", line 91, in forward
    input = module(input)
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chongyu/application/distiller/models/cifar10/resnet_cifar.py", line 70, in forward
    out = self.conv1(x)
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/chongyu/application/distiller/env/lib/python3.6/site-packages/torch/nn/modules/conv.py", line 301, in forward
    self.padding, self.dilation, self.groups)
RuntimeError: Given groups=1, weight of size [16, 7, 3, 3], expected input[256, 16, 32, 32] to have 7 channels, but got 16 channels instead

Can you kindly help for this error? Thanks.

cattpku commented 5 years ago

I think I may have found a clue for my Q1. In 'thinning.py' L328 and L357, the code looks for layers whose 'type' is 'Conv', 'Gemm' or 'BatchNormalization', but in 'model_summaries.py' L307 I found that the 'type' is 'aten::_convolution' or 'aten::batch_norm', so the 'type' values do not match each other.
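
To illustrate the mismatch (a hypothetical helper, not code that exists in Distiller), the op types produced by the PyTorch 0.4.1 trace would need to be normalized back to the ONNX-style names that thinning.py matches against:

    # Hypothetical mapping from PyTorch-0.4.1 trace op names to the ONNX-style
    # names expected by thinning.py ('Conv', 'Gemm', 'BatchNormalization').
    ATEN_TO_ONNX = {
        'aten::_convolution': 'Conv',
        'aten::batch_norm': 'BatchNormalization',
        'aten::addmm': 'Gemm',  # fully-connected layers; an assumption, added for completeness
    }

    def normalize_op_type(op_type):
        return ATEN_TO_ONNX.get(op_type, op_type)

    assert normalize_op_type('aten::_convolution') == 'Conv'
    assert normalize_op_type('aten::batch_norm') == 'BatchNormalization'
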

nzmora commented 5 years ago

Hi @cattpku ,

The PR I just merged only contains a fix to the original problem that you reported.
I suspect that I know the source of the other 2 issues you have - (1) and (2) above, but I can't be sure.
Can you share the definition of your network? It will help me recreate and understand the issue. If you can't share it, then in a few days I will push a fix for the problem I see (which occurs in networks that have complex data dependencies, such as ResNet), and hopefully it will also fix your issue.

Cheers, Neta

cattpku commented 5 years ago

Hi Neta,

Thanks for your reply. For Q1, the network definition is here https://github.com/DrSleep/light-weight-refinenet/blob/master/models/mobilenet.py, and for Q2, I was using the 'resnet56_cifar', which is from the example.

nzmora commented 5 years ago

Hi @cattpku,

I couldn't use this network because it has some issues when I export it to ONNX. I think this has to do with pytorch version compatibilities, but I didn't fight it.

I tried recreating (2). I don't see the problem that you describe, but I see a different issue - this is the issue I am working on and is related to my discussion with @vinutah in issue #73. To make sure we are aligned, I'm attaching the YAML schedule for (2), which I created according to the instruction you provided above (I changed the ending to .txt from .yaml so that I can attach it).

resnet56_cifar_filter_rank_cattpku.txt

The error I get is this:

--- validate (epoch=180)-----------
5000 samples (256 per mini-batch)
==> Top1: 86.380    Top5: 99.360    Loss: 0.468

==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Invoking create_thinning_recipe_channels
In tensor module.layer1.0.conv1.weight found 9/16 zero channels
In tensor module.layer1.1.conv1.weight found 9/16 zero channels
Traceback (most recent call last):
  File "compress_classifier.py", line 788, in <module>
    main()
  File "compress_classifier.py", line 407, in main
    compression_scheduler.on_epoch_end(epoch, optimizer)
  File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/scheduler.py", line 161, in on_epoch_end
    policy.on_epoch_end(self.model, self.zeros_mask_dict, meta)
  File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/thinning.py", line 374, in on_epoch_end
    self.thinning_func(model, zeros_mask_dict, self.arch, self.dataset, meta.get('optimizer', None))
  File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/thinning.py", line 147, in remove_channels
    thinning_recipe = create_thinning_recipe_channels(sgraph, model, zeros_mask_dict)
  File "/home/cvds_lab/nzmora/sandbox_5/distiller/distiller/thinning.py", line 266, in create_thinning_recipe_channels
    assert len(bn_layers) == 1
AssertionError

This is caused by the complex data dependencies that exist in ResNet, and I'm addressing this issue. I modified this notebook to show the dependencies and I'm attaching it here (again, I changed the file type to .txt from .ipynb).

pruning_channels_and_filters-cattpku.txt

The output looks like this:

(Attached image: cattpku2, the dependency graph described below.)

The green oval is the module whose channels we reduce; the red ovals are the modules (layers) that Distiller determined have a data dependency and need to be changed. The "thinning" code doesn't currently handle these dependencies well.

Cheers Neta

cattpku commented 5 years ago

Hi Neta,

I also tried to export it to ONNX, and I failed too...

I just compared the YAML schedule you sent with the one I used; they are exactly the same, but I reproduced the same error as my Q2. After debugging, I found the reason is the same as what I found for my Q1, that is:

    predecessors = sgraph.predecessors_f(normalize_module_name(layer_name), ['Conv'])

needs to be changed to

    predecessors = sgraph.predecessors_f(normalize_module_name(layer_name), ['aten::_convolution'])

and the same for bn_layers:

    bn_layers = sgraph.predecessors_f(normalize_module_name(layer_name), ['aten::batch_norm'])

With this modification, I got the same error as you show above.

nzmora commented 5 years ago

This is strange - I don't understand at the moment why you are seeing this.

Can you send me your environment configuration? You will find this information in any of the log files - in the top ~15 lines. Thanks Neta

cattpku commented 5 years ago

It is weird: when I use 'nvcc --version' to check the CUDA version it reports 'release 8.0, V8.0.61', so how come the log shows 'CUDA version: 9.0.176'?

2018-11-23 00:51:58,493 - Log file for this run: /home/chongyu/application/distiller/examples/classifier_compression/logs/2018.11.23-005158/2018.11.23-005158.log
2018-11-23 00:51:58,493 - Number of CPUs: 32
2018-11-23 00:51:58,516 - Number of GPUs: 1
2018-11-23 00:51:58,516 - CUDA version: 9.0.176
2018-11-23 00:51:58,516 - CUDNN version: 7102
2018-11-23 00:51:58,516 - Kernel: 4.15.0-36-generic
2018-11-23 00:51:58,516 - Python: 3.6.5 |Anaconda, Inc.| (default, Apr 29 2018, 16:14:56) [GCC 7.2.0]
2018-11-23 00:51:58,516 - PyTorch: 0.4.1
2018-11-23 00:51:58,516 - Numpy: 1.14.3
2018-11-23 00:51:58,531 - Git is dirty
2018-11-23 00:51:58,531 - Active Git branch: master
2018-11-23 00:51:58,538 - Git commit: ff6985adfc0e8b1b2a15a3b58d8675997f5e79d2
2018-11-23 00:51:58,538 - App args: ['/home/chongyu/application/distiller/examples/classifier_compression/compress_classifier.py']
2018-11-23 00:51:58,539 - ==> using cifar10 dataset

nzmora commented 5 years ago

Can you switch for a second to pytorch 0.4.0 to see if you still see 'aten::_convolution' and 'aten::batch_norm'? I have a hunch Facebook touched this ONNX code. Thanks

cattpku commented 5 years ago

You are right: once I switched to PyTorch 0.4.0, the 'type' changed to 'Conv' and 'BatchNormalization'. But I got a new error like: Unexpected key(s) in state_dict: "module.layer1.1.num_batches_tracked", "module.layer2.0.output.0.1.num_batches_tracked", ....... I have no idea where 'num_batches_tracked' comes from, but once I switch back to 0.4.1, no such error occurs.
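
For reference, one possible workaround (a hedged sketch, not something from Distiller; the file names are placeholders) is to strip those buffers from the 0.4.1 checkpoint before loading it under 0.4.0:

    # Hypothetical workaround: drop the BatchNorm 'num_batches_tracked' buffers
    # that PyTorch 0.4.1 adds to state_dicts, so a 0.4.0 model can load them.
    import torch

    checkpoint = torch.load('checkpoint.pth.tar', map_location='cpu')
    checkpoint['state_dict'] = {k: v for k, v in checkpoint['state_dict'].items()
                                if not k.endswith('num_batches_tracked')}
    torch.save(checkpoint, 'checkpoint_no_nbt.pth.tar')
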

nzmora commented 5 years ago

I think what you are seeing is related to this issue.

cattpku commented 5 years ago

Yes, thanks for the reminder. By the way, may I ask whether filter-wise or channel-wise pruning supports pruning group convolution layers? I am currently testing these two kinds of pruning on MobileNetV2 and ShuffleNetV2, both of which contain group convolutions. After fixing the errors I reported above, the provided pruning example runs smoothly, but my own pruning always fails with this error: cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCTensorCopy.cpp:70. After debugging, I found the error in 'thinning.py' at param.data = torch.index_select(param.data, dim, indices): the values in 'indices' are out of bounds for the size of 'param.data'. I am not sure why this happens; I hope you can suggest something. Thanks.
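
For what it's worth, here is a small standalone illustration (my own sketch, unrelated to Distiller's code) of why channel indices computed for a regular convolution can go out of bounds on a group convolution, since a grouped Conv2d stores only in_channels // groups channels per filter:

    # Sketch: the weight of a grouped Conv2d has shape
    # [out_channels, in_channels // groups, kH, kW], so indices computed from the
    # nominal in_channels overflow dim 1 and trigger the index_select failure.
    import torch
    import torch.nn as nn

    regular = nn.Conv2d(32, 64, kernel_size=3)            # weight: [64, 32, 3, 3]
    grouped = nn.Conv2d(32, 64, kernel_size=3, groups=4)  # weight: [64, 8, 3, 3]

    indices = torch.arange(16, dtype=torch.long)  # "keep 16 of the 32 input channels"
    torch.index_select(regular.weight.data, 1, indices)   # fine: dim 1 has 32 entries
    try:
        torch.index_select(grouped.weight.data, 1, indices)  # dim 1 only has 8 entries
    except RuntimeError as err:
        print('out-of-range index, analogous to the device-side assert:', err)
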

buttercutter commented 5 years ago

@cattpku Did you encounter https://github.com/NervanaSystems/distiller/issues/79#issuecomment-439948112 even after doing a git pull for the latest commit?

cattpku commented 5 years ago

@promach I also met this problem with PyTorch 0.4.1. I compared 'utils.py' between 0.4.0 and 0.4.1, and then I just commented out the part 'if operator_export_type != OperatorExportTypes.RAW'.
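
If you'd rather not edit PyTorch itself, an alternative I have not verified (hedged sketch; the model, the input, and the availability of these names in your PyTorch build are assumptions) would be to pass an explicit export type from the caller's side instead of the bare False that 0.4.1 no longer accepts:

    # Untested sketch: call _optimize_trace with an OperatorExportTypes value
    # (as 0.4.1 expects) rather than the boolean False used by the old code path.
    import torch
    import torch.nn as nn
    import torch.onnx

    model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.BatchNorm2d(8), nn.ReLU())
    dummy_input = torch.randn(1, 3, 32, 32)
    trace, _ = torch.jit.get_trace_graph(model, (dummy_input,))
    torch.onnx._optimize_trace(trace, torch.onnx.OperatorExportTypes.ONNX)
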

buttercutter commented 5 years ago

@cattpku I am not really sure whether your quick hack in utils.py actually solves everything, because it leads me to another error in functional.py, with which I am not familiar at all.

[phung@archlinux classifier_compression]$ time python compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --epochs=70 --lr=0.1 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml --resume=../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar -j=1 --deterministic
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.26-231318/2018.11.26-231318.log
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10


Logging to TensorBoard - remember to execute the server:

tensorboard --logdir='./logs'

=> loading checkpoint ../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar
Checkpoint keys: compression_sched best_top1 optimizer state_dict epoch arch
best top@1: 92.920
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml

L1RankedStructureParameterPruner - param: module.layer1.0.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.1.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.2.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.3.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.4.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.5.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.6.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.7.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer1.8.conv1.weight pruned=0.688 goal=0.700 (11/16)
L1RankedStructureParameterPruner - param: module.layer2.1.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.2.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.3.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.4.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.6.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer2.7.conv1.weight pruned=0.594 goal=0.600 (19/32)
L1RankedStructureParameterPruner - param: module.layer3.1.conv1.weight pruned=0.188 goal=0.200 (12/64)
L1RankedStructureParameterPruner - param: module.layer3.2.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.3.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.5.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.6.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.7.conv1.weight pruned=0.391 goal=0.400 (25/64)
L1RankedStructureParameterPruner - param: module.layer3.8.conv1.weight pruned=0.391 goal=0.400 (25/64)
Training epoch: 45000 samples (256 per mini-batch)
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10
Invoking create_thinning_recipe_filters
In tensor module.layer1.0.conv1.weight found 11/16 zero filters
In tensor module.layer1.1.conv1.weight found 11/16 zero filters
In tensor module.layer1.2.conv1.weight found 11/16 zero filters
In tensor module.layer1.3.conv1.weight found 11/16 zero filters
In tensor module.layer1.4.conv1.weight found 11/16 zero filters
In tensor module.layer1.5.conv1.weight found 11/16 zero filters
In tensor module.layer1.6.conv1.weight found 11/16 zero filters
In tensor module.layer1.7.conv1.weight found 11/16 zero filters
In tensor module.layer1.8.conv1.weight found 11/16 zero filters
In tensor module.layer2.1.conv1.weight found 19/32 zero filters
In tensor module.layer2.2.conv1.weight found 19/32 zero filters
In tensor module.layer2.3.conv1.weight found 19/32 zero filters
In tensor module.layer2.4.conv1.weight found 19/32 zero filters
In tensor module.layer2.6.conv1.weight found 19/32 zero filters
In tensor module.layer2.7.conv1.weight found 19/32 zero filters
In tensor module.layer3.1.conv1.weight found 12/64 zero filters
In tensor module.layer3.2.conv1.weight found 25/64 zero filters
In tensor module.layer3.3.conv1.weight found 25/64 zero filters
In tensor module.layer3.5.conv1.weight found 25/64 zero filters
In tensor module.layer3.6.conv1.weight found 25/64 zero filters
In tensor module.layer3.7.conv1.weight found 25/64 zero filters
In tensor module.layer3.8.conv1.weight found 25/64 zero filters
Created, applied and saved a thinning recipe

Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.11.26-231318/2018.11.26-231318.log
Traceback (most recent call last):
  File "compress_classifier.py", line 789, in <module>
    main()
  File "compress_classifier.py", line 386, in main
    loggers=[tflogger, pylogger], args=args)
  File "compress_classifier.py", line 467, in train
    output = model(inputs)
  File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/lib/python3.7/site-packages/torch/nn/parallel/data_parallel.py", line 124, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/models/cifar10/resnet_cifar.py", line 140, in forward
    x = self.layer1(x)
  File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/lib/python3.7/site-packages/torch/nn/modules/container.py", line 92, in forward
    input = module(input)
  File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/models/cifar10/resnet_cifar.py", line 71, in forward
    out = self.bn1(out)
  File "/usr/lib/python3.7/site-packages/torch/nn/modules/module.py", line 477, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/lib/python3.7/site-packages/torch/nn/modules/batchnorm.py", line 67, in forward
    exponential_average_factor, self.eps)
  File "/usr/lib/python3.7/site-packages/torch/nn/functional.py", line 1349, in batch_norm
    training, momentum, eps, torch.backends.cudnn.enabled
RuntimeError: running_mean should contain 5 elements not 16

real 0m7.277s
user 0m6.356s
sys 0m1.016s
[phung@archlinux classifier_compression]$

cattpku commented 5 years ago

@promach It is also caused by PyTorch 0.4.1; please check the posts above.