IntelLabs / distiller

Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Apache License 2.0

Cannot allocate memory #112

Closed: buttercutter closed this issue 5 years ago

buttercutter commented 5 years ago

Any advice on how to work around the following memory allocation error? When I googled this error, I found https://github.com/facebookresearch/DrQA/issues/53, but I do not think the fix in that issue can be applied to distiller.

[phung@archlinux classifier_compression]$ time python compress_classifier.py -a=resnet56_cifar -p=50 ../../../data.cifar10 --epochs=70 --lr=0.1 --compress=../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml --resume=../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar -j=1 --deterministic
Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.12.29-084750/2018.12.29-084750.log
==> using cifar10 dataset
=> creating resnet56_cifar model for CIFAR10


Logging to TensorBoard - remember to execute the server:

tensorboard --logdir='./logs'

=> loading checkpoint ../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar
Checkpoint keys: compression_sched best_top1 optimizer state_dict epoch arch
best top@1: 92.920
Loaded compression schedule from checkpoint (epoch 179)
=> loaded checkpoint '../pruning_filters_for_efficient_convnets/checkpoints/checkpoint.resnet56_cifar_baseline.pth.tar' (epoch 179)
Optimizer Type: <class 'torch.optim.sgd.SGD'>
Optimizer Args: {'lr': 0.1, 'momentum': 0.9, 'dampening': 0, 'weight_decay': 0.0001, 'nesterov': False}
Files already downloaded and verified
Files already downloaded and verified
Dataset sizes: training=45000 validation=5000 test=10000
Reading compression schedule from: ../pruning_filters_for_efficient_convnets/resnet56_cifar_filter_rank_v2.yaml

L1RankedStructureParameterPruner - param: module.layer1.0.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.0.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.1.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.1.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.2.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.2.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.3.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.3.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.4.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.4.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.5.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.5.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.6.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.6.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.7.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.7.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer1.8.conv1.weight pruned=(11/16)
L1RankedStructureParameterPruner - param: module.layer1.8.conv1.weight pruned=0.688 goal=0.700
L1RankedStructureParameterPruner - param: module.layer2.1.conv1.weight pruned=(19/32)
L1RankedStructureParameterPruner - param: module.layer2.1.conv1.weight pruned=0.594 goal=0.600
L1RankedStructureParameterPruner - param: module.layer2.2.conv1.weight pruned=(19/32)
L1RankedStructureParameterPruner - param: module.layer2.2.conv1.weight pruned=0.594 goal=0.600
L1RankedStructureParameterPruner - param: module.layer2.3.conv1.weight pruned=(19/32)
L1RankedStructureParameterPruner - param: module.layer2.3.conv1.weight pruned=0.594 goal=0.600
L1RankedStructureParameterPruner - param: module.layer2.4.conv1.weight pruned=(19/32)
L1RankedStructureParameterPruner - param: module.layer2.4.conv1.weight pruned=0.594 goal=0.600
L1RankedStructureParameterPruner - param: module.layer2.6.conv1.weight pruned=(19/32)
L1RankedStructureParameterPruner - param: module.layer2.6.conv1.weight pruned=0.594 goal=0.600
L1RankedStructureParameterPruner - param: module.layer2.7.conv1.weight pruned=(19/32)
L1RankedStructureParameterPruner - param: module.layer2.7.conv1.weight pruned=0.594 goal=0.600
L1RankedStructureParameterPruner - param: module.layer3.2.conv1.weight pruned=(25/64)
L1RankedStructureParameterPruner - param: module.layer3.2.conv1.weight pruned=0.391 goal=0.400
L1RankedStructureParameterPruner - param: module.layer3.3.conv1.weight pruned=(25/64)
L1RankedStructureParameterPruner - param: module.layer3.3.conv1.weight pruned=0.391 goal=0.400
L1RankedStructureParameterPruner - param: module.layer3.5.conv1.weight pruned=(25/64)
L1RankedStructureParameterPruner - param: module.layer3.5.conv1.weight pruned=0.391 goal=0.400
L1RankedStructureParameterPruner - param: module.layer3.6.conv1.weight pruned=(25/64)
L1RankedStructureParameterPruner - param: module.layer3.6.conv1.weight pruned=0.391 goal=0.400
L1RankedStructureParameterPruner - param: module.layer3.7.conv1.weight pruned=(25/64)
L1RankedStructureParameterPruner - param: module.layer3.7.conv1.weight pruned=0.391 goal=0.400
L1RankedStructureParameterPruner - param: module.layer3.8.conv1.weight pruned=(25/64)
L1RankedStructureParameterPruner - param: module.layer3.8.conv1.weight pruned=0.391 goal=0.400
L1RankedStructureParameterPruner - param: module.layer3.1.conv1.weight pruned=(12/64)
L1RankedStructureParameterPruner - param: module.layer3.1.conv1.weight pruned=0.188 goal=0.200
Training epoch: 45000 samples (256 per mini-batch)

Log file for this run: /home/phung/Documents/Grive/Personal/Coursera/Machine_Learning/pruning/distiller/examples/classifier_compression/logs/2018.12.29-084750/2018.12.29-084750.log
Traceback (most recent call last):
  File "compress_classifier.py", line 816, in <module>
    main()
  File "compress_classifier.py", line 405, in main
    loggers=[tflogger, pylogger], args=args)
  File "compress_classifier.py", line 476, in train
    for train_step, (inputs, target) in enumerate(train_loader):
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 819, in __iter__
    return _DataLoaderIter(self)
  File "/usr/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 560, in __init__
    w.start()
  File "/usr/lib/python3.7/multiprocessing/process.py", line 112, in start
    self._popen = self._Popen(self)
  File "/usr/lib/python3.7/multiprocessing/context.py", line 223, in _Popen
    return _default_context.get_context().Process._Popen(process_obj)
  File "/usr/lib/python3.7/multiprocessing/context.py", line 277, in _Popen
    return Popen(process_obj)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 20, in __init__
    self._launch(process_obj)
  File "/usr/lib/python3.7/multiprocessing/popen_fork.py", line 70, in _launch
    self.pid = os.fork()
OSError: [Errno 12] Cannot allocate memory

real    0m15.349s
user    0m5.542s
sys     0m1.821s
[phung@archlinux classifier_compression]$
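If I read the traceback correctly, the failure happens when the DataLoader calls os.fork() to start its worker process. Below is a minimal sketch of the kind of workaround I have in mind, using only the standard PyTorch/torchvision API; the dataset path and batch size are placeholders and not necessarily what compress_classifier.py uses internally:

import torch
import torchvision
import torchvision.transforms as transforms

# num_workers=0 makes the DataLoader fetch batches in the main process,
# so multiprocessing never calls os.fork() and ENOMEM at fork time cannot occur.
train_set = torchvision.datasets.CIFAR10(
    root='../../../data.cifar10', train=True, download=True,
    transform=transforms.ToTensor())

train_loader = torch.utils.data.DataLoader(
    train_set, batch_size=256, shuffle=True,
    num_workers=0)  # 0 = load in the main process, no worker fork

If the -j/--workers flag of compress_classifier.py maps directly onto num_workers, then passing -j=0 might achieve the same thing, but I am not sure the script accepts 0.

Here is the nvidia-smi output from just after the failed run: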

[phung@archlinux classifier_compression]$ nvidia-smi
Sat Dec 29 08:49:23 2018       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 415.25       Driver Version: 415.25       CUDA Version: 10.0     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 950M    Off  | 00000000:01:00.0 Off |                  N/A |
| N/A   44C    P8    N/A /  N/A |      0MiB /  4046MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
[phung@archlinux classifier_compression]$
nzmora commented 5 years ago

Hi @promach,

Does this memory problem occur consistently? I see that you are using only one worker thread and the CIFAR dataset, so the memory load should not be that high. Have you checked what else is running and which processes are consuming your system's memory?

We haven't experienced this problem, and it appears to be local to your machine, so I'm afraid there's not much we can help with here.
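For example, something along these lines (just a rough sketch that parses /proc/meminfo, so Linux only; the function name is arbitrary) can show whether the machine is already low on free memory and swap before the DataLoader tries to fork:

# Rough sketch: print total/free/available memory and swap from /proc/meminfo.
# Run it right before compress_classifier.py to see whether the system is
# already short on memory, which would explain ENOMEM when os.fork() is called.
def read_meminfo():
    info = {}
    with open('/proc/meminfo') as f:
        for line in f:
            key, rest = line.split(':', 1)
            info[key] = int(rest.strip().split()[0])  # values are reported in kB
    return info

if __name__ == '__main__':
    mem = read_meminfo()
    for key in ('MemTotal', 'MemFree', 'MemAvailable', 'SwapTotal', 'SwapFree'):
        print('%-12s %10d kB' % (key, mem.get(key, 0)))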

Cheers, Neta

nzmora commented 5 years ago

Hi @promach,

I noticed that some Jupyter notebooks consume a lot of memory, which is not released until the Jupyter kernels are shut down. I didn't debug which notebooks leak memory, or why. The temporary solution is to shut down the notebooks once you are done with them.
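To check whether notebooks are the culprit, a rough sketch along these lines (Linux only; it matches process command lines heuristically) sums the resident memory held by running Jupyter/IPython kernel processes:

# Rough sketch (Linux only): sum the resident memory (VmRSS) of processes whose
# command line mentions jupyter or ipykernel, to gauge how much memory running
# notebook kernels are holding.
import glob

total_kb = 0
for status_path in glob.glob('/proc/[0-9]*/status'):
    proc_dir = status_path.rsplit('/', 1)[0]
    try:
        with open(proc_dir + '/cmdline', 'rb') as f:
            cmdline = f.read().replace(b'\x00', b' ').decode('utf-8', errors='replace')
        if 'jupyter' not in cmdline and 'ipykernel' not in cmdline:
            continue
        with open(status_path) as f:
            for line in f:
                if line.startswith('VmRSS:'):
                    total_kb += int(line.split()[1])  # value is in kB
    except (OSError, ValueError):
        continue  # process exited or is not readable

print('Jupyter/IPython kernel processes hold about %d kB resident' % total_kb)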

Cheers, Neta