NVIDIA / apex

A PyTorch Extension: Tools for easy mixed precision and distributed training in Pytorch
BSD 3-Clause "New" or "Revised" License

Slow Performance with "Exhaustive Search" Permutation Strategy for Channel Pruning in CNN #1826

Open Ulorewien opened 3 months ago

Ulorewien commented 3 months ago

Describe the Bug

I am using the Apex library to prune a CNN model on the CIFAR-100 dataset. The model has five convolutional layers whose channels can be pruned, with 128, 256, and 512 channels per layer. According to the library's documentation, when the number of channels is at most 2048, the "exhaustive search" strategy is used to maximize the accuracy of the structured sparse network.

However, when I ran the pruning process on a P100 GPU, it took over 8 hours and still wasn't complete. This seems unusually slow, especially considering that the total number of channels in my model is significantly below the 2048 threshold. Given the lengthy execution time, I suspect there might be an issue with the "exhaustive search" strategy or that the channel threshold of 2048 might be too high for practical use.

Suggestion

I recommend revisiting the threshold for using the "exhaustive search" strategy. A lower threshold, possibly around 16 or 32 channels, might be more appropriate and could prevent such long execution times. Alternatively, optimizing the "exhaustive search" strategy for models with a channel count close to 2048 could also improve performance.
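As a rough back-of-the-envelope illustration of why a threshold near 2048 seems impractical (this is only a counting argument, not a description of Apex's actual search algorithm), consider how quickly the number of ways to split C channels into groups of 4 grows:

```python
from math import factorial

def channel_groupings(c, group_size=4):
    # Number of ways to partition c channels into unordered groups of `group_size`:
    # c! / ((group_size!)**g * g!) with g = c // group_size.
    # A rough proxy for a worst-case exhaustive grouping search, not Apex's actual algorithm.
    g = c // group_size
    return factorial(c) // (factorial(group_size) ** g * factorial(g))

for c in (8, 16, 32, 64, 128):
    print(f"{c:4d} channels -> {channel_groupings(c):.3e} groupings")
```

Even at 16 channels this is already about 2.6 million groupings, and at 128 it is far beyond anything enumerable, which is why a cutoff around 16 or 32 seems more realistic if the search really is exhaustive.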

Minimal Steps/Code to Reproduce the Bug

1. Use the CNN model defined below with the given hyperparameters.

   ```
   ConvNet(
     (layers): Sequential(
       (0): Conv2d(3, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (2): ReLU()
       (3): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1))
       (4): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (5): ReLU()
       (6): MaxPool2d(kernel_size=(3, 3), stride=2, padding=1, dilation=1, ceil_mode=False)
       (7): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (8): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (9): ReLU()
       (10): Conv2d(256, 256, kernel_size=(3, 3), stride=(1, 1))
       (11): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (12): ReLU()
       (13): MaxPool2d(kernel_size=(3, 3), stride=2, padding=0, dilation=1, ceil_mode=False)
       (14): Dropout(p=0.25, inplace=False)
       (15): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
       (16): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (17): ReLU()
       (18): Conv2d(512, 512, kernel_size=(3, 3), stride=(1, 1))
       (19): BatchNorm2d(512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
       (20): ReLU()
       (21): MaxPool2d(kernel_size=(3, 3), stride=1, padding=0, dilation=1, ceil_mode=False)
       (22): Dropout(p=0.25, inplace=False)
       (23): Flatten(start_dim=1, end_dim=-1)
       (24): Linear(in_features=2048, out_features=1024, bias=True)
       (25): ReLU()
       (26): Dropout(p=0.5, inplace=False)
       (27): Linear(in_features=1024, out_features=100, bias=True)
     )
   )
   ```

   Hyperparameters: `n_epochs = 30`, `batch_size = 64`, `learning_rate = 1e-3`
2. Use the CrossEntropy loss function and the AdamW optimizer.
3. Apply pruning to the model with NVIDIA's Apex library for the CIFAR-100 dataset.
4. Run the pruning process on a P100 GPU.

A self-contained sketch of these steps is included right after this list.
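For convenience, here is a sketch of steps 1-3 (the training loop is elided). The `ASP.prune_trained_model` call is the standard two-line Apex ASP recipe and stands in for the exact pruning invocation I used, so treat it as an approximation of my setup rather than the literal code:

```python
import torch
import torch.nn as nn
from apex.contrib.sparsity import ASP

class ConvNet(nn.Module):
    """CNN matching the printed module structure above."""
    def __init__(self, n_classes=100):
        super().__init__()
        self.layers = nn.Sequential(
            nn.Conv2d(3, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.Conv2d(128, 128, 3), nn.BatchNorm2d(128), nn.ReLU(),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(128, 256, 3, padding=1), nn.BatchNorm2d(256), nn.ReLU(),
            nn.Conv2d(256, 256, 3), nn.BatchNorm2d(256), nn.ReLU(),
            nn.MaxPool2d(3, stride=2),
            nn.Dropout(0.25),
            nn.Conv2d(256, 512, 3, padding=1), nn.BatchNorm2d(512), nn.ReLU(),
            nn.Conv2d(512, 512, 3), nn.BatchNorm2d(512), nn.ReLU(),
            nn.MaxPool2d(3, stride=1),
            nn.Dropout(0.25),
            nn.Flatten(),
            nn.Linear(2048, 1024), nn.ReLU(), nn.Dropout(0.5),
            nn.Linear(1024, n_classes),
        )

    def forward(self, x):
        return self.layers(x)

model = ConvNet().cuda()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

# ... train for n_epochs = 30 on CIFAR-100 with batch_size = 64 ...

# Pruning step where the 8+ hour runtime is observed.
ASP.prune_trained_model(model, optimizer)
```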

Expected Behavior

The pruning process should complete in a reasonable amount of time, especially with the "exhaustive search" strategy, given that the total number of channels is significantly below the 2048 threshold.

Actual Behavior

The pruning process takes an excessive amount of time (over 8 hours) and does not complete.

Environment

- Python: 3.10.12
- PyTorch: 2.3.1+cu121
- GPU: NVIDIA P100
- Dataset: CIFAR-100
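(The versions above were collected with a quick snippet along these lines:)

```python
import sys
import torch

print("Python:", sys.version.split()[0])
print("PyTorch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0) if torch.cuda.is_available() else "none")
```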

Thank you for your attention to this issue. I look forward to any insights or suggestions you might have.