IntelLabs / distiller

Neural Network Distiller by Intel AI Lab: a Python package for neural network compression research. https://intellabs.github.io/distiller
Apache License 2.0

Automated Deep Compression status #64

Closed · amjad-twalo closed this issue 4 years ago

amjad-twalo commented 6 years ago

Hello there, I am wondering about the state of the ADC implementation and what remains to bring it to a functional state. In the ADC merge commit message, you mentioned that it is still a work in progress and that it uses an unreleased version of Coach. Is that still the case? Also, is there any documentation on how to use ADC in Distiller?

Thanks

nzmora commented 6 years ago

Hi,

Currently the status of ADC (now AMC: https://arxiv.org/abs/1802.03494) is unchanged. I'll update when we have something that can be shared.

Cheers Neta

amjad-twalo commented 6 years ago

Thanks for the response :) As far as I can tell, the implementation seems to be almost done. If the remaining work is clear and you're open to contributions, I can set aside some time to finish it up. I have been using Distiller for a while now, and it has saved me a lot of time. It would be awesome to have AMC up and running in it.

Cheers, Amjad

nzmora commented 6 years ago

Hi Amjad,

I'm happy to hear that you're using Distiller and find it useful! I'll be returning from Beijing in a couple of weeks, and then I'll spend some time synchronizing Distiller with the public Coach APIs; we can then see how to work together to get AMC working ASAP. I appreciate the help!

Cheers, Neta

amjad-twalo commented 5 years ago

Hey Neta, any update regarding this? I think I will have some time to work on it in the next couple of weeks.

Cheers, Amjad

nzmora commented 5 years ago

Sorry Amjad, I still haven't completed the move to the public Coach v0.11.0. I'm currently pushing code that's still integrated with an older, private branch of Coach.
I'll let you know as soon as I commit a version that works with the public Coach. Thanks, Neta

nzmora commented 5 years ago

Hi Amjad, I pushed a commit that integrates Distiller with the Coach master branch (it requires one PR I pushed to Coach; see details in the Distiller commit). Currently only R_flops (Accuracy-Guaranteed Compression) is enabled. It converges to a solution quickly after finishing the first 100 exploration episodes, but the converged solution is unsatisfactory. I tried it on Plain-20 and VGG16, both on CIFAR. There are several open issues, which I won't enumerate right now; first, I need to better understand what's going on.

Cheers, Neta

HKLee2040 commented 5 years ago

@nzmora

> NOTE: you may need to update TensorFlow to the expected version: $ pip3 install tensorflow==1.9.0

Does that mean I have to install cuda 9.0 if I want to try AMC?

nzmora commented 5 years ago

Hi @HKLee2040, no, installing TF 1.9.0 does not require upgrading CUDA.

Cheers Neta

nzmora commented 5 years ago

See https://github.com/NervanaSystems/distiller/blob/amc/examples/automated_deep_compression/amc-results.ipynb.

Work on AMC currently takes place in the 'amc' branch. Your help is more than welcome. Cheers, Neta

nzmora commented 5 years ago

After switching to Clipped PPO, I'm getting very encouraging results. See: https://github.com/NervanaSystems/distiller/wiki/AutoML-for-Model-Compression-(AMC):-Trials-and-Tribulations

huxianer commented 5 years ago

@nzmora Could you share plain20.checkpoint.pth.tar? Thanks!

nzmora commented 5 years ago

@huxianer the schedule file for training Plain20 is here. It took me about 33 minutes on 4 GPUs.

However, since you've asked :-), I've also uploaded the checkpoint here: https://drive.google.com/file/d/1bBhjjxkXjFHmqfTWKnxop3n6QCN8QfZJ/view?usp=sharing

Cheers, Neta

huxianer commented 5 years ago

@nzmora Thank you very much! I have another question: I found that the top1 performance is essentially unchanged when I don't use a pretrained model. So, if I don't have a pretrained model, what can I do?

nzmora commented 5 years ago

Hi @huxianer, I am not sure I understood your question, so I will answer according to what I understood.

I think you are asking how to use AMC if we don't have a pre-trained model of the network we are compressing.
The answer is that you must have a pre-trained model, because "We aim to automatically find the redundancy for each layer, characterized by sparsity. We train a reinforcement learning agent to predict the action and give the sparsity, then perform the pruning. We quickly evaluate the accuracy after pruning but before fine-tuning as an effective delegate of final accuracy" (section 3, page 4). You can only "find the redundancy for each layer" if you are searching a pre-trained model. If the model is not trained, you cannot find any redundancy, because the weights do not carry any meaning (they are randomly distributed).

I hope this helps, Neta

HKLee2040 commented 5 years ago

Why are smooth_top1 and smooth_reward overlapping in my "Performance Data" diagram? I have made some modifications: since I have only one GPU in my environment, I modified "conv_op = g.find_op(normalize_module_name(name))" to "conv_op = g.find_op(name)".

Also, args.amc_target_density was None, so I added args.amc_target_density = 0.5 to my code.

nzmora commented 5 years ago

Hi @HKLee2040

> I have some modifications:

I will need to fix the code for the case of one GPU.

> Why are smooth_top1 and smooth_reward overlapping in my "Performance Data" diagram?

I don't know which protocol you are using ("mac-constrained" or "accuracy-guaranteed"), but both rewards are highly correlated with the Top1 accuracy.

So it makes sense that you see an overlap when the graphs are smoothed (I smoothed using a simple moving average), because smoothing makes the noise less noticeable in both the reward and accuracy signals. You can see an example here.
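
In case it helps, smooth() is essentially just a rolling mean; here is a minimal sketch (assuming pandas; the notebook's actual implementation may differ in details):

import pandas as pd

def smooth(data, window_size):
    # Simple moving average; min_periods=1 keeps the leading points defined
    # instead of producing NaNs for the first window_size-1 episodes.
    if window_size <= 1:
        return pd.Series(data)
    return pd.Series(data).rolling(window_size, min_periods=1).mean()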

Having said that, I think you ask a good question. I think this is a clue as to why the reward defined in the AMC paper for accuracy-guaranteed compression is not so good. The solutions converge on maximum density for all layers (you can see this in the green bars here), probably because the agent tries to maximize the Top1 accuracy and not enough weight is given to the MACs (FLOPs) term in the reward (Eq. 5). This is my conjecture at the moment.
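
For reference, here is a rough sketch of the two reward formulations as I understand them (the "mac-constrained" reward is simply top1/100, as discussed below; the "accuracy-guaranteed" form follows the paper's R_flops = -Error * log(FLOPs); the exact code in Distiller may differ):

import math

def reward_mac_constrained(top1):
    # "mac-constrained": the reward is just the normalized Top1 accuracy;
    # the MACs budget is enforced by the environment, not by the reward.
    return top1 / 100.0

def reward_accuracy_guaranteed(top1, macs):
    # "accuracy-guaranteed": R = -Error * log(FLOPs). Since log(macs) changes
    # slowly relative to the error term, the agent is pushed mostly toward
    # maximizing accuracy, which is consistent with the max-density
    # convergence described above.
    error = 1.0 - top1 / 100.0
    return -error * math.log(macs)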

Thanks, Neta

nzmora commented 5 years ago

Hi @HKLee2040,

> My protocol is "mac-constrained". The reward fn should be top1/100. But why are the blue and green lines in your Performance Data so different?

Thanks for the persistence. The shift you see is an illusion (and causes confusion, I guess), caused by the fact that the reward and Top1 accuracy use different axes (Top1 on the left; reward on the right). The reward's range is [0..1] and the accuracy's is [0..100], and because their values are exactly correlated (reward = top1/100, as you wrote above) they should align. However, when we draw the MAC values, also on the left axis, they shift the axes relative to one another. You can see this if you disable the rendering of the MACs graphs, or if you set the ylim of each axis. For example:

# Assumes a pandas DataFrame df with per-episode columns ('top1', 'reward',
# 'normalized_macs', 'normalized_nnz') and the moving-average helper smooth(),
# both defined earlier in the notebook.
def plot_performance(alpha, window_size, top1, macs, params, reward, start=0, end=-1):
    plot_kwargs = {"figsize":(15,7), "lw": 1, "alpha": alpha, "title": "Performance Data"}
    smooth_kwargs = {"lw": 2 if window_size > 0 else 1, "legend": True}
    if macs:
        ax = df['normalized_macs'][start:end].plot(**plot_kwargs, color="r")
        ax.set(xlabel="Episode", ylabel="(%)", ylim=[0,100])
        df['smooth_normalized_macs'] = smooth(df['normalized_macs'], window_size)
        df['smooth_normalized_macs'][start:end].plot(**smooth_kwargs, color="r")
    if top1:
        ax = df['top1'][start:end].plot(**plot_kwargs, color="b", grid=True)
        ax.set(xlabel="Episode", ylabel="(%)", ylim=[0,100])
        df['smooth_top1'] = smooth(df['top1'], window_size)
        df['smooth_top1'][start:end].plot(**smooth_kwargs, color="b")
    if params:
        ax = df['normalized_nnz'][start:end].plot(**plot_kwargs, color="black")
        ax.set(xlabel="Episode", ylabel="(%)", ylim=[0,100])
        df['smooth_normalized_nnz'] = smooth(df['normalized_nnz'], window_size)
        df['smooth_normalized_nnz'][start:end].plot(**smooth_kwargs, color="black")        
    if reward:
        ax = df['reward'][start:end].plot(**plot_kwargs, secondary_y=True, color="g")
        ax.set(xlabel="Episode", ylabel="reward", ylim=[0,1.0])
        df['smooth_reward'] = smooth(df['reward'], window_size)
        df['smooth_reward'][start:end].plot(**smooth_kwargs, secondary_y=True, color="g")    
    ax.grid(True, which='minor', axis='x', alpha=0.3)
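
For example, calling it with the MACs graphs disabled (these argument values are only illustrative) makes it easy to see the reward and Top1 curves align:

plot_performance(alpha=0.5, window_size=10, top1=True, macs=False, params=False, reward=True)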

I uploaded my raw log files here, and you can load and try them.

Still, you ask why the graphs overlap for you and not for me. This is because, in my files, the big drop in the MACs (at episode 3474, to ~5%) causes the left and right axes to shift relative to each other and become unaligned.

Cheers Neta

HKLee2040 commented 5 years ago

Hi @nzmora

Got it! It was my carelessness; I didn't check the scale of the axes. Thanks for your detailed reply.

HKLee2040 commented 5 years ago

Hi @nzmora

May I know why you set pi_lr = 1e-4 and q_lr = 1e-3 in DDPG? Did you take these from arXiv:1811.08886, where they use a fixed learning rate of 1e-4 for the actor network and 1e-3 for the critic network?

    ddpg.ddpg(env=env1, test_env=env2, actor_critic=core.mlp_actor_critic,
              ac_kwargs=dict(hidden_sizes=[hid]*layers, output_activation=tf.sigmoid),
              gamma=1,  # discount rate
              seed=seed,
              epochs=400,
              replay_size=2000,
              batch_size=64,
              start_steps=env1.amc_cfg.num_heatup_epochs,
              steps_per_epoch=800 * env1.num_layers(),  # every 50 episodes perform 10 episodes of testing
              act_noise=0.5,
              pi_lr=1e-4,
              q_lr=1e-3,
              logger_kwargs=logger_kwargs)

nzmora commented 5 years ago

Hi @HKLee2040, I got these numbers from the DDPG paper, "Continuous control with deep reinforcement learning". Cheers, Neta

huxianer commented 5 years ago

@nzmora Hi, how do you create the pruning-schedule YAML file? Could you share the pruning-schedule YAML file for ResNet trained on ImageNet? Thanks!

nzmora commented 5 years ago

Hi @huxianer, I'm not sure which YAML file you are referring to. AMC/ADC currently works w/o YAML. There are some sample YAML files using other techniques, for example AGP. Cheers, Neta

huxianer commented 5 years ago

@nzmora @HKLee2040 I was referring to the YAML files in general; they are provided as-is, but nothing explains how to produce them. You say AMC/ADC currently works w/o YAML; could you give an example that runs without a YAML file? Thank you for your help!

HKLee2040 commented 5 years ago

Hi @huxianer

You can refer to nzmora's message https://github.com/NervanaSystems/distiller/issues/64#issuecomment-451766455

The command line is:

    python3 compress_classifier.py --arch=plain20_cifar ../../../data.cifar --amc --resume=checkpoint.plain20_cifar.pth.tar --lr=0.05 --amc-action-range 0.0 0.80 --vs=0.8

huxianer commented 5 years ago

@nzmora Hi, does Distiller support detection models? If not, do you have any intention to support them?

RizhaoCai commented 5 years ago

I am also interested in using AMC for detection models. Has there been any progress on this?

nzmora commented 5 years ago

Hi @huxianer , @RizhaoCai ,

I merged the revised AMC implementation into 'master'. You can now try our auto-compression code. I'll add more information on the setup soon.

It currently doesn't support object detection. @levzlotnik is working on adding an example of object detection, after which we will consider automating it. If you happen to integrate object detection with AMC, we'd be interested in considering it for integration into the Distiller code-base.

Cheers, Neta

wangyidong3 commented 5 years ago

Hi @levzlotnik @nzmora, thank you for your great work! Is there any update on the example of object detection with AMC? Or do you have any suggestions? Thanks.