PaddlePaddle / Paddle

PArallel Distributed Deep LEarning: Machine Learning Framework from Industrial Practice (PaddlePaddle core framework: high-performance single-machine and distributed training, and cross-platform deployment, for deep learning & machine learning)
http://www.paddlepaddle.org/
Apache License 2.0

Adding Kurtosis regularizer as option at user level (python interface) #28522

Closed andreazanetti closed 3 years ago

andreazanetti commented 4 years ago

Hi, this is a request for information about whether or not you are working on supporting a Kurtosis regularizer natively in the Paddle user interface. This regularizer pushes weight tensors (and/or activation tensors) towards a more uniform distribution, which is ideal for later quantization of the trained model; we have the QAT/Quant pipeline in mind. The kurtosis regularizer is introduced here: https://arxiv.org/pdf/2002.07686.pdf

To implement the regularizer one has, in short, to extract the weight tensors and all activations from the training graph at each iteration, compute the kurtosis for each tensor, take the mean, and add this to the total loss function to minimize. Adding this regularizer requires some code changes in the model, but it might be natively supported by the framework and enabled with something like Kure_reg(on_weights=True, on_activations=False).

Please let us know whether you are working on it, or whether contributions around it are welcome. Thanks
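
To make the idea concrete, here is a minimal sketch of the loss term we have in mind (illustrative only, not an existing Paddle API; names like kure_coeff are made up, Paddle 2.x dygraph ops are assumed, and the squared distance to a target kurtosis of 1.8 follows the referenced paper, whereas the description above simply averages the kurtosis values):

    import paddle

    def tensor_kurtosis(t, eps=1e-6):
        # Kurtosis(T) = E[((T - mean_T) / std_T)^4], built from differentiable ops
        z = (t - paddle.mean(t)) / (paddle.std(t) + eps)
        return paddle.mean(z ** 4)

    def kurtosis_regularizer(tensors, target=1.8):
        # Penalize the distance of each tensor's kurtosis from the target
        # (1.8 is the kurtosis of a uniform distribution), then average.
        losses = [(tensor_kurtosis(t) - target) ** 2 for t in tensors]
        return paddle.add_n(losses) / len(losses)

    # hypothetical usage inside a training loop:
    # loss = task_loss + kure_coeff * kurtosis_regularizer(weight_tensors)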

paddle-bot-old[bot] commented 4 years ago

Hi! We have received your issue and will arrange for an engineer to respond; the average response time is expected to be within one day, so please be patient. Please double-check that you have provided a clear problem description, reproduction code, environment & version, and error messages. You may also check the official API documentation, the FAQ, the issue history, and the AI community for an answer. Have a nice day!

wzzju commented 4 years ago

Hi! Thanks for your suggestion. You are welcome to contribute the Kurtosis regularizer to PaddlePaddle. When you open the pull request, please let me know and add me as a reviewer. Feel free to ask me any questions!

andreazanetti commented 4 years ago

Closing this issue for now. We will open a new one once we have something ready on this topic. Thanks.

paddle-bot-old[bot] commented 4 years ago

Are you satisfied with the resolution of your issue?

andreazanetti commented 3 years ago

Hi, reopening this issue to continue the discussion as we would like to:

  1. share some data from experiments
  2. get your feedback about a specific proposal we have.

1. Share some data from experiments

We spent some time studying how to apply the Kurtosis regularizer to pre-trained attention-based models like Ernie for sentiment classification, which has commercial relevance. Applying this kind of regularizer to pretrained models appears to be somewhat more complex than for models trained from scratch, because at the end of pretraining the tensor distributions can be such that the kurtosis statistics of the considered tensors span several orders of magnitude.

In our case we focused on Ernie, and for training we used a single GPU with 6077 MiB of memory. Due to this limitation we had to run our experiments with BS=20, whereas the Ernie results from Baidu that we have seen were generated with BS=24. However, we are under the impression that the general indication of the results (namely that Kurtosis seems to be beneficial for int8 quantization of pretrained models like Ernie) still applies. Results:

[Results chart (screenshot, 2021-03-04): int8/QAT accuracy with the Kurtosis regularizer (green) vs. without (yellow)]

The results were generated using the script /home/shared/Paddle/python/paddle/fluid/contrib/slim/tests/quant2_int8_nlp_comparison.py.

The chart above was generated by introducing multiplicative noise on the scales used for the quantized weight tensors; this was done by perturbing the weight scales in quant2_int8_mkldnn_pass.py.

The procedure was:

Starting with no noise (std = 0):

  1. evaluate the int8/QAT validation accuracy of Ernie with and without the Kurtosis regularizer (*)
  2. collect data
  3. increase the std of the noise and update the weight tensor scales
  4. repeat

(*) The Kurtosis regularizer was computed excluding the weight tensors of the 2nd layer of each fully connected block at the top of every attention block, since after pretraining these tensors have a distribution that makes their kurtosis too high, so we left them untouched (out of the regularizer computation).

To our surprise, we noticed that within certain limits (to which the chart above is restricted) the int8 accuracy can increase with noise added to the weight tensor scales. We wonder what the origin of this could be. For consistency, we are still reviewing our process for a potential flaw/reason, but it might also be that a suboptimal scale is estimated during the QAT-int8 conversion.

2. Get your feedback about a specific proposal we have

Above, we reported results of some experiments with Ernie for sentiment classification, quantized using the QAT approach with the Kurtosis regularizer applied in an ad-hoc way based on the tensor distributions after Ernie pre-training. We are under the impression that there is potential to make the positive effect of the Kurtosis regularizer easy to use for final users, but this requires further study and clarification of some points that are still unclear, namely the one mentioned above about applying noise to scales. Would you be interested in jointly studying this case, to check whether it is possible to extend the Python API so users can leverage the Kurtosis regularizer effect for int8 quantization in a very easy-to-use way?

lidanqing-intel commented 3 years ago

@wzzju @juncaipeng @luotao1
@andreazanetti ran some experiments, and the results show that adding kurtosis to fake-quant model training speeds up convergence and improves accuracy. In the chart, the green line is fake-quant training with kurtosis and the yellow line is without kurtosis; the detailed parameters are given above. He would like to ask whether you have any suggestions: can he add this Kurtosis option to the Python fake-quant training API? If it can be added to the API, where should it go?

If anything in the English part above is unclear, we can provide more explanation here.

juncaipeng commented 3 years ago

OK, I will study the paper and look into how to add this API.

andreazanetti commented 3 years ago

Thank you. The paper applies the Kurtosis regularizer mainly to vision models, and in any case neither to pre-trained nor to attention-based models (like Ernie), as far as we understand it. The work we have done tries to apply the Kurtosis regularizer to Ernie for sentiment classification. To do this, the Kurtosis regularizer cannot be applied out-of-the-box, as some weight tensors after Ernie pre-training have much higher kurtosis than the others. The overall idea is to study the weight distributions after pre-training and apply Kurtosis regularization (i.e. make them more uniform/less peaked) only to the least critical weight tensors during QAT fine-tuning. This should make the int8 quantization less impactful on the final accuracy, as shown in the chart above.

juncaipeng commented 3 years ago

@andreazanetti
In my opinion, you can refer to L2 regularization and add the Kurtosis regularizer to Paddle. When users define the model, they apply the Kurtosis regularizer before using QAT.

As for the procedure described above, I don't understand the reason for adding noise. Can you explain the details?

andreazanetti commented 3 years ago

@juncaipeng Thank you very much for looking into it and sharing your suggestions! It seems very sensible to add Kurtosis support by "copying" the way the L2 regularizer is implemented and changing only the computation of the loss component. I hope I understand your suggestion the right way :).

However, for pretrained models like Ernie, some weight tensors tend to have extremely high kurtosis right after pretraining, which makes applying the regularizer a bit more complex and not exactly matching the way it is presented in the reference paper mentioned above. From the user perspective, it would be nice not to be forced to know exactly which weight tensors are critical and which ones can be regularized. Thus, it would be nice to add this support on a per-model basis, although that looks like a second step, just in case. The first step is surely as you suggest (if I understand correctly): take L2 regularization as a model and create another object, called Kurtosis_reg, that does the job.

With regards to adding noise to scales: in the reference paper, Figure 4 on page 6 shows the robustness of the accuracy with respect to the quantization step in the case of uniform quantization. The charts show that with the Kurtosis regularizer, even if we do not pick the best quantization interval, we end up with a more stable int8 accuracy. To investigate this aspect in our current QAT procedure, we inserted multiplicative noise ( scale * N(1, std) ) on the scales estimated for the weight tensors, to "simulate" in a simple way the event of picking a wrong quantization interval. To our surprise, we noted that most of the time std = 0 (no noise) leads to worse int8 accuracy than a std > 0 that is still relatively small (say 0.3). This raised the question: is the procedure in use optimal with respect to the scales used for the weight tensors in the QAT process?
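
For clarity, the perturbation is essentially the following (a simplified NumPy sketch; weight_scales is a stand-in for the per-tensor scales collected by the pass, not an actual variable name in quant2_int8_mkldnn_pass.py):

    import numpy as np

    def perturb_scales(weight_scales, std, seed=0):
        # Multiply each estimated weight scale by a factor drawn from N(1, std),
        # simulating the choice of a (slightly) wrong quantization interval.
        # std = 0 reproduces the unperturbed scales.
        rng = np.random.default_rng(seed)
        return {name: s * rng.normal(loc=1.0, scale=std, size=np.shape(s))
                for name, s in weight_scales.items()}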

juncaipeng commented 3 years ago

@andreazanetti Thank you for your contribution to Paddle!

It seems very sensible to add Kurtosis support by "copying" the way the L2 regularizer is implemented and changing only the computation of the loss component. I hope I understand your suggestion the right way :).

Yes, it is the first step.

However, for pretrained models like Ernie, some weight tensors tend to have extremely high kurtosis right after pretraining, which makes applying the regularizer a bit more complex and not exactly matching the way it is presented in the reference paper mentioned above. From the user perspective, it would be nice not to be forced to know exactly which weight tensors are critical and which ones can be regularized.

For any model, is there a way to automatically select the weights that are not suitable for applying the Kurtosis regularizer? If some weights have high kurtosis, are they not suitable for applying the Kurtosis regularizer?

From the Paddle framework perspective, the Kurtosis regularizer to be added should be as simple as the L2 regularizer. So, it is up to the users or PaddleSlim to decide which weights the Kurtosis regularizer is applied to.

The simplest way: the users first define a param_attr that uses the Kurtosis regularizer, and then pass that param_attr as an input parameter to conv2d or fc; the weights of those layers then have the Kurtosis regularizer applied.
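
For reference, this is roughly how the L2 regularizer is attached today via ParamAttr in the Paddle 2.x API; a Kurtosis regularizer could presumably plug into the same slot (KurtosisDecay below is hypothetical and does not exist yet):

    import paddle

    # existing usage: L2 weight decay on a single layer's weights
    fc = paddle.nn.Linear(
        768, 2,
        weight_attr=paddle.ParamAttr(
            regularizer=paddle.regularizer.L2Decay(coeff=1e-4)))

    # the proposal would follow the same pattern, e.g. (hypothetical):
    # weight_attr=paddle.ParamAttr(regularizer=KurtosisDecay(coeff=1.0))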

andreazanetti commented 3 years ago

For any model, is there a way to automatically select the weights that are not suitable for applying the Kurtosis regularizer? If some weights have high kurtosis, are they not suitable for applying the Kurtosis regularizer?

In principle the Kurtosis regularizer could be applied in any case, but there happen to be cases in which all the involved tensors have kurtosis within the same order of magnitude (e.g. at the start of training, when initial weights are usually set with well-known distributions), and others in which the involved tensors have kurtosis values that span different orders of magnitude (e.g. after pretraining, as in the Ernie/Bert case). In the latter case the simplest approach I can think of is just to leave those "critical" tensors out of the Kurtosis regularization, but other strategies are possible, I guess. For example, one strategy could be to compute the kurtosis for all tensors, then the mean and variance/std of those kurtosis values, and exclude from the regularizer the tensors whose kurtosis is not in [mean - 3*std, mean + 3*std], under the naive hypothesis that the kurtosis values are normally distributed.
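
A minimal sketch of that selection rule (purely illustrative; kurtosis_fn stands for whatever per-tensor kurtosis computation ends up being used, and the 3-sigma cut is just the naive hypothesis mentioned above):

    import numpy as np

    def select_regularizable(params, kurtosis_fn):
        # Keep only the tensors whose kurtosis lies within mean +/- 3*std of the
        # kurtosis values measured across all tensors; exclude the outliers.
        ks = np.array([float(kurtosis_fn(p)) for p in params])
        lo, hi = ks.mean() - 3 * ks.std(), ks.mean() + 3 * ks.std()
        return [p for p, k in zip(params, ks) if lo <= k <= hi]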

To start small, as a first step I will try to add a version of the Kurtosis regularizer that acts in a similar fashion to the L2 regularizer. However, I need to ask some questions. Apologies in advance, as I feel these are dummy questions:

a) why is L2DecayRegularizer computed as a scale op? I expected the parameter to be squared and then multiplied by the coefficient. I have a similar question for the L1 regularizer (same file), where coef * sign(param) is returned. Maybe this should be asked in a different issue, but it also raises doubts about how to insert Kurtosis into the "picture".

b) to create support for the Kurtosis regularizer, I guess I should add the appropriate ops to the graph. I see here that the ops for the L2 regularizer, to which I am referring, are added via block.append_op(...). Do you recommend framing the Kurtosis computation using operators created via block.append_op(...)? Since Kurtosis(T) = E[((T - mean_T)/std_T)^4], I was wondering if there could be a simpler way (maybe just using fluid). Would that make sense? Thanks

andreazanetti commented 3 years ago

Adding to point b) above: I see 3 possible ways to do it, with reference to the following code, taken from the __call__(self, param, grad, block) method of the L2DecayRegularizer class in regularizer.py:

    inputs = {"X": [param]}
    attrs = {"scale": self._regularization_coeff}

    if framework.in_dygraph_mode():
        # dygraph mode: the decay term is simply coeff * param, computed eagerly
        return core.ops.scale(param, "scale", self._regularization_coeff)
    else:
        # static mode: create an output variable of the same shape as param
        decay = block.create_var(
            dtype=param.dtype, shape=param.shape, lod_level=param.lod_level)

        # Append Op to calculate decay = coeff * param
        block.append_op(
            type='scale',
            inputs={"X": param},
            outputs={"Out": decay},
            attrs={"scale": self._regularization_coeff})

        return decay

1) add Kurtosis as a basic operation for a parameter at the C++ level, and make the computation of the kurtosis of the passed parameter param a single operation at graph level, either dynamic or static

2) add all the operations needed to compute the kurtosis of the passed parameter param using either core.ops. or block.append_op(type=...), according to whether the graph in use is dynamic or static, i.e. whether framework.in_dygraph_mode() is True or False.

3) add the Kurtosis operation for the passed parameter param using the fluid interface. This seems a bit off the beaten path, as here in the __call__(self, param, grad, block) method we receive block and grad, but it might be OK, and easier to put together; a rough sketch is given below.
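
To illustrate what option 3 might look like, here is a rough, untested sketch that builds the kurtosis of param from fluid layer calls (the name kurtosis_of is made up; how the resulting scalar would then be wired into the regularization machinery, which for L2 expects a decay tensor shaped like param, is exactly the open point from question a):

    import paddle.fluid as fluid

    def kurtosis_of(param, eps=1e-6):
        # Kurtosis(T) = E[((T - mean_T) / std_T)^4], expressed with fluid layers
        mean = fluid.layers.reduce_mean(param)
        centered = param - mean
        std = fluid.layers.sqrt(fluid.layers.reduce_mean(centered * centered) + eps)
        return fluid.layers.reduce_mean(fluid.layers.pow(centered / std, factor=4.0))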

It would be great if you could share your thoughts about these 3 points. Thanks!!

PS: With regards to point a) above - why the L2 and L1 regularizations are computed the way they are now in Paddle (see here and here) - I think I will open a new issue.

paddle-bot-old[bot] commented 3 years ago

Are you satisfied with the resolution of your issue?
