dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Optimize Inference Performance on CPU #1035

Open · carter54 opened this issue 4 years ago

carter54 commented 4 years ago

Description

The release notes at https://github.com/dmlc/gluon-nlp/releases/tag/v0.8.1 mention that BERT int8 quantization is presented in the blog post https://medium.com/apache-mxnet/optimization-for-bert-inference-performance-on-cpu-3bb2413d376c. However, the blog only shows some results of the BERT quantization test, and notes:

The work on low precision deployment is still ongoing and involves un-released SW, the reproduction instructions will be available later.

When will this work be released, and can we apply this quantization method to GPT-2?

Thanks a lot for the great work!

leezu commented 4 years ago

@TaoLv

TaoLv commented 4 years ago

Sorry for missing the message. We're working on cleaning up the code and the solution; we hope to have a PR soon. I'm not familiar with the status of GPT-2 in GluonNLP. Could you please point me to the scripts, and let me know whether the model can be exported as a static model?

leezu commented 4 years ago

Yes, the static GPT-2 model was recently added: https://github.com/dmlc/gluon-nlp/pull/1010
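
To make "static" concrete: in Gluon this means hybridizing the network with static allocation and shapes, then serializing it to a symbol/params pair that downstream tools such as a quantization flow can consume. Below is a minimal sketch of that export pattern using a small stand-in network; the actual GPT-2 model lives in the scripts touched by the PR above, and the file prefix `net` is just a placeholder.

```python
import mxnet as mx
from mxnet.gluon import nn

# Stand-in network; in this thread it would be the GPT-2 model from PR #1010.
net = nn.HybridSequential()
net.add(nn.Dense(16, activation='relu'), nn.Dense(2))
net.initialize()

# static_alloc/static_shape let Gluon cache a fixed computation graph,
# which is what "static model" refers to here.
net.hybridize(static_alloc=True, static_shape=True)

# One forward pass traces the graph so it can be exported.
net(mx.nd.ones((1, 8)))

# Writes net-symbol.json and net-0000.params: the serialized static model.
net.export('net', epoch=0)
```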

carter54 commented 4 years ago

Thanks for the replies, @leezu @TaoLv. Looking forward to trying int8 BERT and GPT-2 soon!

TaoLv commented 4 years ago

@carter54 FYI, here is the PR for BERT quantization: #1080
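
For anyone who wants to preview the general flow before digging into that PR: MXNet ships a post-training quantization entry point in `mxnet.contrib.quantization` that rewrites an exported symbol/params checkpoint to int8. The sketch below is a minimal illustration of that API, not necessarily the exact approach taken in #1080; it assumes an MKL-DNN-enabled MXNet build, and the `net` checkpoint prefix follows the placeholder export sketch above.

```python
import mxnet as mx
from mxnet.contrib.quantization import quantize_model

# Load the exported static model (symbol + params) from a checkpoint.
sym, arg_params, aux_params = mx.model.load_checkpoint('net', 0)

# quantize_model rewrites supported ops (e.g. FullyConnected) to int8.
# calib_mode='none' skips offline calibration for brevity; a real
# deployment would pass calib_mode='naive' or 'entropy' together with
# a representative calibration DataIter.
qsym, qarg_params, qaux_params = quantize_model(
    sym, arg_params, aux_params,
    ctx=mx.cpu(),
    calib_mode='none',
    quantized_dtype='int8')

# Save the int8 model for CPU inference.
mx.model.save_checkpoint('net-int8', 0, qsym, qarg_params, qaux_params)
```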

carter54 commented 4 years ago

@TaoLv Thanks for the work. Can this method be applied to the GPT-2 model?