dmlc / gluon-nlp

NLP made easy
https://nlp.gluon.ai/
Apache License 2.0

Deploy BERT model - Script #1237

Closed. MoisesHer closed this 4 years ago.

MoisesHer commented 4 years ago

Description

Includes a script to deploy BERT for QA, classification, regression, and embedding tasks. It offers the possibility of using the available GPU BERT optimizations in MXNet. It reports latency and throughput, and can check accuracy.
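For reference, here is a minimal sketch (not the script itself) of the kind of export-and-benchmark workflow the script automates; the model name, shapes, and export prefix below are illustrative only:

```python
# Sketch: load a pretrained GluonNLP BERT encoder, hybridize, export, and time it.
# Model name, shapes, and export prefix are illustrative, not the script's arguments.
import time
import mxnet as mx
import gluonnlp as nlp

ctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()

# BERT-base encoder without pooler/decoder/classifier heads.
bert, vocab = nlp.model.get_model(
    'bert_12_768_12',
    dataset_name='book_corpus_wiki_en_uncased',
    pretrained=True, ctx=ctx,
    use_pooler=False, use_decoder=False, use_classifier=False)

bert.hybridize(static_alloc=True)

# Dummy batch: token ids, segment ids, and valid lengths.
batch_size, seq_len = 1, 128
inputs = mx.nd.ones((batch_size, seq_len), ctx=ctx)
token_types = mx.nd.zeros((batch_size, seq_len), ctx=ctx)
valid_length = mx.nd.array([seq_len], ctx=ctx)

# One forward pass builds the cached graph, then export symbol + params.
bert(inputs, token_types, valid_length)
mx.nd.waitall()
bert.export('bert_base_deploy')

# Crude latency measurement over a few repetitions.
n_reps = 10
start = time.time()
for _ in range(n_reps):
    bert(inputs, token_types, valid_length)
mx.nd.waitall()
print('avg latency: %.2f ms' % ((time.time() - start) / n_reps * 1000))
```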

Checklist

Essentials

Changes

Comments

cc @dmlc/gluon-nlp-team

mli commented 4 years ago

Job PR-1237/1 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/1/index.html

mli commented 4 years ago

Job PR-1237/2 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/2/index.html

mli commented 4 years ago

Job PR-1237/4 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/4/index.html

mli commented 4 years ago

Job PR-1237/3 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/3/index.html

mli commented 4 years ago

Job PR-1237/5 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/5/index.html

codecov[bot] commented 4 years ago

Codecov Report

Merging #1237 into master will increase coverage by 0.03%. The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #1237      +/-   ##
==========================================
+ Coverage   87.42%   87.45%   +0.03%     
==========================================
  Files          81       81              
  Lines        7346     7365      +19     
==========================================
+ Hits         6422     6441      +19     
  Misses        924      924              
Impacted Files                           Coverage Δ
src/gluonnlp/model/bert.py               94.65% <0.00%> (+0.03%) ↑
src/gluonnlp/model/transformer.py        91.71% <0.00%> (+0.05%) ↑
src/gluonnlp/model/language_model.py     98.64% <0.00%> (+0.15%) ↑

mli commented 4 years ago

Job PR-1237/6 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/6/index.html

mli commented 4 years ago

Job PR-1237/7 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/7/index.html

mli commented 4 years ago

Job PR-1237/8 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/8/index.html

mli commented 4 years ago

Job PR-1237/9 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/9/index.html

MoisesHer commented 4 years ago

> I assume the graph pass requires mxnet nightly build? Would it make sense to mention the minimum mxnet version required for this script in the doc?

Yes, I have added a comment in index.rst for TrueFP16 and the custom pass optimizations: "These GPU optimizations require MXNet version 1.7 or higher".
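As an illustration only (the exact mechanism in the script may differ), enforcing that version requirement in code could look like this:

```python
# Illustrative only: gate the MXNet >= 1.7 GPU optimizations (TrueFP16 and
# the custom graph pass) on the installed MXNet version.
from distutils.version import LooseVersion
import mxnet as mx

if LooseVersion(mx.__version__) < LooseVersion('1.7.0'):
    raise RuntimeError('These GPU optimizations require MXNet 1.7 or higher; '
                       'found version %s' % mx.__version__)
```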

mli commented 4 years ago

Job PR-1237/10 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/10/index.html

mli commented 4 years ago

Job PR-1237/11 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/11/index.html

chenw23 commented 4 years ago

I think there may be a dependency error in this code, based on the error log below:


[2020-06-20T22:33:01.774Z]   In file included from horovod/mxnet/mpi_ops.h:23:0,
[2020-06-20T22:33:01.774Z]                    from horovod/mxnet/mpi_ops.cc:20:
[2020-06-20T22:33:01.774Z]   /var/lib/jenkins/workspace/gluon-nlp-gpu-py3-master@2/conda/gpu/py3-master/lib/python3.5/site-packages/mxnet/include/mxnet/ndarray.h:41:10: fatal error: mkldnn.hpp: No such file or directory
[2020-06-20T22:33:01.774Z]    #include <mkldnn.hpp>
[2020-06-20T22:33:01.774Z]             ^~~~~~~~~~~~
[2020-06-20T22:33:01.774Z]   compilation terminated.
[2020-06-20T22:33:01.774Z]   error: command 'gcc' failed with exit status 1
[2020-06-20T22:33:01.774Z]   ----------------------------------------
[2020-06-20T22:33:01.774Z]   ERROR: Failed building wheel for horovod

Maybe some necessary files need to be installed or included?

mli commented 4 years ago

Job PR-1237/12 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/12/index.html

MoisesHer commented 4 years ago

> @MoisesHer it looks like bertpass_lib.so is not built. Compilation is triggered here: https://github.com/dmlc/gluon-nlp/pull/1237/files#diff-fa82d34d543ff657c2fe09553bd0fa34R433

Locally it works on my side, so I think the problem is with conda. Maybe it is not storing the library in the expected path, or not triggering the compilation? What would be the best way to reproduce the conda environment?
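For illustration, a compile-and-load step of this kind might look like the sketch below; the source file name, compiler flags, and include path are assumptions (the include path follows the wheel layout seen in the CI log above), and the explicit existence check is only there to make a missing library obvious in CI logs:

```python
# Hypothetical sketch of compiling and loading a custom graph-pass library at
# runtime; file names and compiler flags are assumptions, not the PR's code.
import os
import subprocess
import mxnet as mx

src = 'bertpass_gpu.cc'        # assumed name of the custom pass source
lib = 'bertpass_lib.so'
# Headers ship inside the wheel at site-packages/mxnet/include (see CI log).
include_dir = os.path.join(os.path.dirname(mx.__file__), 'include')

if not os.path.exists(lib):
    subprocess.check_call(
        ['g++', '-shared', '-fPIC', '-std=c++11', src, '-o', lib,
         '-I', include_dir])

# Fail loudly if the library still is not where we expect it, so the CI log
# shows whether compilation was skipped or the output path is wrong.
if not os.path.exists(lib):
    raise FileNotFoundError('custom pass library %s was not built' % lib)
mx.library.load(os.path.abspath(lib))
```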

mli commented 4 years ago

Job PR-1237/13 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/13/index.html

leezu commented 4 years ago

@MoisesHer you can refer to https://github.com/dmlc/gluon-nlp/blob/master/ci/prepare_clean_env.sh regarding the conda setup

MoisesHer commented 4 years ago

Do you know why I am getting this lint error? TypeError: '<' not supported between instances of 'str' and 'NoneType'. I tried installing miniconda3 and setting up the environment as @leezu suggested (https://github.com/dmlc/gluon-nlp/blob/master/ci/prepare_clean_env.sh), but I cannot reproduce it locally, and the error does not give any information about which line produces it.
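For what it is worth, that TypeError is the generic Python 3 error raised whenever a string is ordered against None, for example when sorting a list that unexpectedly contains None; a tiny illustration (not the actual failing code in the lint tooling):

```python
# Illustration only: ordering a str against None raises exactly this error in
# Python 3; somewhere the lint tooling is sorting a list that contains None.
sorted([None, 'mxnet'])
# TypeError: '<' not supported between instances of 'str' and 'NoneType'
```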

MoisesHer commented 4 years ago

@leezu thanks a lot for your help. That allowed me to make some progress. However, it seems that the lib_api.h being picked up at compilation is not the one contained in the wheel I am using here (https://repo.mxnet.io/dist/python/cu100/mxnet_cu100-1.7.0b20200809-py2.py3-none-manylinux2014_x86_64.whl): the header included at compilation does not define the JsonVal structure, but it is present in the wheel (https://github.com/apache/incubator-mxnet/blob/v1.7.x/include/mxnet/lib_api.h#L606).
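One quick way to check which lib_api.h the installed wheel actually ships, and whether it defines JsonVal (a diagnostic sketch, not part of the PR; the path follows the wheel layout seen in the CI log earlier):

```python
# Diagnostic sketch: locate lib_api.h inside the installed wheel and check
# whether it defines JsonVal.
import os
import mxnet as mx

header = os.path.join(os.path.dirname(mx.__file__), 'include', 'mxnet', 'lib_api.h')
print('header:', header)
with open(header) as f:
    content = f.read()
print('defines JsonVal:', 'JsonVal' in content)
print('mxnet version:', mx.__version__)
```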

leezu commented 4 years ago

See https://github.com/dmlc/gluon-nlp/pull/1325 for the doc fix.

mli commented 4 years ago

Job PR-1237/34 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/34/index.html

MoisesHer commented 4 years ago

I am not sure about the remaining issue: is it a timeout? If it is, can I avoid it? Thanks.

szha commented 4 years ago

@MoisesHer yes I think the current test takes too long. Could you try to reduce the time it takes by potentially reducing the workload?

MoisesHer commented 4 years ago

> @MoisesHer yes I think the current test takes too long. Could you try to reduce the time it takes by potentially reducing the workload?

Thanks. Another question: is there a way for me to trigger the CI checks again (without a new commit)?

chenw23 commented 4 years ago

> > @MoisesHer yes I think the current test takes too long. Could you try to reduce the time it takes by potentially reducing the workload?
>
> Thanks. Another question: is there a way for me to trigger the CI checks again (without a new commit)?

Sure, you can just click into the Details link of the check to be directed to the Jenkins page, then click the Log in button in the upper-right corner. After logging in, click the Rerun button (it looks like an arrowed circle) in the upper-right corner.

mli commented 4 years ago

Job PR-1237/37 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1237/37/index.html

MoisesHer commented 4 years ago

I am confused, not sure why this is failing now: MXNetError: Check failed: compileResult == NVRTC_SUCCESS (6 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0.
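For reference, the workaround named in the error message amounts to the following sketch (it sidesteps rather than fixes the NVRTC failure):

```python
# Workaround sketch only: disable MXNet's pointwise operator fusion, as the
# error message itself suggests. Set the variable before any fused GPU op runs.
import os
os.environ['MXNET_USE_FUSION'] = '0'

import mxnet as mx
# ... run the test / benchmark as usual ...
```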

MoisesHer commented 4 years ago

> I am confused, not sure why this is failing now: MXNetError: Check failed: compileResult == NVRTC_SUCCESS (6 vs. 0) : NVRTC Compilation failed. Please set environment variable MXNET_USE_FUSION to 0.

Are all those expand_dims expected?

szha commented 4 years ago

@MoisesHer looks like a compatibility issue. We will address this in a separate PR. Thanks for pushing this through!