MoisesHer commented 3 years ago

Description

Change custom graph pass implementation to make it compatible with MXNet 1.8 Solving issue https://github.com/dmlc/gluon-nlp/issues/1388

Checklist

Essentials

[x] Changes are complete (i.e. I finished coding on this PR)
[x] All changes have test coverage
[x] Code is well-documented

Changes

[x] Change custom graph pass to support both MXNet 1.7 & MXNet 1.8
[x] Change setup and deploy scripts accordingly
[x] Activate CUDA Graphs for MXNet 1.8 (> 30% speedup with small batch sizes)

cc @dmlc/gluon-nlp-team, @samskalicky, @Kh4L

mli commented 3 years ago

Job PR-1389/1 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/1/index.html

mli commented 3 years ago

Job PR-1389/2 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/2/index.html

MoisesHer commented 3 years ago

I am getting the following error, not sure why:

[2020-10-16T20:59:59.298Z] ERROR: Could not find a version that satisfies the requirement scikit-learn==0.23.2 (from seqeval->-r /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/env/gpu/condaenv.vu5hlg68.requirements.txt (line 34)) (from versions: 0.9, 0.10, 0.11, 0.12, 0.12.1, 0.13, 0.13.1, 0.14, 0.14.1, 0.15.0b1, 0.15.0b2, 0.15.0, 0.15.1, 0.15.2, 0.16b1, 0.16.0, 0.16.1, 0.17b1, 0.17, 0.17.1, 0.18rc2, 0.18, 0.18.1, 0.18.2, 0.19b2, 0.19.0, 0.19.1, 0.19.2, 0.20rc1, 0.20.0, 0.20.1, 0.20.2, 0.20.3, 0.20.4, 0.21rc2, 0.21.0, 0.21.1, 0.21.2, 0.21.3, 0.22rc2.post1, 0.22rc3, 0.22, 0.22.1, 0.22.2.post1)
[2020-10-16T20:59:59.298Z] ERROR: No matching distribution found for scikit-learn==0.23.2 (from seqeval->-r /var/lib/jenkins/workspace/gluon-nlp-gpu-py3@2/env/gpu/condaenv.vu5hlg68.requirements.txt (line 34))
[2020-10-16T20:59:59.298Z] CondaEnvException: Pip failed

mli commented 3 years ago

Job PR-1389/3 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/3/index.html

TristonC commented 3 years ago

@szha Could we have someone help for the CI failure?

mli commented 3 years ago

Job PR-1389/4 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/4/index.html

mli commented 3 years ago

Job PR-1389/5 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/5/index.html

mli commented 3 years ago

Job PR-1389/6 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/6/index.html

mli commented 3 years ago

Job PR-1389/7 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/7/index.html

mli commented 3 years ago

Job PR-1389/9 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/9/index.html

mli commented 3 years ago

Job PR-1389/10 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/10/index.html

MoisesHer commented 3 years ago

@szha @TristonC This will never pass the CI tests unless some of the wheels on https://dist.mxnet.io/python include mxnet-build/src/lib_api.cc file.

https://github.com/apache/incubator-mxnet/blame/2fc0706874531fdfdbe49819eae0c88f8016eee3/tools/pip/setup.py#L108

szha commented 3 years ago

I think the cc files are missed from the MANIFEST.in. As a result the cc files are not included in the wheel. cc @samskalicky

@MoisesHer for now would it work if you include the cc file here to get unblocked first?

MoisesHer commented 3 years ago

thanks @szha , Do you mean to included the .cc as part of this PR within Gluon-NLP?

szha commented 3 years ago

Yes, I think that would be the fastest way to get unblocked on this PR.

samskalicky commented 3 years ago

I think the cc files are missed from the MANIFEST.in. As a result the cc files are not included in the wheel. cc @samskalicky

@MoisesHer for now would it work if you include the cc file here to get unblocked first?

im confused, we already had this PR https://github.com/apache/incubator-mxnet/pull/19393 to add the file in the pip wheel. After this @MoisesHer was able to get this working (hence the PR). But why is it not working all of a sudden, did something else change?

@MoisesHer are you using a diff pip wheel than before? Is that wheel built differently?

mli commented 3 years ago

Job PR-1389/11 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/11/index.html

MoisesHer commented 3 years ago

@samskalicky I was able to make it work locally by cloning MXNet 1.8 and compiling, but I think we never included the lib_api.cc within a wheel

samskalicky commented 3 years ago

@samskalicky I was able to make it work locally by cloning MXNet 1.8 and compiling, but I think we never included the lib_api.cc within a wheel

i added this line: https://github.com/apache/incubator-mxnet/blob/v1.8.x/tools/pip/setup.py#L108

As part of the https://github.com/apache/incubator-mxnet/pull/19393 for this specific reason (after our offline discussion on Slack). I remember building the wheel and making sure the file was there.

Maybe i built the wheel differently in my testing than the aws-mx one (or the ones on dist.mxnet.io).

Either way, ill work with @szha on https://github.com/apache/incubator-mxnet/pull/19850 and we'll backport as necessary

szha commented 3 years ago

@samskalicky the problem is that the files will only be included according to the manifest file. The if conditions in setup.py only copies the file in the location of packaging.

MoisesHer commented 3 years ago

@szha apart of the .cc file, the remaining issue is not related to this PR:

./scripts/ner/finetune_bert.py", line 34, in <module>
[2021-02-05T19:30:49.978Z]     import seqeval.metrics
...
File "./scripts/intent_cls_slot_labeling/finetune_icsl.py", line 39, in <module>
[2021-02-05T19:48:32.411Z]     from seqeval.metrics import f1_score as ner_f1_score
...
AttributeError: module 'enum' has no attribute 'Flag'

szha commented 3 years ago

@MoisesHer I see. I think the CI needs an update to the python versions. Working on it.

szha commented 3 years ago

I finished updating the conda versions on all CI hosts (as well as the docker image). Will babysit this PR.

mli commented 3 years ago

Job PR-1389/12 is complete. Docs are uploaded to http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR-1389/12/index.html

szha commented 3 years ago

Not sure why python3.5 is still picked up. I'm looking into the CI.

MoisesHer commented 3 years ago

Hi @szha, do you have any update? thanks

szha commented 3 years ago

We will switch the 0.x CI to github actions after #1531

codecov[bot] commented 3 years ago

Codecov Report

Merging #1389 (309351d) into v0.10.x (a3ba807) will decrease coverage by 0.01%. The diff coverage is 0.00%.

@@             Coverage Diff             @@
##           v0.10.x    #1389      +/-   ##
===========================================
- Coverage    33.44%   33.43%   -0.02%     
===========================================
  Files          155      155              
  Lines        15206    15213       +7     
===========================================
  Hits          5086     5086              
- Misses       10120    10127       +7

Impacted Files	Coverage Δ
scripts/bert/deploy.py	`0.00% <0.00%> (ø)`
scripts/bert/setup.py	`0.00% <0.00%> (ø)`

github-actions[bot] commented 3 years ago

The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1389/309351d3983077bac905d9df4f4575594f1633b5/index.html

szha commented 3 years ago

@MoisesHer thanks for the change. Would you also port the change to v0.x?

dmlc / gluon-nlp

Make BERT-GPU deploy compatible with MXNet 1.8 #1389

Description

Checklist

Essentials

Changes

Codecov Report