Closed szha closed 3 years ago
The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1512/5ce7c5f1b8e212a853a4d08717e0ccf875b7822a/index.html
The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1512/f9a5fb71925c75e9ca484b7a0e908756319460bf/index.html
Merging #1512 (0a41311) into master (8d31297) will decrease coverage by 0.62%. The diff coverage is 100.00%.
```diff
@@            Coverage Diff             @@
##           master    #1512      +/-   ##
==========================================
- Coverage   86.49%   85.87%     -0.63%
==========================================
  Files          55       55
  Lines        7502     7396      -106
==========================================
- Hits         6489     6351      -138
- Misses       1013     1045       +32
```
Impacted Files | Coverage Δ |
---|---|
setup.py | 0.00% <ø> (ø) |
src/gluonnlp/utils/misc.py | 54.86% <100.00%> (+0.21%) :arrow_up: |
conftest.py | 76.31% <0.00%> (-9.94%) :arrow_down: |
src/gluonnlp/data/loading.py | 75.75% <0.00%> (-7.64%) :arrow_down: |
src/gluonnlp/utils/lazy_imports.py | 58.42% <0.00%> (-2.25%) :arrow_down: |
src/gluonnlp/data/tokenizers/spacy.py | 65.33% <0.00%> (-0.91%) :arrow_down: |
src/gluonnlp/data/tokenizers/huggingface.py | 71.06% <0.00%> (-0.49%) :arrow_down: |
src/gluonnlp/data/tokenizers/jieba.py | 73.13% <0.00%> (-0.40%) :arrow_down: |
src/gluonnlp/models/transformer_xl.py | 80.48% <0.00%> (-0.39%) :arrow_down: |
... and 19 more |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8d31297...0a41311. Read the comment docs.
@szha From the error message, it seems to be related to how the MXNet integration is written. Currently, Horovod calls the MXNet API to determine some GPU-related flags, and it will fail if the instance being used has no GPU or is not configured appropriately. You may follow the guide in https://github.com/dmlc/gluon-nlp/tree/master/tools/docker#build-by-yourself and try again (you need to edit /etc/docker/daemon.json).
@sxjscience thanks. I think my system already has nvidia-docker2 installed and the config entry added. I think you are right that this has to do with how horovod integration is written. It's having trouble finding mxnet for some reason.
OK. I thought the GPU was not being used because I found the following warning in the log:
```
/root/.local/lib/python3.6/site-packages/torch/cuda/__init__.py:52: UserWarning: CUDA initialization: Found no NVIDIA driver on your system. Please check that you have an NVIDIA GPU and installed a driver from http://www.nvidia.com/Download/index.aspx (Triggered internally at /pytorch/c10/cuda/CUDAFunctions.cpp:100.)
  return torch._C._cuda_getDeviceCount() > 0
```
@sxjscience using the nvidia-docker command instead of docker (i.e. turning on the NVIDIA Docker runtime) should resolve that particular warning.
I think we may try to automate our docker pipeline.
> horovod build error
@szha do you mean this error occurs when you rebuild the container?
> @sxjscience using the nvidia-docker command instead of docker (i.e. turning on the NVIDIA Docker runtime) should resolve that particular warning.
That's not correct, because nvidia-docker only takes effect at runtime, not at build time. You need to follow the steps in https://github.com/dmlc/gluon-nlp/tree/master/tools/docker#build-by-yourself
@leezu @sxjscience thanks for helping. I noticed that I had previously missed the "default-runtime" entry in the config. Sorry for the oversight. I was able to complete the build after adding that entry, and I'm pushing the GPU docker image now.
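For reference, a /etc/docker/daemon.json with the NVIDIA runtime set as the default (the "default-runtime" entry mentioned above) typically looks like the following sketch; the runtime path may differ depending on how nvidia-docker2 was installed:

```json
{
    "default-runtime": "nvidia",
    "runtimes": {
        "nvidia": {
            "path": "nvidia-container-runtime",
            "runtimeArgs": []
        }
    }
}
```

After editing the file, restart the Docker daemon (e.g. `sudo systemctl restart docker`) so the default runtime is picked up by subsequent builds.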
looks like there might be an upstream change, as tests/test_data_tokenizers.py::test_spacy_tokenizer failed.
The documentation website for preview: http://gluon-nlp-staging.s3-accelerate.dualstack.amazonaws.com/PR1512/0a41311da7e394cce3459f93c90beef34c55f767/index.html
Description
add decorator for logging exceptions
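Per the coverage table, the change lands in src/gluonnlp/utils/misc.py. A minimal sketch of what such an exception-logging decorator could look like; the name `log_exceptions` and the logging details are assumptions for illustration, not the actual implementation in this PR:

```python
import functools
import logging

def log_exceptions(func):
    """Hypothetical decorator: log any exception raised by func, then re-raise it."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            # logging.exception records the message at ERROR level with the traceback
            logging.exception('Uncaught exception in %s', func.__name__)
            raise
    return wrapper

@log_exceptions
def divide(a, b):
    return a / b
```

Because the exception is re-raised, callers see the original error; the decorator only adds a log record with the full traceback before propagation.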
Checklist
Essentials
Changes
Comments
cc @dmlc/gluon-nlp-team