Open barry-jin opened 4 years ago
This error can be reproduced by running a small set of tests:
python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
git checkout master
python3 -m pip install --quiet -e .[extras]
python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py
There is a possible GPU memory leak when running test_models.py::test_tvm_integration on the 10.22 nightly release:
python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py::test_tvm_integration
Here are the logs before and after reverting #19378
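One way to compare GPU memory use before and after a change such as the revert is to sample `nvidia-smi` around the test run. The following is a minimal sketch, not what was used to produce the logs above; the helper names are hypothetical and it assumes `nvidia-smi` is on the PATH.

```python
import subprocess


def parse_memory_used(nvidia_smi_output):
    """Turn `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`
    output into a list of used-memory values in MiB, one entry per GPU."""
    return [int(line.strip())
            for line in nvidia_smi_output.splitlines()
            if line.strip()]


def sample_gpu_memory():
    """Return the used memory (MiB) of each visible GPU.

    Assumes nvidia-smi is installed; raises CalledProcessError otherwise.
    """
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        stdout=subprocess.PIPE,
        universal_newlines=True,  # Python 3.6 compatible spelling of text=True
        check=True,
    )
    return parse_memory_used(result.stdout)


def memory_growth(before, after):
    """Per-GPU difference in MiB between two samples."""
    return [a - b for b, a in zip(before, after)]
```

Calling `sample_gpu_memory()` before and after repeated runs of the test and passing both samples to `memory_growth` gives a rough per-GPU delta. Growth alone is not proof of a leak, since MXNet may keep freed GPU memory cached in its memory pool.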
@barry-jin: To investigate this problem I need to compile MXNet locally. Do you know which set of CMake options I need to use for that?
In my experience, I just used the following commands to build MXNet locally and reproduce the issue:
$ git clone --recursive https://github.com/apache/incubator-mxnet
$ cd incubator-mxnet
$ git checkout 43750c8bfed6ca91fc47fd1fa6d620197e26c84c
$ cp config/linux_gpu.cmake config.cmake
$ mkdir build; cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ..; ninja
$ cd ..
$ python3 -m pip install --user -e ./python
$ cd ~/workspace
$ git clone https://github.com/dmlc/gluon-nlp
$ cd ~/workspace/gluon-nlp
$ git checkout 8c8b0c9cda0853caa88fdbf4e0544986fbef243c
$ python3 -m pip install --quiet -e .[extras]
$ python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_gpt2.py
Thanks a lot for the script! Unfortunately, I am having a linking problem:
root@28b3a2b8de7a:/opt/mxnet/build# ninja
[1/3] Linking CXX shared library libmxnet.so
FAILED: libmxnet.so
. . .
Error copying file "/opt/mxnet/build/3rdparty/mkldnn/include/dnnl_config.h" to "/opt/mxnet/include/mkldnn/".
ninja: build stopped: subcommand failed.
The file dnnl_config.h is not present anywhere in incubator-mxnet.
You may try updating the 3rdparty submodules:
$ git clean -ffxd
$ git submodule update --init --recursive
@barry-jin: Is it true that the script you gave me should reproduce this problem? I tried it, and I don't see the error:
==== 71 passed, 16 skipped, 17 warnings in 1528.46s (0:25:28) ====
Just in case: the 16 tests were skipped because "JVM is not supported". I'm not sure whether a memory problem would show up in one of these tests.
@andrei5055 Thanks for your investigation. I think the warning message should be "TVM is not supported". You can follow the TVM documentation to install TVM. Alternatively, I will provide a test suite without TVM support that will reproduce this issue.
You can check out gluon-nlp at https://github.com/dmlc/gluon-nlp/commit/7910d6d247ec9cb1b51cd49d79e3d474b087b188 and run the following test suite:
git checkout 7910d6d247ec9cb1b51cd49d79e3d474b087b188
python3 -m pytest --device='gpu' --verbose --runslow tests/test_attention_cell.py tests/test_data_batchify.py tests/test_data_filtering.py tests/test_data_sampler.py tests/test_data_tokenizers.py tests/test_embedding.py tests/test_gluon_block.py tests/test_initializer.py tests/test_layers.py tests/test_loss.py tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_electra.py tests/test_models_gpt2.py tests/test_models_roberta.py tests/test_models_transformer.py
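To see in advance whether the TVM-dependent tests are likely to be skipped in a given environment, one can check whether the `tvm` package is importable. This is a sketch under the assumption that the skip depends on `tvm` being importable; the thread does not show the actual skip condition, and `module_available` is a hypothetical helper.

```python
import importlib.util


def module_available(name):
    """Return True if a top-level module called `name` can be found
    without actually importing it."""
    return importlib.util.find_spec(name) is not None


if __name__ == "__main__":
    # Assumption: the TVM integration tests need the `tvm` Python package.
    print("tvm importable:", module_available("tvm"))
```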
@barry-jin: I still cannot reproduce this problem:
========== 933 passed, 847 warnings in 2932.28s (0:48:52) =======
BTW, all warnings are of the following two types:
Type 1:
/opt/mxnet/python/mxnet/gluon/block.py:1098: UserWarning: Parameter 0b7a2e74_c816_4146_bbb2_7973d2ca9112_gamma, 0af6619c_7075_430a_9226_8458e6ca733a_bias, c75fe6d3_81e7_4748_9894_f49abf4b5f2a_bias, 53661f2f_d20f_4c90_a539_173394b859d3_weight, 2b4ce060_94a7_4cd1_ac29_4bdc41789888_weight, e19ccd3d_cc61_44b2_ab1a_20e88f571877_bias, 8f53b519_069f_415a_bd05_c8b4ec58dd24_const, 99d015d6_eeca_4ad6_9fc6_1fb55e43b0f7_weight, 711c0a20_91e2_43c3_ba41_48f5fd2a3398_gamma, d852d48d_ca52_408a_83f3_2c11bf3a01b8_beta, e0417d39_d73a_4101_a440_f992b45a176e_weight, 3f5329d5_0903_448a_8c7a_65536aa507a1_bias, d08c8d34_3bca_4006_9843_aa5d069767cf_beta is not used by any computation. Is this intended?
self._build_cache(*args)
Type 2:
/opt/mxnet/python/mxnet/registry.py:108: UserWarning: New initializer mxnet.gluon.parameter.Init registered with name constant_140658119590520 isoverriding existing initializer mxnet.gluon.parameter.Init
register(klass, name)
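A claim like "all warnings are of two types" can be checked by grouping the warning headers in a saved pytest log by source location. The sketch below assumes the log uses the standard `path:line: Category: message` header printed by Python's `warnings` module, as in the two samples above; the function name is hypothetical.

```python
import re
from collections import Counter

# Header printed for each warning: "<path>:<line>: <Category>: <message>"
WARNING_RE = re.compile(
    r"^(?P<path>\S+):(?P<line>\d+): (?P<category>\w+Warning): ")


def count_warning_sites(log_text):
    """Count warnings per (path, line, category) in a pytest log."""
    counts = Counter()
    for raw_line in log_text.splitlines():
        match = WARNING_RE.match(raw_line.strip())
        if match:
            key = (match.group("path"),
                   int(match.group("line")),
                   match.group("category"))
            counts[key] += 1
    return counts
```

For the log quoted here, the result would be expected to contain only the `block.py:1098` and `registry.py:108` entries if the observation holds.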
Description
`pytest` on `mxnet-cu102==2.0.0b20201022` will introduce a threading error (see Error Message). `mxnet-cu102==2.0.0b20201016` will not introduce this error.
Error Message
Run GluonNLP pytest with `mxnet-cu102==2.0.0b20201022`
```
[2020-10-22T21:15:51.430Z] ============================= test session starts ==============================
[2020-10-22T21:15:51.430Z] platform linux -- Python 3.6.9, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
[2020-10-22T21:15:51.432Z] rootdir: /workspace/gluon-nlp, configfile: pytest.ini
[2020-10-22T21:15:51.432Z] plugins: cov-2.10.1
[2020-10-22T21:15:52.426Z] collected 1283 items
[2020-10-22T21:16:01.630Z] tests/test_attention_cell.py ........................................... [ 3%]
[2020-10-22T21:16:06.668Z] ...................................................................... [ 8%]
[2020-10-22T21:16:06.796Z] tests/test_data_batchify.py ............................................ [ 12%]
[2020-10-22T21:16:21.672Z] ................................. [ 14%]
[2020-10-22T21:16:30.051Z] tests/test_data_filtering.py ..... [ 15%]
[2020-10-22T21:16:36.895Z] tests/test_data_loading.py . [ 15%]
[2020-10-22T21:16:37.213Z] tests/test_data_sampler.py ............................................. [ 18%]
[2020-10-22T21:16:38.566Z] ........................................................................ [ 24%]
[2020-10-22T21:16:40.003Z] ........................................................................ [ 30%]
[2020-10-22T21:16:40.579Z] ........................................................................ [ 35%]
[2020-10-22T21:16:41.143Z] ........................................................................ [ 41%]
[2020-10-22T21:16:42.040Z] ........................................................................ [ 46%]
[2020-10-22T21:16:42.299Z] ............... [ 48%]
[2020-10-22T21:18:34.088Z] tests/test_data_tokenizers.py .............. [ 49%]
[2020-10-22T21:18:34.095Z] tests/test_data_vocab.py . [ 49%]
[2020-10-22T21:22:22.268Z] tests/test_embedding.py .. [ 49%]
[2020-10-22T21:22:59.289Z] tests/test_gluon_block.py ..... [ 49%]
[2020-10-22T21:22:59.328Z] tests/test_initializer.py ... [ 49%]
[2020-10-22T21:23:00.225Z] tests/test_layers.py ........................... [ 52%]
[2020-10-22T21:23:00.312Z] tests/test_loss.py ........................ [ 53%]
[2020-10-22T21:37:39.851Z] tests/test_models.py ................................................ [ 57%]
[2020-10-22T21:38:46.438Z] tests/test_models_albert.py ................. [ 59%]
[2020-10-22T21:39:38.599Z] tests/test_models_bart.py ...... [ 59%]
[2020-10-22T21:44:18.743Z] tests/test_models_bert.py ............ [ 60%]
[2020-10-22T21:46:00.142Z] tests/test_models_electra.py ........ [ 61%]
[2020-10-22T21:49:47.086Z] tests/test_models_gpt2.py .......F [ 61%]
[2020-10-22T21:49:57.226Z] tests/test_models_mobilebert.py ..... [ 62%]
[2020-10-22T21:51:27.552Z] tests/test_models_roberta.py ....FF [ 62%]
[2020-10-22T21:52:10.783Z] tests/test_models_transformer.py ....................................... [ 65%]
[2020-10-22T21:53:33.876Z] ........................................................................ [ 71%]
[2020-10-22T21:54:26.540Z] ..........................................FFFFF [ 74%]
[2020-10-22T21:54:34.975Z] tests/test_models_transformer_xl.py ...... [ 75%]
[2020-10-22T21:55:47.820Z] tests/test_models_xlmr.py .FF [ 75%]
[2020-10-22T21:55:48.122Z] tests/test_op.py ....................................................... [ 79%]
[2020-10-22T21:55:48.754Z] ........................................................................ [ 85%]
[2020-10-22T21:55:49.195Z] .... [ 85%]
[2020-10-22T21:56:20.712Z] tests/test_optimizer.py . [ 85%]
[2020-10-22T21:56:20.716Z] tests/test_pytest.py . [ 85%]
[2020-10-22T21:56:21.005Z] tests/test_sequence_sampler.py ......................................... [ 89%]
[2020-10-22T21:56:21.522Z] ........................................................................ [ 94%]
[2020-10-22T21:56:33.345Z] ....................................... [ 97%]
[2020-10-22T21:56:33.590Z] Fatal Python error: Aborted
[2020-10-22T21:56:33.590Z] Thread 0x00007f92b9fff700 (most recent call first):
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 299 in wait
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 551 in wait
[2020-10-22T21:56:33.590Z] File "/usr/local/lib/python3.6/dist-packages/tqdm/_monitor.py", line 59 in run
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
[2020-10-22T21:56:33.590Z] Current thread 0x00007f9457153740 (most recent call first):
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66 in _launch
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/context.py", line 277 in _Popen
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/process.py", line 105 in start
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/pool.py", line 239 in _repopulate_pool
[2020-10-22T21:56:33.591Z] File "/usr/lib/python3.6/multiprocessing/pool.py", line 174 in __init__
[2020-10-22T21:56:33.591Z] File "/usr/lib/python3.6/multiprocessing/context.py", line 119 in Pool
[2020-10-22T21:56:33.591Z] File "/workspace/gluon-nlp/tests/test_utils_misc.py", line 87 in verify_download
[2020-10-22T21:56:33.591Z] File "/workspace/gluon-nlp/tests/test_utils_misc.py", line 102 in test_download_s3
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
```
To Reproduce
run reproduce.sh
reproduce.sh
```bash
#!/bin/bash
python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
git checkout master
python3 -m pip install --quiet -e .[extras]
python3 -m pytest --cov=. --cov-config=./.coveragerc --cov-report=xml --durations=50 --device="gpu" --runslow ./tests/
```
What have you tried to solve it?
Some observations:
mx.npx.waitall()
multiprocessing.Pool()
Between mxnet-cu102==2.0.0b20201016 and mxnet-cu102==2.0.0b20201022, I find the first bad commit is #19378
Environment
We recommend using our script for collecting the diagnostic information with the following command
curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3
Environment Information
```
[2020-10-27T16:59:27.002Z] ----------Python Info----------
[2020-10-27T16:59:27.002Z] Version : 3.6.9
[2020-10-27T16:59:27.002Z] Compiler : GCC 8.4.0
[2020-10-27T16:59:27.002Z] Build : ('default', 'Oct 8 2020 12:12:24')
[2020-10-27T16:59:27.003Z] Arch : ('64bit', '')
[2020-10-27T16:59:27.003Z] ------------Pip Info-----------
[2020-10-27T16:59:27.004Z] Version : 20.2.4
[2020-10-27T16:59:27.004Z] Directory : /usr/local/lib/python3.6/dist-packages/pip
[2020-10-27T16:59:27.004Z] ----------MXNet Info-----------
[2020-10-27T16:59:28.271Z] Version : 2.0.0
[2020-10-27T16:59:28.271Z] Directory : /root/.local/lib/python3.6/site-packages/mxnet
[2020-10-27T16:59:28.271Z] Commit hash file "/root/.local/lib/python3.6/site-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
[2020-10-27T16:59:28.271Z] Library : ['/root/.local/lib/python3.6/site-packages/mxnet/libmxnet.so']
[2020-10-27T16:59:28.271Z] Build features:
[2020-10-27T16:59:28.271Z] ✔ CUDA
[2020-10-27T16:59:28.271Z] ✔ CUDNN
[2020-10-27T16:59:28.271Z] ✖ NCCL
[2020-10-27T16:59:28.271Z] ✖ TENSORRT
[2020-10-27T16:59:28.271Z] ✖ CUTENSOR
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE2
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE3
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4_1
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4_2
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4A
[2020-10-27T16:59:28.271Z] ✖ CPU_AVX
[2020-10-27T16:59:28.271Z] ✖ CPU_AVX2
[2020-10-27T16:59:28.271Z] ✔ OPENMP
[2020-10-27T16:59:28.271Z] ✖ SSE
[2020-10-27T16:59:28.271Z] ✖ F16C
[2020-10-27T16:59:28.271Z] ✖ JEMALLOC
[2020-10-27T16:59:28.271Z] ✔ BLAS_OPEN
[2020-10-27T16:59:28.271Z] ✖ BLAS_ATLAS
[2020-10-27T16:59:28.271Z] ✖ BLAS_MKL
[2020-10-27T16:59:28.271Z] ✖ BLAS_APPLE
[2020-10-27T16:59:28.271Z] ✔ LAPACK
[2020-10-27T16:59:28.271Z] ✔ MKLDNN
[2020-10-27T16:59:28.271Z] ✔ OPENCV
[2020-10-27T16:59:28.271Z] ✔ DIST_KVSTORE
[2020-10-27T16:59:28.271Z] ✖ INT64_TENSOR_SIZE
[2020-10-27T16:59:28.271Z] ✔ SIGNAL_HANDLER
[2020-10-27T16:59:28.271Z] ✖ DEBUG
[2020-10-27T16:59:28.271Z] ✖ TVM_OP
[2020-10-27T16:59:28.271Z] ----------System Info----------
[2020-10-27T16:59:28.272Z] Platform : Linux-4.14.186-146.268.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
[2020-10-27T16:59:28.272Z] system : Linux
[2020-10-27T16:59:28.272Z] node : ip-10-20-91-122.ec2.internal
[2020-10-27T16:59:28.272Z] release : 4.14.186-146.268.amzn2.x86_64
[2020-10-27T16:59:28.272Z] version : #1 SMP Tue Jul 14 18:16:52 UTC 2020
[2020-10-27T16:59:28.272Z] ----------Hardware Info----------
[2020-10-27T16:59:28.272Z] machine : x86_64
[2020-10-27T16:59:28.272Z] processor : x86_64
[2020-10-27T16:59:28.297Z] Architecture: x86_64
[2020-10-27T16:59:28.297Z] CPU op-mode(s): 32-bit, 64-bit
[2020-10-27T16:59:28.297Z] Byte Order: Little Endian
[2020-10-27T16:59:28.297Z] CPU(s): 16
[2020-10-27T16:59:28.297Z] On-line CPU(s) list: 0-15
[2020-10-27T16:59:28.297Z] Thread(s) per core: 2
[2020-10-27T16:59:28.297Z] Core(s) per socket: 8
[2020-10-27T16:59:28.297Z] Socket(s): 1
[2020-10-27T16:59:28.297Z] NUMA node(s): 1
[2020-10-27T16:59:28.297Z] Vendor ID: GenuineIntel
[2020-10-27T16:59:28.297Z] CPU family: 6
[2020-10-27T16:59:28.297Z] Model: 85
[2020-10-27T16:59:28.297Z] Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
[2020-10-27T16:59:28.297Z] Stepping: 7
[2020-10-27T16:59:28.297Z] CPU MHz: 3103.458
[2020-10-27T16:59:28.297Z] BogoMIPS: 4999.99
[2020-10-27T16:59:28.297Z] Hypervisor vendor: KVM
[2020-10-27T16:59:28.297Z] Virtualization type: full
[2020-10-27T16:59:28.297Z] L1d cache: 32K
[2020-10-27T16:59:28.297Z] L1i cache: 32K
[2020-10-27T16:59:28.297Z] L2 cache: 1024K
[2020-10-27T16:59:28.297Z] L3 cache: 36608K
[2020-10-27T16:59:28.297Z] NUMA node0 CPU(s): 0-15
[2020-10-27T16:59:28.297Z] Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
[2020-10-27T16:59:28.298Z] ----------Network Test----------
[2020-10-27T16:59:28.298Z] Setting timeout: 10
[2020-10-27T16:59:28.766Z] Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0007 sec, LOAD: 0.4678 sec.
[2020-10-27T16:59:29.018Z] Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0861 sec, LOAD: 0.1656 sec.
[2020-10-27T16:59:29.168Z] Error open Gluon Tutorial(cn): https://zh.gluon.ai,