apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Memory allocation failed out of memory #19420

Open barry-jin opened 4 years ago

barry-jin commented 4 years ago

Description

  1. Running the full GluonNLP test suite with pytest on mxnet-cu102==2.0.0b20201022 triggers a threading error (see Error Message).
  2. Running the same full suite on mxnet-cu102==2.0.0b20201016 does not trigger this error.
  3. Running the same tests separately does not trigger it either.
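
The third point can be checked systematically by running each test file in its own process. The following is only a minimal sketch of that idea: the helper names are hypothetical, and the `--device` and `--runslow` flags are taken from the commands used in this report.

```python
import subprocess
import sys
from pathlib import Path


def build_pytest_command(test_file, device="gpu"):
    """Build the pytest command line for a single test file.

    --device and --runslow are the options used in this report; the helper
    itself is hypothetical and not part of GluonNLP.
    """
    return [sys.executable, "-m", "pytest", f"--device={device}",
            "--runslow", str(test_file)]


def run_each_file_separately(test_dir="tests"):
    """Run every test_*.py file in its own process and collect exit codes."""
    results = {}
    for test_file in sorted(Path(test_dir).glob("test_*.py")):
        completed = subprocess.run(build_pytest_command(test_file))
        results[str(test_file)] = completed.returncode
    return results
```

Because every file gets a fresh interpreter, GPU memory held by one file cannot carry over into the next, which is what distinguishes this from the full-suite run.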

Error Message

Run GluonNLP pytest with `mxnet-cu102==2.0.0b20201022`

```
[2020-10-22T21:15:51.430Z] ============================= test session starts ==============================
[2020-10-22T21:15:51.430Z] platform linux -- Python 3.6.9, pytest-6.1.1, py-1.9.0, pluggy-0.13.1
[2020-10-22T21:15:51.432Z] rootdir: /workspace/gluon-nlp, configfile: pytest.ini
[2020-10-22T21:15:51.432Z] plugins: cov-2.10.1
[2020-10-22T21:15:52.426Z] collected 1283 items
[2020-10-22T21:16:01.630Z] tests/test_attention_cell.py ........................................... [ 3%]
[2020-10-22T21:16:06.668Z] ...................................................................... [ 8%]
[2020-10-22T21:16:06.796Z] tests/test_data_batchify.py ............................................ [ 12%]
[2020-10-22T21:16:21.672Z] ................................. [ 14%]
[2020-10-22T21:16:30.051Z] tests/test_data_filtering.py ..... [ 15%]
[2020-10-22T21:16:36.895Z] tests/test_data_loading.py . [ 15%]
[2020-10-22T21:16:37.213Z] tests/test_data_sampler.py ............................................. [ 18%]
[2020-10-22T21:16:38.566Z] ........................................................................ [ 24%]
[2020-10-22T21:16:40.003Z] ........................................................................ [ 30%]
[2020-10-22T21:16:40.579Z] ........................................................................ [ 35%]
[2020-10-22T21:16:41.143Z] ........................................................................ [ 41%]
[2020-10-22T21:16:42.040Z] ........................................................................ [ 46%]
[2020-10-22T21:16:42.299Z] ............... [ 48%]
[2020-10-22T21:18:34.088Z] tests/test_data_tokenizers.py .............. [ 49%]
[2020-10-22T21:18:34.095Z] tests/test_data_vocab.py . [ 49%]
[2020-10-22T21:22:22.268Z] tests/test_embedding.py .. [ 49%]
[2020-10-22T21:22:59.289Z] tests/test_gluon_block.py ..... [ 49%]
[2020-10-22T21:22:59.328Z] tests/test_initializer.py ... [ 49%]
[2020-10-22T21:23:00.225Z] tests/test_layers.py ........................... [ 52%]
[2020-10-22T21:23:00.312Z] tests/test_loss.py ........................ [ 53%]
[2020-10-22T21:37:39.851Z] tests/test_models.py ................................................ [ 57%]
[2020-10-22T21:38:46.438Z] tests/test_models_albert.py ................. [ 59%]
[2020-10-22T21:39:38.599Z] tests/test_models_bart.py ...... [ 59%]
[2020-10-22T21:44:18.743Z] tests/test_models_bert.py ............ [ 60%]
[2020-10-22T21:46:00.142Z] tests/test_models_electra.py ........ [ 61%]
[2020-10-22T21:49:47.086Z] tests/test_models_gpt2.py .......F [ 61%]
[2020-10-22T21:49:57.226Z] tests/test_models_mobilebert.py ..... [ 62%]
[2020-10-22T21:51:27.552Z] tests/test_models_roberta.py ....FF [ 62%]
[2020-10-22T21:52:10.783Z] tests/test_models_transformer.py ....................................... [ 65%]
[2020-10-22T21:53:33.876Z] ........................................................................ [ 71%]
[2020-10-22T21:54:26.540Z] ..........................................FFFFF [ 74%]
[2020-10-22T21:54:34.975Z] tests/test_models_transformer_xl.py ...... [ 75%]
[2020-10-22T21:55:47.820Z] tests/test_models_xlmr.py .FF [ 75%]
[2020-10-22T21:55:48.122Z] tests/test_op.py ....................................................... [ 79%]
[2020-10-22T21:55:48.754Z] ........................................................................ [ 85%]
[2020-10-22T21:55:49.195Z] .... [ 85%]
[2020-10-22T21:56:20.712Z] tests/test_optimizer.py . [ 85%]
[2020-10-22T21:56:20.716Z] tests/test_pytest.py . [ 85%]
[2020-10-22T21:56:21.005Z] tests/test_sequence_sampler.py ......................................... [ 89%]
[2020-10-22T21:56:21.522Z] ........................................................................ [ 94%]
[2020-10-22T21:56:33.345Z] ....................................... [ 97%]
[2020-10-22T21:56:33.590Z] Fatal Python error: Aborted
[2020-10-22T21:56:33.590Z] Thread 0x00007f92b9fff700 (most recent call first):
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 299 in wait
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 551 in wait
[2020-10-22T21:56:33.590Z] File "/usr/local/lib/python3.6/dist-packages/tqdm/_monitor.py", line 59 in run
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 916 in _bootstrap_inner
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/threading.py", line 884 in _bootstrap
[2020-10-22T21:56:33.590Z] Current thread 0x00007f9457153740 (most recent call first):
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 66 in _launch
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/popen_fork.py", line 19 in __init__
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/context.py", line 277 in _Popen
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/process.py", line 105 in start
[2020-10-22T21:56:33.590Z] File "/usr/lib/python3.6/multiprocessing/pool.py", line 239 in _repopulate_pool
[2020-10-22T21:56:33.591Z] File "/usr/lib/python3.6/multiprocessing/pool.py", line 174 in __init__
[2020-10-22T21:56:33.591Z] File "/usr/lib/python3.6/multiprocessing/context.py", line 119 in Pool
[2020-10-22T21:56:33.591Z] File "/workspace/gluon-nlp/tests/test_utils_misc.py", line 87 in verify_download
[2020-10-22T21:56:33.591Z] File "/workspace/gluon-nlp/tests/test_utils_misc.py", line 102 in test_download_s3
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/_pytest/python.py", line 184 in pytest_pyfunc_call
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/_pytest/python.py", line 1627 in runtest
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 163 in pytest_runtest_call
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.591Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 256 in
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 310 in from_call
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 256 in call_runtest_hook
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 216 in call_and_report
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 127 in runtestprotocol
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/_pytest/runner.py", line 110 in pytest_runtest_protocol
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.592Z] File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 338 in pytest_runtestloop
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 313 in _main
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 257 in wrap_session
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/_pytest/main.py", line 306 in pytest_cmdline_main
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/callers.py", line 187 in _multicall
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 87 in
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/manager.py", line 93 in _hookexec
[2020-10-22T21:56:33.593Z] File "/root/.local/lib/python3.6/site-packages/pluggy/hooks.py", line 286 in __call__
[2020-10-22T21:56:33.594Z] File "/root/.local/lib/python3.6/site-packages/_pytest/config/__init__.py", line 165 in main
[2020-10-22T21:56:33.594Z] File "/root/.local/lib/python3.6/site-packages/_pytest/config/__init__.py", line 187 in console_main
[2020-10-22T21:56:33.594Z] File "/root/.local/lib/python3.6/site-packages/pytest/__main__.py", line 5 in
[2020-10-22T21:56:33.594Z] File "/usr/lib/python3.6/runpy.py", line 85 in _run_code
[2020-10-22T21:56:33.594Z] File "/usr/lib/python3.6/runpy.py", line 193 in _run_module_as_main
[2020-10-22T22:00:07.664Z] ./gluon_nlp_job.sh: line 39: 44 Aborted (core dumped) /bin/bash -o pipefail -c "$COMMAND"
```
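
To see at a glance which test files recorded failures in a progress log like the one above, the `F` marks can be tallied per file. This is a rough sketch rather than an official pytest tool: the regular expression and the function name are assumptions based on the log format shown here, including continuation lines that carry no file name.

```python
import re

# Either a test file name, or a run of pytest outcome marks that is
# followed by a progress percentage such as "[ 61%]".
TOKEN = re.compile(
    r"(?<!\S)(?P<path>tests/\S+\.py)(?!\S)"
    r"|(?<!\S)(?P<marks>[.FEsxX]+)(?=\s+\[\s*\d+%\])"
)


def files_with_failures(log_text):
    """Return {test file: number of 'F' marks}, skipping files with none."""
    failures = {}
    current = None
    for match in TOKEN.finditer(log_text):
        if match.group("path"):
            current = match.group("path")
        elif current is not None:
            count = match.group("marks").count("F")
            if count:
                failures[current] = failures.get(current, 0) + count
    return failures
```

Marks on a continuation line are attributed to the most recently named file, which matches how pytest wraps long progress lines.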

To Reproduce

Compute Environment: 
Instance type: g4dn.4x
vCPUs: 16 

run reproduce.sh

reproduce.sh

```bash
#!/bin/bash
python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
git checkout master
python3 -m pip install --quiet -e .[extras]
python3 -m pytest --cov=. --cov-config=./.coveragerc --cov-report=xml --durations=50 --device="gpu" --runslow ./tests/
```
$ chmod +x reproduce.sh
$ ./reproduce.sh

What have you tried to solve it?

Some observations:

  1. The failed tests all call mx.npx.waitall().
  2. The fatal abort happens inside multiprocessing.Pool().
  3. After bisecting the commits between the nightly builds mxnet-cu102==2.0.0b20201016 and mxnet-cu102==2.0.0b20201022, I found that the first bad commit is #19378.
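
The bisection in the third observation follows the usual binary-search pattern that `git bisect` automates. A minimal sketch, assuming the commits are ordered oldest to newest, the first is known good, the last is known bad, and every commit after the first bad one is also bad; `first_bad_commit` and the `is_bad` callback are hypothetical names, not part of any tool used in this report:

```python
def first_bad_commit(commits, is_bad):
    """Return the earliest commit for which is_bad(commit) is True.

    Invariant: commits[lo] is good and commits[hi] is bad.
    """
    lo, hi = 0, len(commits) - 1
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid   # the first bad commit is at mid or earlier
        else:
            lo = mid   # the first bad commit is after mid
    return commits[hi]
```

In practice `is_bad` would check out the commit, build MXNet, and run the failing tests, so each probe is expensive; the binary search keeps the number of builds logarithmic in the number of commits between the two nightlies.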

Environment

We recommend using our script for collecting the diagnostic information with the following command: `curl --retry 10 -s https://raw.githubusercontent.com/apache/incubator-mxnet/master/tools/diagnose.py | python3`

Environment Information

```
[2020-10-27T16:59:27.002Z] ----------Python Info----------
[2020-10-27T16:59:27.002Z] Version : 3.6.9
[2020-10-27T16:59:27.002Z] Compiler : GCC 8.4.0
[2020-10-27T16:59:27.002Z] Build : ('default', 'Oct 8 2020 12:12:24')
[2020-10-27T16:59:27.003Z] Arch : ('64bit', '')
[2020-10-27T16:59:27.003Z] ------------Pip Info-----------
[2020-10-27T16:59:27.004Z] Version : 20.2.4
[2020-10-27T16:59:27.004Z] Directory : /usr/local/lib/python3.6/dist-packages/pip
[2020-10-27T16:59:27.004Z] ----------MXNet Info-----------
[2020-10-27T16:59:28.271Z] Version : 2.0.0
[2020-10-27T16:59:28.271Z] Directory : /root/.local/lib/python3.6/site-packages/mxnet
[2020-10-27T16:59:28.271Z] Commit hash file "/root/.local/lib/python3.6/site-packages/mxnet/COMMIT_HASH" not found. Not installed from pre-built package or built from source.
[2020-10-27T16:59:28.271Z] Library : ['/root/.local/lib/python3.6/site-packages/mxnet/libmxnet.so']
[2020-10-27T16:59:28.271Z] Build features:
[2020-10-27T16:59:28.271Z] ✔ CUDA
[2020-10-27T16:59:28.271Z] ✔ CUDNN
[2020-10-27T16:59:28.271Z] ✖ NCCL
[2020-10-27T16:59:28.271Z] ✖ TENSORRT
[2020-10-27T16:59:28.271Z] ✖ CUTENSOR
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE2
[2020-10-27T16:59:28.271Z] ✔ CPU_SSE3
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4_1
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4_2
[2020-10-27T16:59:28.271Z] ✖ CPU_SSE4A
[2020-10-27T16:59:28.271Z] ✖ CPU_AVX
[2020-10-27T16:59:28.271Z] ✖ CPU_AVX2
[2020-10-27T16:59:28.271Z] ✔ OPENMP
[2020-10-27T16:59:28.271Z] ✖ SSE
[2020-10-27T16:59:28.271Z] ✖ F16C
[2020-10-27T16:59:28.271Z] ✖ JEMALLOC
[2020-10-27T16:59:28.271Z] ✔ BLAS_OPEN
[2020-10-27T16:59:28.271Z] ✖ BLAS_ATLAS
[2020-10-27T16:59:28.271Z] ✖ BLAS_MKL
[2020-10-27T16:59:28.271Z] ✖ BLAS_APPLE
[2020-10-27T16:59:28.271Z] ✔ LAPACK
[2020-10-27T16:59:28.271Z] ✔ MKLDNN
[2020-10-27T16:59:28.271Z] ✔ OPENCV
[2020-10-27T16:59:28.271Z] ✔ DIST_KVSTORE
[2020-10-27T16:59:28.271Z] ✖ INT64_TENSOR_SIZE
[2020-10-27T16:59:28.271Z] ✔ SIGNAL_HANDLER
[2020-10-27T16:59:28.271Z] ✖ DEBUG
[2020-10-27T16:59:28.271Z] ✖ TVM_OP
[2020-10-27T16:59:28.271Z] ----------System Info----------
[2020-10-27T16:59:28.272Z] Platform : Linux-4.14.186-146.268.amzn2.x86_64-x86_64-with-Ubuntu-18.04-bionic
[2020-10-27T16:59:28.272Z] system : Linux
[2020-10-27T16:59:28.272Z] node : ip-10-20-91-122.ec2.internal
[2020-10-27T16:59:28.272Z] release : 4.14.186-146.268.amzn2.x86_64
[2020-10-27T16:59:28.272Z] version : #1 SMP Tue Jul 14 18:16:52 UTC 2020
[2020-10-27T16:59:28.272Z] ----------Hardware Info----------
[2020-10-27T16:59:28.272Z] machine : x86_64
[2020-10-27T16:59:28.272Z] processor : x86_64
[2020-10-27T16:59:28.297Z] Architecture: x86_64
[2020-10-27T16:59:28.297Z] CPU op-mode(s): 32-bit, 64-bit
[2020-10-27T16:59:28.297Z] Byte Order: Little Endian
[2020-10-27T16:59:28.297Z] CPU(s): 16
[2020-10-27T16:59:28.297Z] On-line CPU(s) list: 0-15
[2020-10-27T16:59:28.297Z] Thread(s) per core: 2
[2020-10-27T16:59:28.297Z] Core(s) per socket: 8
[2020-10-27T16:59:28.297Z] Socket(s): 1
[2020-10-27T16:59:28.297Z] NUMA node(s): 1
[2020-10-27T16:59:28.297Z] Vendor ID: GenuineIntel
[2020-10-27T16:59:28.297Z] CPU family: 6
[2020-10-27T16:59:28.297Z] Model: 85
[2020-10-27T16:59:28.297Z] Model name: Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz
[2020-10-27T16:59:28.297Z] Stepping: 7
[2020-10-27T16:59:28.297Z] CPU MHz: 3103.458
[2020-10-27T16:59:28.297Z] BogoMIPS: 4999.99
[2020-10-27T16:59:28.297Z] Hypervisor vendor: KVM
[2020-10-27T16:59:28.297Z] Virtualization type: full
[2020-10-27T16:59:28.297Z] L1d cache: 32K
[2020-10-27T16:59:28.297Z] L1i cache: 32K
[2020-10-27T16:59:28.297Z] L2 cache: 1024K
[2020-10-27T16:59:28.297Z] L3 cache: 36608K
[2020-10-27T16:59:28.297Z] NUMA node0 CPU(s): 0-15
[2020-10-27T16:59:28.297Z] Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
[2020-10-27T16:59:28.298Z] ----------Network Test----------
[2020-10-27T16:59:28.298Z] Setting timeout: 10
[2020-10-27T16:59:28.766Z] Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0007 sec, LOAD: 0.4678 sec.
[2020-10-27T16:59:29.018Z] Timing for Gluon Tutorial(en): http://gluon.mxnet.io, DNS: 0.0861 sec, LOAD: 0.1656 sec.
[2020-10-27T16:59:29.168Z] Error open Gluon Tutorial(cn): https://zh.gluon.ai, , DNS finished in 0.11675071716308594 sec.
[2020-10-27T16:59:29.307Z] Timing for FashionMNIST: https://apache-mxnet.s3-accelerate.dualstack.amazonaws.com/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0076 sec, LOAD: 0.1308 sec.
[2020-10-27T16:59:29.489Z] Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0034 sec, LOAD: 0.1785 sec.
[2020-10-27T16:59:29.564Z] Error open Conda: https://repo.continuum.io/pkgs/free/, HTTP Error 403: Forbidden, DNS finished in 0.02842235565185547 sec.
[2020-10-27T16:59:29.564Z] ----------Environment----------
```
barry-jin commented 4 years ago

Update

To Reproduce

The error can also be reproduced by running a smaller set of tests.

python3 -m pip install -U --quiet --pre "mxnet-cu102==2.0.0b20201022" -f https://dist.mxnet.io/python
git clone https://github.com/dmlc/gluon-nlp; cd gluon-nlp
git checkout master
python3 -m pip install --quiet -e .[extras]
python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py
Error Message

```
Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1362441855 to reproduce.
============================== test session starts ===============================
platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/gluon-nlp, configfile: pytest.ini
plugins: cov-2.10.1
collected 95 items

tests/test_models.py::test_list_backbone_names PASSED [ 1%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED [ 2%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED [ 3%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED [ 4%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED [ 5%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED [ 6%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED [ 7%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED [ 8%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED [ 9%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED [ 10%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 11%]
tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED [ 12%]
tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED [ 13%]
tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED [ 14%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED [ 15%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED [ 16%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED [ 17%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED [ 18%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED [ 20%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED [ 21%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED [ 22%]
tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED [ 23%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED [ 24%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED [ 25%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED [ 26%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED [ 27%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED [ 28%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED [ 29%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 30%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 31%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED [ 32%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED [ 33%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 34%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 35%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED [ 36%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED [ 37%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 38%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 40%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED [ 41%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED [ 42%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 43%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 44%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED [ 45%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED [ 46%]
tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED [ 47%]
tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED [ 48%]
tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED [ 49%]
tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED [ 50%]
tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED [ 51%]
tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED [ 52%]
tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED [ 53%]
tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED [ 54%]
tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED [ 55%]
tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED [ 56%]
tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED [ 57%]
tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED [ 58%]
tests/test_models_albert.py::test_list_pretrained_albert PASSED [ 60%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 61%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 62%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 63%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 64%]
tests/test_models_bart.py::test_list_pretrained_bart PASSED [ 65%]
tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED [ 66%]
tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED [ 67%]
tests/test_models_bart.py::test_bart_cfg_registry PASSED [ 68%]
tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED [ 69%]
tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED [ 70%]
tests/test_models_bert.py::test_list_pretrained_bert PASSED [ 71%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED [ 72%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED [ 73%]
tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED [ 74%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 75%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 76%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 77%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 78%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 80%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 81%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 82%]
tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 83%]
tests/test_models_electra.py::test_list_pretrained_electra PASSED [ 84%]
tests/test_models_electra.py::test_electra_model[ctx0-auto] PASSED [ 85%]
tests/test_models_electra.py::test_electra_model[ctx0-NT] PASSED [ 86%]
tests/test_models_electra.py::test_electra_model[ctx0-TN] PASSED [ 87%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-gluon_electra_small_owt] PASSED [ 88%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_base] PASSED [ 89%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_large] PASSED [ 90%]
tests/test_models_electra.py::test_electra_get_pretrained[ctx0-google_electra_small] PASSED [ 91%]
tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED [ 92%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED [ 93%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED [ 94%]
tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED [ 95%]
tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED [ 96%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] PASSED [ 97%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED [ 98%]
tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] FAILED [100%]

============================================================== FAILURES ==============================================================
_____________________________________________________ test_gpt2[ctx0-gpt2_774M] ______________________________________________________

model_name = 'gpt2_774M', ctx = gpu(0)

    @pytest.mark.slow
    @pytest.mark.remote_required
    @pytest.mark.parametrize('model_name', ['gpt2_124M', 'gpt2_355M', 'gpt2_774M'])
    def test_gpt2(model_name, ctx):
        # test from pretrained
        assert len(list_pretrained_gpt2()) > 0
        with tempfile.TemporaryDirectory() as root, ctx:
            cfg, tokenizer, params_path, lm_params_path =\
                get_pretrained_gpt2(model_name, load_backbone=True, load_lm=True, root=root)
            assert cfg.MODEL.vocab_size == len(tokenizer.vocab)
            # test backbone
            gpt2_model = GPT2Model.from_cfg(cfg)
            gpt2_model.load_parameters(params_path)
            # test lm model
            gpt2_lm_model = GPT2ForLM(cfg)
            gpt2_lm_model.load_parameters(lm_params_path)

            # test forward
            batch_size = 3
            seq_length = 32
            vocab_size = len(tokenizer.vocab)
            input_ids = mx.np.array(
                np.random.randint(
                    2,
                    vocab_size,
                    (batch_size, seq_length)
                ),
                dtype=np.int32,
                ctx=ctx
            )
            logits, _ = gpt2_lm_model(
                input_ids,
                gpt2_lm_model.init_states(batch_size, ctx)
            )
>           mx.npx.waitall()

tests/test_models_gpt2.py:142:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/local/lib/python3.6/dist-packages/mxnet/ndarray/ndarray.py:240: in waitall
    check_call(_LIB.MXNDArrayWaitAll())
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

ret = -1

    def check_call(ret):
        """Check the return value of C API call.

        This function will raise an exception when an error occurs.
        Wrap every API call with this function.

        Parameters
        ----------
        ret : int
            return value from API calls.
        """
        if ret != 0:
>           raise get_last_ffi_error()
E           mxnet.base.MXNetError: Traceback (most recent call last):
E             File "../src/storage/./pooled_storage_manager.h", line 192
E           MXNetError: Memory allocation failed out of memory

/usr/local/lib/python3.6/dist-packages/mxnet/base.py:246: MXNetError
-------------------------------------------------------- Captured stdout call --------------------------------------------------------
Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-9dc62091.vocab from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-9dc62091.vocab...
Downloading /tmp/tmpbj080s2v/gpt2_774M/gpt2-396d4d8e.merges from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-396d4d8e.merges...
Downloading /tmp/tmpbj080s2v/gpt2_774M/model-9917e24e.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model-9917e24e.params...
Downloading /tmp/tmpbj080s2v/gpt2_774M/model_lm-cfbfa641.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model_lm-cfbfa641.params...
-------------------------------------------------------- Captured stderr call --------------------------------------------------------
100%|██████████| 558k/558k [00:00<00:00, 7.15MiB/s]
100%|██████████| 456k/456k [00:00<00:00, 6.39MiB/s]
100%|██████████| 3.10G/3.10G [01:16<00:00, 40.5MiB/s]
100%|██████████| 3.10G/3.10G [01:20<00:00, 38.6MiB/s]
========================================================== warnings summary ==========================================================
src/gluonnlp/attention_cell.py:715
  /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s
    """

src/gluonnlp/op.py:226
  /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p
    """

tests/test_models_albert.py: 6 warnings
tests/test_models_bart.py: 2 warnings
tests/test_models_bert.py: 3 warnings
tests/test_models_gpt2.py: 3 warnings
  /usr/local/lib/python3.6/dist-packages/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize.
    v.initialize(None, ctx, init, force_reinit=force_reinit)

-- Docs: https://docs.pytest.org/en/stable/warnings.html
====================================================== short test summary info =======================================================
FAILED tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] - mxnet.base.MXNetError: Traceback (most recent call last):
======================================= 1 failed, 94 passed, 16 warnings in 1990.67s (0:33:10) =======================================
```

Possible memory leak.

There may be a GPU memory leak when running test_models.py::test_tvm_integration on the 2020-10-22 nightly release.

python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py::test_tvm_integration
[Four screenshots attached, taken 2020-11-04 between 9:39 AM and 9:45 AM; images not preserved.]
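
Since the screenshots are not preserved here, one way to gather comparable numbers is to sample GPU memory with `nvidia-smi` while the test runs. The sketch below assumes `nvidia-smi` is on the PATH; the function names and the growth heuristic are illustrative only, and steady growth is a hint rather than proof of a leak, because a pooling allocator may hold on to memory it intends to reuse.

```python
import subprocess


def parse_memory_used(nvidia_smi_output):
    """Parse one memory.used value (MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip())
            for line in nvidia_smi_output.splitlines() if line.strip()]


def query_memory_used():
    """Ask nvidia-smi for the memory currently in use on each GPU, in MiB."""
    output = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        universal_newlines=True,
    )
    return parse_memory_used(output)


def looks_like_leak(samples, min_growth_mib=1):
    """True if each sample exceeds the previous one by at least min_growth_mib."""
    return len(samples) > 1 and all(
        later - earlier >= min_growth_mib
        for earlier, later in zip(samples, samples[1:])
    )
```

Calling `query_memory_used()` after each parametrized case of `test_tvm_integration` and passing the readings for one GPU to `looks_like_leak` would show whether usage keeps climbing across cases.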
barry-jin commented 3 years ago

Here are the logs before and after reverting #19378

Before Revert

```
root@6a1ad75b3392:/workspace/incubator-mxnet# git log -1
commit 43750c8bfed6ca91fc47fd1fa6d620197e26c84c (HEAD)
Author: Przemyslaw Tredak
Date: Wed Oct 21 11:50:12 2020 -0700

    Remove cleanup on side threads (#19378)

    * Remove cleanup on side threads

    * removed comment
root@6a1ad75b3392:/workspace/incubator-mxnet# cd ../gluon-nlp/ ; python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_gpt2.py
Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1033001789 to reproduce.
=================================== test session starts ====================================
platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /workspace/gluon-nlp, configfile: pytest.ini
plugins: cov-2.10.1
collected 87 items

tests/test_models.py::test_list_backbone_names PASSED [ 1%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED [ 2%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED [ 3%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED [ 4%]
tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED [ 5%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED [ 6%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED [ 8%]
tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED [ 9%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED [ 10%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED [ 11%]
tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 12%]
tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED [ 13%]
tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED [ 14%]
tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED [ 16%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED [ 17%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED [ 18%]
tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED [ 19%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED [ 20%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED [ 21%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED [ 22%]
tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED [ 24%]
tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED [ 25%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED [ 26%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED [ 27%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED [ 28%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED [ 29%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED [ 31%]
tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED [ 32%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 33%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 34%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED [ 35%]
tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED [ 36%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 37%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 39%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED [ 40%]
tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED [ 41%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 42%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 43%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED [ 44%]
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED [ 45%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 47%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 48%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED [ 49%]
tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED [ 50%]
tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED [ 51%]
tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED [ 52%]
tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED [ 54%]
tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED [ 55%]
tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED [ 56%]
tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED [ 57%]
tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED [ 58%]
tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED [ 59%]
tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED [ 60%]
tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED [ 62%]
tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED [ 63%]
tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED [ 64%]
tests/test_models_albert.py::test_list_pretrained_albert PASSED [ 65%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 66%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 67%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 68%]
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 70%] tests/test_models_bart.py::test_list_pretrained_bart PASSED [ 71%] tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED [ 72%] tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED [ 73%] tests/test_models_bart.py::test_bart_cfg_registry PASSED [ 74%] tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED [ 75%] tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED [ 77%] tests/test_models_bert.py::test_list_pretrained_bert PASSED [ 78%] tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED [ 79%] tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED [ 80%] tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED [ 81%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 82%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 83%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 85%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 86%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 87%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 88%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 89%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 90%] tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED [ 91%] tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED [ 93%] tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED [ 94%] tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED [ 95%] tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED [ 96%] tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] 
PASSED [ 97%] tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED [ 98%] tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] FAILED [100%] ========================================= FAILURES ========================================= ________________________________ test_gpt2[ctx0-gpt2_774M] _________________________________ model_name = 'gpt2_774M', ctx = gpu(0) @pytest.mark.slow @pytest.mark.remote_required @pytest.mark.parametrize('model_name', ['gpt2_124M', 'gpt2_355M', 'gpt2_774M']) def test_gpt2(model_name, ctx): # test from pretrained assert len(list_pretrained_gpt2()) > 0 with tempfile.TemporaryDirectory() as root, ctx: cfg, tokenizer, params_path, lm_params_path =\ get_pretrained_gpt2(model_name, load_backbone=True, load_lm=True, root=root) assert cfg.MODEL.vocab_size == len(tokenizer.vocab) # test backbone gpt2_model = GPT2Model.from_cfg(cfg) gpt2_model.load_parameters(params_path) # test lm model gpt2_lm_model = GPT2ForLM(cfg) gpt2_lm_model.load_parameters(lm_params_path) # test forward batch_size = 3 seq_length = 32 vocab_size = len(tokenizer.vocab) input_ids = mx.np.array( np.random.randint( 2, vocab_size, (batch_size, seq_length) ), dtype=np.int32, ctx=ctx ) logits, _ = gpt2_lm_model( input_ids, gpt2_lm_model.init_states(batch_size, ctx) ) > mx.npx.waitall() tests/test_models_gpt2.py:142: _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ../incubator-mxnet/python/mxnet/ndarray/ndarray.py:240: in waitall check_call(_LIB.MXNDArrayWaitAll()) _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ ret = -1 def check_call(ret): """Check the return value of C API call. This function will raise an exception when an error occurs. Wrap every API call with this function. Parameters ---------- ret : int return value from API calls. 
""" if ret != 0: > raise get_last_ffi_error() E mxnet.base.MXNetError: Traceback (most recent call last): E File "../src/storage/./pooled_storage_manager.h", line 192 E MXNetError: Memory allocation failed out of memory ../incubator-mxnet/python/mxnet/base.py:246: MXNetError ----------------------------------- Captured stdout call ----------------------------------- Downloading /tmp/tmpzxj5da72/gpt2_774M/gpt2-9dc62091.vocab from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-9dc62091.vocab... Downloading /tmp/tmpzxj5da72/gpt2_774M/gpt2-396d4d8e.merges from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/gpt2-396d4d8e.merges... Downloading /tmp/tmpzxj5da72/gpt2_774M/model-9917e24e.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model-9917e24e.params... Downloading /tmp/tmpzxj5da72/gpt2_774M/model_lm-cfbfa641.params from https://gluonnlp-numpy-data.s3-accelerate.amazonaws.com/models/gpt2_774M/model_lm-cfbfa641.params... 
----------------------------------- Captured stderr call ----------------------------------- 100%|██████████| 558k/558k [00:00<00:00, 3.45MiB/s] 100%|██████████| 456k/456k [00:00<00:00, 4.16MiB/s] 100%|██████████| 3.10G/3.10G [01:07<00:00, 45.9MiB/s] 100%|██████████| 3.10G/3.10G [01:18<00:00, 39.4MiB/s] ===================================== warnings summary ===================================== ../incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67 /workspace/incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67: DeprecationWarning: invalid escape sequence \( tuple_re = re.compile('\([0-9L|,| ]+\)') src/gluonnlp/attention_cell.py:715 /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s """ src/gluonnlp/op.py:226 /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p """ tests/test_models_albert.py: 6 warnings tests/test_models_bart.py: 2 warnings tests/test_models_bert.py: 3 warnings tests/test_models_gpt2.py: 3 warnings /workspace/incubator-mxnet/python/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize. v.initialize(None, ctx, init, force_reinit=force_reinit) -- Docs: https://docs.pytest.org/en/stable/warnings.html ================================= short test summary info ================================== FAILED tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] - mxnet.base.MXNetError: Trac... ================== 1 failed, 86 passed, 17 warnings in 1718.22s (0:28:38) ================== root@6a1ad75b3392:/workspace/gluon-nlp# ```
After Revert ``` root@6a1ad75b3392:/workspace/incubator-mxnet# git log -1 commit d786518725ebfdfceeea7b09d3ecb8edf6bbbfaa (HEAD) Author: barry-jin Date: Tue Dec 8 21:42:28 2020 +0000 Revert "Remove cleanup on side threads (#19378)" This reverts commit 43750c8bfed6ca91fc47fd1fa6d620197e26c84c. root@6a1ad75b3392:/workspace/incubator-mxnet# cd ../gluon-nlp/ ; python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_gpt2.py Setting module np/mx/python random seeds, use MXNET_MODULE_SEED=1725596454 to reproduce. =================================== test session starts ==================================== platform linux -- Python 3.6.9, pytest-6.1.2, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3 cachedir: .pytest_cache rootdir: /workspace/gluon-nlp, configfile: pytest.ini plugins: cov-2.10.1 collected 87 items tests/test_models.py::test_list_backbone_names PASSED [ 1%] tests/test_models.py::test_get_backbone[ctx0-google_albert_base_v2] PASSED [ 2%] tests/test_models.py::test_get_backbone[ctx0-google_albert_large_v2] PASSED [ 3%] tests/test_models.py::test_get_backbone[ctx0-google_albert_xlarge_v2] PASSED [ 4%] tests/test_models.py::test_get_backbone[ctx0-google_albert_xxlarge_v2] PASSED [ 5%] tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_base] PASSED [ 6%] tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_large] PASSED [ 8%] tests/test_models.py::test_get_backbone[ctx0-google_en_cased_bert_wwm_large] PASSED [ 9%] tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_base] PASSED [ 10%] tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_large] PASSED [ 11%] tests/test_models.py::test_get_backbone[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 12%] tests/test_models.py::test_get_backbone[ctx0-google_multi_cased_bert_base] PASSED [ 13%] 
tests/test_models.py::test_get_backbone[ctx0-google_zh_bert_base] PASSED [ 14%] tests/test_models.py::test_get_backbone[ctx0-gluon_electra_small_owt] PASSED [ 16%] tests/test_models.py::test_get_backbone[ctx0-google_electra_base] PASSED [ 17%] tests/test_models.py::test_get_backbone[ctx0-google_electra_large] PASSED [ 18%] tests/test_models.py::test_get_backbone[ctx0-google_electra_small] PASSED [ 19%] tests/test_models.py::test_get_backbone[ctx0-gpt2_124M] PASSED [ 20%] tests/test_models.py::test_get_backbone[ctx0-gpt2_1558M] PASSED [ 21%] tests/test_models.py::test_get_backbone[ctx0-gpt2_355M] PASSED [ 22%] tests/test_models.py::test_get_backbone[ctx0-gpt2_774M] PASSED [ 24%] tests/test_models.py::test_get_backbone[ctx0-google_uncased_mobilebert] PASSED [ 25%] tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_base] PASSED [ 26%] tests/test_models.py::test_get_backbone[ctx0-fairseq_roberta_large] PASSED [ 27%] tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_base] PASSED [ 28%] tests/test_models.py::test_get_backbone[ctx0-fairseq_xlmr_large] PASSED [ 29%] tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_base] PASSED [ 31%] tests/test_models.py::test_get_backbone[ctx0-fairseq_bart_large] PASSED [ 32%] tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_albert_base_v2] PASSED [ 33%] tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_en_cased_bert_base] PASSED [ 34%] tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-google_electra_small] PASSED [ 35%] tests/test_models.py::test_tvm_integration[ctx0-NT-2-4-fairseq_bart_base] PASSED [ 36%] tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_albert_base_v2] PASSED [ 37%] tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_en_cased_bert_base] PASSED [ 39%] tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-google_electra_small] PASSED [ 40%] tests/test_models.py::test_tvm_integration[ctx0-NT-1-4-fairseq_bart_base] PASSED [ 41%] 
tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_albert_base_v2] PASSED [ 42%] tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_en_cased_bert_base] PASSED [ 43%] tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-google_electra_small] PASSED [ 44%] tests/test_models.py::test_tvm_integration[ctx0-TN-2-4-fairseq_bart_base] PASSED [ 45%] tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_albert_base_v2] PASSED [ 47%] tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_en_cased_bert_base] PASSED [ 48%] tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-google_electra_small] PASSED [ 49%] tests/test_models.py::test_tvm_integration[ctx0-TN-1-4-fairseq_bart_base] PASSED [ 50%] tests/test_models_albert.py::test_albert_backbone[auto-False-False] PASSED [ 51%] tests/test_models_albert.py::test_albert_backbone[auto-True-True] PASSED [ 52%] tests/test_models_albert.py::test_albert_backbone[NT-False-False] PASSED [ 54%] tests/test_models_albert.py::test_albert_backbone[NT-True-True] PASSED [ 55%] tests/test_models_albert.py::test_albert_backbone[TN-False-False] PASSED [ 56%] tests/test_models_albert.py::test_albert_backbone[TN-True-True] PASSED [ 57%] tests/test_models_albert.py::test_albert_for_mlm_model[auto] PASSED [ 58%] tests/test_models_albert.py::test_albert_for_mlm_model[NT] PASSED [ 59%] tests/test_models_albert.py::test_albert_for_mlm_model[TN] PASSED [ 60%] tests/test_models_albert.py::test_albert_for_pretrain_model[auto] PASSED [ 62%] tests/test_models_albert.py::test_albert_for_pretrain_model[NT] PASSED [ 63%] tests/test_models_albert.py::test_albert_for_pretrain_model[TN] PASSED [ 64%] tests/test_models_albert.py::test_list_pretrained_albert PASSED [ 65%] tests/test_models_albert.py::test_albert_get_pretrained[google_albert_base_v2] PASSED [ 66%] tests/test_models_albert.py::test_albert_get_pretrained[google_albert_large_v2] PASSED [ 67%] 
tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xlarge_v2] PASSED [ 68%] tests/test_models_albert.py::test_albert_get_pretrained[google_albert_xxlarge_v2] PASSED [ 70%] tests/test_models_bart.py::test_list_pretrained_bart PASSED [ 71%] tests/test_models_bart.py::test_bart[fairseq_bart_base] PASSED [ 72%] tests/test_models_bart.py::test_bart[fairseq_bart_large] PASSED [ 73%] tests/test_models_bart.py::test_bart_cfg_registry PASSED [ 74%] tests/test_models_bart.py::test_bart_cfg[bart_base] PASSED [ 75%] tests/test_models_bart.py::test_bart_cfg[bart_large] PASSED [ 77%] tests/test_models_bert.py::test_list_pretrained_bert PASSED [ 78%] tests/test_models_bert.py::test_bert_small_cfg[ctx0-auto] PASSED [ 79%] tests/test_models_bert.py::test_bert_small_cfg[ctx0-NT] PASSED [ 80%] tests/test_models_bert.py::test_bert_small_cfg[ctx0-TN] PASSED [ 81%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_base] PASSED [ 82%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_large] PASSED [ 83%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_cased_bert_wwm_large] PASSED [ 85%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_base] PASSED [ 86%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_large] PASSED [ 87%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_en_uncased_bert_wwm_large] PASSED [ 88%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_multi_cased_bert_base] PASSED [ 89%] tests/test_models_bert.py::test_bert_get_pretrained[ctx0-google_zh_bert_base] PASSED [ 90%] tests/test_models_gpt2.py::test_list_pretrained_gpt2 PASSED [ 91%] tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-auto] PASSED [ 93%] tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-TN] PASSED [ 94%] tests/test_models_gpt2.py::test_gpt2_small_config[ctx0-NT] PASSED [ 95%] 
tests/test_models_gpt2.py::test_gpt2_incremental_states[ctx0] PASSED [ 96%] tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_124M] PASSED [ 97%] tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_355M] PASSED [ 98%] tests/test_models_gpt2.py::test_gpt2[ctx0-gpt2_774M] PASSED [100%] ===================================== warnings summary ===================================== ../incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67 /workspace/incubator-mxnet/python/mxnet/contrib/onnx/mx2onnx/_op_translations.py:67: DeprecationWarning: invalid escape sequence \( tuple_re = re.compile('\([0-9L|,| ]+\)') src/gluonnlp/attention_cell.py:715 /workspace/gluon-nlp/src/gluonnlp/attention_cell.py:715: DeprecationWarning: invalid escape sequence \s """ src/gluonnlp/op.py:226 /workspace/gluon-nlp/src/gluonnlp/op.py:226: DeprecationWarning: invalid escape sequence \p """ tests/test_models_albert.py: 6 warnings tests/test_models_bart.py: 2 warnings tests/test_models_bert.py: 3 warnings tests/test_models_gpt2.py: 3 warnings /workspace/incubator-mxnet/python/mxnet/gluon/block.py:572: UserWarning: Parameter 'weight' is already initialized, ignoring. Set force_reinit=True to re-initialize. v.initialize(None, ctx, init, force_reinit=force_reinit) -- Docs: https://docs.pytest.org/en/stable/warnings.html ======================= 87 passed, 17 warnings in 1928.37s (0:32:08) ======================= root@6a1ad75b3392:/workspace/gluon-nlp# ```
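
To compare the two runs above without reading the full logs, the final pytest summary lines can be parsed into counts. This is a minimal sketch: `parse_summary` is a hypothetical helper (not part of pytest or MXNet), and the regular expressions are assumptions that only cover the summary format shown in these logs.

```python
import re

# Matches one "<count> <outcome>" pair, e.g. "86 passed" or "17 warnings".
_COUNT_RE = re.compile(r"(\d+) (failed|passed|skipped|warnings?|errors?)")
# Matches the total wall-clock time, e.g. "in 1718.22s".
_TIME_RE = re.compile(r"in ([\d.]+)s")


def parse_summary(line):
    """Turn a pytest summary line into a dict of counts plus 'seconds'."""
    counts = {}
    for number, outcome in _COUNT_RE.findall(line):
        # Normalise singular/plural so "warning" and "warnings" share a key.
        key = outcome if outcome.endswith("ed") else outcome.rstrip("s") + "s"
        counts[key] = int(number)
    match = _TIME_RE.search(line)
    if match:
        counts["seconds"] = float(match.group(1))
    return counts


# Summary lines copied from the "Before Revert" and "After Revert" logs.
before = parse_summary("1 failed, 86 passed, 17 warnings in 1718.22s (0:28:38)")
after = parse_summary("87 passed, 17 warnings in 1928.37s (0:32:08)")
print("failed before:", before.get("failed", 0))
print("failed after:", after.get("failed", 0))
```

The only difference between the two summaries is the single failure, `test_gpt2[ctx0-gpt2_774M]`, which passes after the revert.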
andrei5055 commented 3 years ago

@barry-jin: To investigate this problem, I need to compile MXNet locally. Do you know which set of cmake options I need to use for that?

barry-jin commented 3 years ago

In my experience, the following commands are enough to build MXNet locally and reproduce the issue:

$ git clone --recursive https://github.com/apache/incubator-mxnet
$ cd incubator-mxnet
$ git checkout 43750c8bfed6ca91fc47fd1fa6d620197e26c84c
$ cp config/linux_gpu.cmake config.cmake
$ mkdir build; cd build
$ cmake -GNinja -DCMAKE_BUILD_TYPE=Debug ..; ninja
$ cd ..
$ python3 -m pip install --user -e ./python
$ cd ~/workspace
$ git clone https://github.com/dmlc/gluon-nlp
$ cd ~/workspace/gluon-nlp
$ git checkout 8c8b0c9cda0853caa88fdbf4e0544986fbef243c
$ python3 -m pip install --quiet -e .[extras]
$ python3 -m pytest --device='gpu' --verbose --runslow tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_gpt2.py
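
While the test run above is in progress, it can help to sample GPU memory from a second terminal to see how usage grows from test to test. The sketch below assumes `nvidia-smi` is on the PATH; `parse_gpu_memory` and `sample_gpu_memory` are hypothetical helper names. Note that a memory pool may keep freed blocks cached (an assumption about how MXNet's pooled storage manager behaves), so a high "used" value alone does not prove a leak.

```python
import subprocess

# nvidia-smi query that prints "used, total" in MiB, one line per GPU.
QUERY = [
    "nvidia-smi",
    "--query-gpu=memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_memory(text):
    """Parse nvidia-smi CSV output into a list of (used_mib, total_mib)."""
    rows = []
    for line in text.strip().splitlines():
        used, total = (int(field) for field in line.split(","))
        rows.append((used, total))
    return rows


def sample_gpu_memory():
    """Run nvidia-smi once and return per-GPU (used_mib, total_mib)."""
    # stdout=PIPE / universal_newlines keep this compatible with Python 3.6.
    result = subprocess.run(
        QUERY, check=True, stdout=subprocess.PIPE, universal_newlines=True
    )
    return parse_gpu_memory(result.stdout)
```

Calling `sample_gpu_memory()` in a loop with a short sleep, and logging the result next to a timestamp, gives a rough usage timeline that can be lined up with the pytest output.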
andrei5055 commented 3 years ago

Thanks a lot for the script! Unfortunately, I am having a linking problem:

root@28b3a2b8de7a:/opt/mxnet/build# ninja
[1/3] Linking CXX shared library libmxnet.so
FAILED: libmxnet.so 
. . .
Error copying file "/opt/mxnet/build/3rdparty/mkldnn/include/dnnl_config.h" to "/opt/mxnet/include/mkldnn/".
ninja: build stopped: subcommand failed.

The file dnnl_config.h is not present anywhere in incubator-mxnet.

barry-jin commented 3 years ago

You could try updating the 3rdparty submodules:

$ git clean -ffxd
$ git submodule update --init --recursive
andrei5055 commented 3 years ago

@barry-jin: Is it true that the script you gave me should reproduce this problem? I tried it, and I don't see the failure: ==== 71 passed, 16 skipped, 17 warnings in 1528.46s (0:25:28) ==== Just in case: the 16 tests were skipped because "JVM is not supported". I'm not sure whether the memory problem would show up in one of these tests.

barry-jin commented 3 years ago

@andrei5055 Thanks for investigating. I think the message should read "TVM is not supported". You can follow the TVM documentation to install TVM. Alternatively, I will provide a test suite without TVM support that reproduces this issue.
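
Before a long run, a quick way to tell whether the TVM tests are likely to be skipped is to check that the `tvm` package can be imported by the same interpreter that runs pytest. This is a sketch under the assumption that the skip depends on `tvm` being importable; the actual skip condition in gluon-nlp may differ. `module_available` is a hypothetical helper name.

```python
import importlib.util


def module_available(name):
    """Return True if the top-level module `name` can be found for import."""
    return importlib.util.find_spec(name) is not None


print("tvm importable:", module_available("tvm"))
```

Running pytest with the `-rs` option also prints the reason for each skipped test in the short summary.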

barry-jin commented 3 years ago

You can check out gluon-nlp at https://github.com/dmlc/gluon-nlp/commit/7910d6d247ec9cb1b51cd49d79e3d474b087b188 and run the following test suite.

git checkout 7910d6d247ec9cb1b51cd49d79e3d474b087b188
python3 -m pytest --device='gpu' --verbose --runslow tests/test_attention_cell.py tests/test_data_batchify.py tests/test_data_filtering.py tests/test_data_sampler.py tests/test_data_tokenizers.py tests/test_embedding.py tests/test_gluon_block.py tests/test_initializer.py tests/test_layers.py tests/test_loss.py tests/test_models.py tests/test_models_albert.py tests/test_models_bart.py tests/test_models_bert.py tests/test_models_electra.py tests/test_models_gpt2.py tests/test_models_roberta.py tests/test_models_transformer.py
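
Since the issue description says the error does not appear when the tests are run separately, it may be useful to run the same files both ways and compare. The sketch below only builds the command lines; `combined_command` and `per_file_commands` are hypothetical helpers, and the file list is a shortened sample of the one above.

```python
import shlex

# Options copied from the pytest command in this thread.
BASE = ["python3", "-m", "pytest", "--device=gpu", "--verbose", "--runslow"]


def combined_command(test_files):
    """One pytest process that runs every file (the failing setup)."""
    return BASE + list(test_files)


def per_file_commands(test_files):
    """One pytest process per file, so GPU memory is released in between."""
    return [BASE + [path] for path in test_files]


files = ["tests/test_models_bert.py", "tests/test_models_gpt2.py"]
for command in [combined_command(files)] + per_file_commands(files):
    # Quote each argument so the printed line can be pasted into a shell.
    print(" ".join(shlex.quote(arg) for arg in command))
```

Each printed line can be executed from the gluon-nlp checkout. If only the combined run fails, that points at state carried over between tests within one process rather than at a single test.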
andrei5055 commented 3 years ago

@barry-jin: Still cannot reproduce this problem: ========== 933 passed, 847 warnings in 2932.28s (0:48:52) =======

BTW, all warnings are of the following two types:

Type 1:

  /opt/mxnet/python/mxnet/gluon/block.py:1098: UserWarning: Parameter 0b7a2e74_c816_4146_bbb2_7973d2ca9112_gamma, 0af6619c_7075_430a_9226_8458e6ca733a_bias, c75fe6d3_81e7_4748_9894_f49abf4b5f2a_bias, 53661f2f_d20f_4c90_a539_173394b859d3_weight, 2b4ce060_94a7_4cd1_ac29_4bdc41789888_weight, e19ccd3d_cc61_44b2_ab1a_20e88f571877_bias, 8f53b519_069f_415a_bd05_c8b4ec58dd24_const, 99d015d6_eeca_4ad6_9fc6_1fb55e43b0f7_weight, 711c0a20_91e2_43c3_ba41_48f5fd2a3398_gamma, d852d48d_ca52_408a_83f3_2c11bf3a01b8_beta, e0417d39_d73a_4101_a440_f992b45a176e_weight, 3f5329d5_0903_448a_8c7a_65536aa507a1_bias, d08c8d34_3bca_4006_9843_aa5d069767cf_beta is not used by any computation. Is this intended?
    self._build_cache(*args)

Type 2:

  /opt/mxnet/python/mxnet/registry.py:108: UserWarning: New initializer mxnet.gluon.parameter.Init registered with name constant_140658119590520 isoverriding existing initializer mxnet.gluon.parameter.Init
    register(klass, name)