Garbage collection can trigger _PyObject_AssertFailed if Gluon Block hooks are used.
Error Message
PYTHONTRACEMALLOC=1 ~/.pyenv/versions/3.8.2-debug/bin/python ~/test.py
Modules/gcmodule.c:110: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
Memory block allocated at (most recent call first):
File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/utils.py", line 398
object address : 0x7f3180489bf0
object refcount : 1
object type : 0x5568872ac1a0
object type name: weakref
object repr : <weakref at 0x7f3180489bf0; to 'collections.OrderedDict' at 0x7f31873d9550>
Fatal Python error: _PyObject_AssertFailed
Python runtime state: initialized
Current thread 0x00007f31de3ad3c0 (most recent call first):
File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 620 in register_forward_hook
File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 789 in _register_summary_hook
File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 637 in apply
File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 636 in apply
File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 636 in apply
File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 798 in summary
File "/home/ubuntu/test.py", line 5 in <module>
zsh: abort (core dumped) PYTHONTRACEMALLOC=1 ~/.pyenv/versions/3.8.2-debug/bin/python ~/test.py
To Reproduce
To reproduce deterministically, apply this patch, which forces a garbage collection immediately after a forward hook is attached:
diff --git a/python/mxnet/gluon/block.py b/python/mxnet/gluon/block.py
index e925b31a2..220cc55ef 100644
--- a/python/mxnet/gluon/block.py
+++ b/python/mxnet/gluon/block.py
@@ -26,7 +26,7 @@ import warnings
 import re
 from collections import OrderedDict, defaultdict
 import numpy as np
-
+import gc
 from ..base import mx_real_t, MXNetError
 from .. import symbol, ndarray, initializer, np_symbol
 from ..symbol import Symbol
@@ -617,6 +617,7 @@ class Block(object):
         """
         handle = HookHandle()
         handle.attach(self._forward_hooks, hook)
+        gc.collect()
         return handle
 
     def apply(self, fn):
The issue will also occur non-deterministically when garbage collection is triggered in the background.
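For readers without an MXNet checkout, the pattern the error message points at can be sketched with the standard library alone: a handle object keeps a weak reference to an OrderedDict of hooks, and a collection runs right after the hook is attached. This is only a sketch under assumptions. SketchHandle and register_and_collect are hypothetical names, not MXNet code, and whether this stand-alone version hits the same assertion on a debug build has not been verified here.

```python
import gc
import weakref
from collections import OrderedDict


class SketchHandle:
    """Hypothetical stand-in for a hook handle; not MXNet's HookHandle."""

    def __init__(self):
        self._hooks_ref = None
        self._id = None

    def attach(self, hooks, hook):
        # Store the hook and keep only a weak reference to the dict,
        # mirroring the weakref-to-OrderedDict named in the error message.
        self._id = id(hook)
        hooks[self._id] = hook
        self._hooks_ref = weakref.ref(hooks)

    def detach(self):
        # Resolve the weak reference; it yields None if the dict is gone.
        hooks = self._hooks_ref() if self._hooks_ref is not None else None
        if hooks is not None and self._id in hooks:
            del hooks[self._id]


def register_and_collect(hooks, hook):
    handle = SketchHandle()
    handle.attach(hooks, hook)
    gc.collect()  # same placement as the gc.collect() added by the patch
    return handle


forward_hooks = OrderedDict()
handle = register_and_collect(forward_hooks, lambda *args: None)
```

On a release build this simply runs to completion; the assertion in the log comes from checks that are only compiled into debug builds of Python.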
Steps to reproduce
Use a debug build of Python. For example, with pyenv, run pyenv install --debug 3.8.2.
Apply the patch above to MXNet.
Run the following script with the debug interpreter:
import mxnet as mx
net = mx.gluon.nn.Dense(128)
net.initialize()
net.summary(mx.nd.ones((8, 64)))
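The script above relies on the patch to force a collection at the right moment. Since the issue also occurs when garbage collection runs in the background, one untested alternative to patching MXNet might be to lower the generation-0 threshold so that collections run far more often while the script executes. The helper below is a hypothetical sketch (aggressive_gc is not part of MXNet or the standard library), and it only raises the chance of a collection landing inside register_forward_hook; it does not guarantee it.

```python
import gc
from contextlib import contextmanager


@contextmanager
def aggressive_gc(threshold0=1):
    """Temporarily make generation-0 collections run very frequently."""
    old = gc.get_threshold()
    gc.set_threshold(threshold0, old[1], old[2])
    try:
        yield
    finally:
        # Restore the previous thresholds even if the body raises.
        gc.set_threshold(*old)


before = gc.get_threshold()
with aggressive_gc():
    # Hypothetical usage: the reproduction script's four lines would go here.
    inside = gc.get_threshold()
```

Expect the script to run noticeably slower inside the with block, because the collector is invoked almost continuously.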
Environment
----------Python Info----------
Version : 3.8.2
Compiler : GCC 7.4.0
Build : ('default', 'Feb 27 2020 20:16:52')
Arch : ('64bit', 'ELF')
------------Pip Info-----------
Version : 19.2.3
Directory : /home/ubuntu/.pyenv/versions/3.8.2-debug/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version : 1.6.0
Directory : /home/ubuntu/src/mxnet-master/python/mxnet
Num GPUs : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform : Linux-4.15.0-1058-aws-x86_64-with-glibc2.27
system : Linux
node : ip-172-31-95-96
release : 4.15.0-1058-aws
version : #60-Ubuntu SMP Wed Jan 15 22:35:20 UTC 2020
----------Hardware Info----------
machine : x86_64
processor : x86_64
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 24
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 85
Model name: Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping: 7
CPU MHz: 1439.236
BogoMIPS: 5999.99
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 1024K
L3 cache: 36608K
NUMA node0 CPU(s): 0-23,48-71
NUMA node1 CPU(s): 24-47,72-95
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0032 sec, LOAD: 0.4806 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0002 sec, LOAD: 0.4295 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1424 sec, LOAD: 0.4435 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0203 sec, LOAD: 0.0935 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0028 sec, LOAD: 0.2676 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0541 sec, LOAD: 0.3424 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0021 sec, LOAD: 0.0851 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0016 sec, LOAD: 0.0299 sec.