apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0
20.78k stars 6.79k forks source link

Gluon Block hooks reference counting bug #17712

Closed leezu closed 4 years ago

leezu commented 4 years ago

Description

Garbage collection can trigger _PyObject_AssertFailed if Gluon Block hooks are used.

Error Message

PYTHONTRACEMALLOC=1 ~/.pyenv/versions/3.8.2-debug/bin/python ~/test.py                                                                                 ~/src/mxnet-master/python master + ip-172-31-95-96
Modules/gcmodule.c:110: gc_decref: Assertion "gc_get_refs(g) > 0" failed: refcount is too small
Memory block allocated at (most recent call first):
  File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/utils.py", line 398

object address  : 0x7f3180489bf0
object refcount : 1
object type     : 0x5568872ac1a0
object type name: weakref
object repr     : <weakref at 0x7f3180489bf0; to 'collections.OrderedDict' at 0x7f31873d9550>

Fatal Python error: _PyObject_AssertFailed
Python runtime state: initialized

Current thread 0x00007f31de3ad3c0 (most recent call first):
  File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 620 in register_forward_hook
  File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 789 in _register_summary_hook
  File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 637 in apply
  File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 636 in apply
  File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 636 in apply
  File "/home/ubuntu/src/mxnet-master/python/mxnet/gluon/block.py", line 798 in summary
  File "/home/ubuntu/test.py", line 5 in <module>
zsh: abort (core dumped)  PYTHONTRACEMALLOC=1 ~/.pyenv/versions/3.8.2-debug/bin/python ~/test.py

To Reproduce

To deterministically reproduce, apply this patch

diff --git a/python/mxnet/gluon/block.py b/python/mxnet/gluon/block.py
index e925b31a2..220cc55ef 100644
--- a/python/mxnet/gluon/block.py
+++ b/python/mxnet/gluon/block.py
@@ -26,7 +26,7 @@ import warnings
 import re
 from collections import OrderedDict, defaultdict
 import numpy as np
-
+import gc
 from ..base import mx_real_t, MXNetError
 from .. import symbol, ndarray, initializer, np_symbol
 from ..symbol import Symbol
@@ -617,6 +617,7 @@ class Block(object):
         """
         handle = HookHandle()
         handle.attach(self._forward_hooks, hook)
+        gc.collect()
         return handle

     def apply(self, fn):

The issue will also occur non-deterministically when garbage collection is triggered in the background.

Steps to reproduce

  1. Use Python debug build. For example, with pyenv, run pyenv install --debug 3.8.2.
  2. Apply patch to MXNet
  3. Run
import mxnet as mx

net = mx.gluon.nn.Dense(128)
net.initialize()
net.summary(mx.nd.ones((8, 64)))

Environment

----------Python Info----------
Version      : 3.8.2
Compiler     : GCC 7.4.0
Build        : ('default', 'Feb 27 2020 20:16:52')
Arch         : ('64bit', 'ELF')
------------Pip Info-----------
Version      : 19.2.3
Directory    : /home/ubuntu/.pyenv/versions/3.8.2-debug/lib/python3.8/site-packages/pip
----------MXNet Info-----------
Version      : 1.6.0
Directory    : /home/ubuntu/src/mxnet-master/python/mxnet
Num GPUs     : 0
Hashtag not found. Not installed from pre-built package.
----------System Info----------
Platform     : Linux-4.15.0-1058-aws-x86_64-with-glibc2.27
system       : Linux
node         : ip-172-31-95-96
release      : 4.15.0-1058-aws
version      : #60-Ubuntu SMP Wed Jan 15 22:35:20 UTC 2020
----------Hardware Info----------
machine      : x86_64
processor    : x86_64
Architecture:        x86_64
CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              96
On-line CPU(s) list: 0-95
Thread(s) per core:  2
Core(s) per socket:  24
Socket(s):           2
NUMA node(s):        2
Vendor ID:           GenuineIntel
CPU family:          6
Model:               85
Model name:          Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz
Stepping:            7
CPU MHz:             1439.236
BogoMIPS:            5999.99
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            1024K
L3 cache:            36608K
NUMA node0 CPU(s):   0-23,48-71
NUMA node1 CPU(s):   24-47,72-95
Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single pti fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves ida arat pku ospke avx512_vnni
----------Network Test----------
Setting timeout: 10
Timing for MXNet: https://github.com/apache/incubator-mxnet, DNS: 0.0032 sec, LOAD: 0.4806 sec.
Timing for GluonNLP GitHub: https://github.com/dmlc/gluon-nlp, DNS: 0.0002 sec, LOAD: 0.4295 sec.
Timing for GluonNLP: http://gluon-nlp.mxnet.io, DNS: 0.1424 sec, LOAD: 0.4435 sec.
Timing for D2L: http://d2l.ai, DNS: 0.0203 sec, LOAD: 0.0935 sec.
Timing for D2L (zh-cn): http://zh.d2l.ai, DNS: 0.0028 sec, LOAD: 0.2676 sec.
Timing for FashionMNIST: https://repo.mxnet.io/gluon/dataset/fashion-mnist/train-labels-idx1-ubyte.gz, DNS: 0.0541 sec, LOAD: 0.3424 sec.
Timing for PYPI: https://pypi.python.org/pypi/pip, DNS: 0.0021 sec, LOAD: 0.0851 sec.
Timing for Conda: https://repo.continuum.io/pkgs/free/, DNS: 0.0016 sec, LOAD: 0.0299 sec.
leezu commented 4 years ago

Seems to be an upstream bug, so I opened https://bugs.python.org/issue39778