apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

`mx.contrib.autograd.mark_variables` segmentation fault #18944

Open DNXie opened 4 years ago

DNXie commented 4 years ago

Description


mx.contrib.autograd.mark_variables throws a segmentation fault when variables and gradients are passed as NDArrays. The bug is also reproducible in the nightly version.

Error Message



Segmentation fault: 11

Stack trace:
  [bt] (0) /root/miniconda3/lib/python3.7/site-packages/mxnet/libmxnet.so(+0x3c27360) [0x7ff946381360]

To Reproduce


import mxnet as mx
# Segfaults with mxnet 1.6.0 (also reproducible with the nightly build):
mx.contrib.autograd.mark_variables(variables=mx.nd.ones((2)), gradients=mx.nd.ones((1)))

Environment

We recommend using our script for collecting the diagnostic information. Run the following command and paste the outputs below:

curl --retry 10 -s https://raw.githubusercontent.com/dmlc/gluon-nlp/master/tools/diagnose.py | python


OS: Ubuntu 18.04
Python: 3.7.6
pip: 20.0.2
numpy: 1.18.5
mxnet: 1.6.0


DNXie commented 4 years ago

The bug is also reproducible in the nightly version.

szha commented 4 years ago

Thanks for reporting. While I don't think it should segfault and abort, in your program neither the variable nor the gradient holds any other reference, so their reference counts drop to zero right after creation and autograd receives memory that has already been freed.
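To illustrate the reference-counting point, here is a minimal sketch (not from the thread; the names x and dx are illustrative, and it assumes the contrib API accepts lists of NDArrays as its docstring indicates): binding the arrays to names and passing them as lists means the handles autograd receives still point to live NDArrays rather than freed memory.

import mxnet as mx

# Minimal sketch, assuming mx.contrib.autograd.mark_variables accepts lists of NDArrays.
# Keeping named references means the arrays stay alive while autograd holds their handles.
x = mx.nd.ones((2,))
dx = mx.nd.zeros_like(x)
mx.contrib.autograd.mark_variables(variables=[x], gradients=[dx])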

DNXie commented 4 years ago

@szha But it is not good to let a program crash just because of invalid input, right? It could probably be handled with a Python exception instead. Thanks!
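As a rough illustration of the kind of guard DNXie is asking for, the sketch below shows how the Python wrapper could validate its arguments and raise a clear exception instead of handing dangling handles to the C API. This is hypothetical: the helper name and behaviour are made up for illustration and are not MXNet's actual implementation.

from mxnet.ndarray import NDArray

# Hypothetical validation helper (not MXNet code): normalise single NDArrays into
# lists, which also keeps live references, and fail loudly on mismatched inputs.
def _check_mark_variables_args(variables, gradients):
    if isinstance(variables, NDArray):
        variables = [variables]
    if isinstance(gradients, NDArray):
        gradients = [gradients]
    if len(variables) != len(gradients):
        raise ValueError(
            "variables and gradients must have the same length, "
            "got %d and %d" % (len(variables), len(gradients)))
    return variables, gradients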

leezu commented 4 years ago

@DNXie would you like to contribute a PR to fix the issue? Please build MXNet from source and add the CMAKE_BUILD_TYPE option when calling cmake: cmake -DCMAKE_BUILD_TYPE=Debug ... Then the stack trace above will point you to the function that needs to change to avoid the crash.

szha commented 4 years ago

The mx.autograd version doesn't seem to have the same problem, since the following works:

import mxnet as mx
mx.autograd.mark_variables(variables=mx.nd.ones((2)), gradients=mx.nd.ones((1)))

I'm removing the contrib.autograd module in #19046 on master.