apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Seg fault while using randomized ReLU activation function #14447

Open anirudhacharya opened 5 years ago

anirudhacharya commented 5 years ago
import mxnet as mx
import numpy as np
from collections import namedtuple

# Minimal repro: a single LeakyReLU node with act_type='rrelu'.
Batch = namedtuple('Batch', ['data'])
data = mx.sym.Variable('data')
out = mx.sym.LeakyReLU(data=data, act_type='rrelu')
mod = mx.mod.Module(symbol=out, label_names=None)
mod.bind(data_shapes=[('data', (1, 10))])
mod.init_params()

# The crash occurs during the forward pass and surfaces when the output is fetched.
data1 = [mx.nd.ones((1, 10))]
mod.forward(Batch(data1))
print(mod.get_outputs()[0].asnumpy())

With the rrelu activation type of the LeakyReLU operator, I either get a seg fault or the script errors out with the following stack trace -

Traceback (most recent call last):
  File "/Users/aanirud/Code/scripts/bug.py", line 15, in <module>
    print(mod.get_outputs()[0].asnumpy())
  File "/Users/aanirud/anaconda2/envs/mxnet2.7/lib/python2.7/site-packages/mxnet-1.5.0-py2.7.egg/mxnet/ndarray/ndarray.py", line 1995, in asnumpy
    ctypes.c_size_t(data.size)))
  File "/Users/aanirud/anaconda2/envs/mxnet2.7/lib/python2.7/site-packages/mxnet-1.5.0-py2.7.egg/mxnet/base.py", line 252, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: [21:18:55] include/mxnet/./resource.h:155: Check failed: req.type == ResourceRequest::kTempSpace (459100160 vs. 1) 

Stack trace returned 10 entries:
[bt] (0) 0   libmxnet.so                         0x00000001063f0034 dmlc::StackTrace() + 276
[bt] (1) 1   libmxnet.so                         0x00000001063efdef dmlc::LogMessageFatal::~LogMessageFatal() + 47
[bt] (2) 2   libmxnet.so                         0x0000000106855685 mshadow::Tensor<mshadow::cpu, 1, unsigned int> mxnet::Resource::get_space_typed<mshadow::cpu, 1, unsigned int>(mshadow::Shape<1>, mshadow::Stream<mshadow::cpu>*) const + 277
[bt] (3) 3   libmxnet.so                         0x0000000107aa667e mxnet::op::LeakyReLUOp<mshadow::cpu, float>::Forward(mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 894
[bt] (4) 4   libmxnet.so                         0x0000000107a16283 mxnet::op::OperatorState::Forward(mxnet::OpContext const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&, std::__1::vector<mxnet::OpReqType, std::__1::allocator<mxnet::OpReqType> > const&, std::__1::vector<mxnet::TBlob, std::__1::allocator<mxnet::TBlob> > const&) + 1795
[bt] (5) 5   libmxnet.so                         0x0000000107871cc7 mxnet::exec::StatefulComputeExecutor::Run(mxnet::RunContext, bool) + 87
[bt] (6) 6   libmxnet.so                         0x000000010789d105 std::__1::__function::__func<mxnet::exec::GraphExecutor::CreateCachedSegOpr(unsigned long, unsigned long)::$_7, std::__1::allocator<mxnet::exec::GraphExecutor::CreateCachedSegOpr(unsigned long, unsigned long)::$_7>, void (mxnet::RunContext, mxnet::engine::CallbackOnComplete)>::operator()(mxnet::RunContext&&, mxnet::engine::CallbackOnComplete&&) + 117
[bt] (7) 7   libmxnet.so                         0x0000000107865cdc mxnet::engine::ThreadedEngine::ExecuteOprBlock(mxnet::RunContext, mxnet::engine::OprBlock*) + 652
[bt] (8) 8   libmxnet.so                         0x0000000107869421 mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::'lambda'()::operator()() const::'lambda'(std::__1::shared_ptr<dmlc::ManualEvent>)::operator()(std::__1::shared_ptr<dmlc::ManualEvent>) const + 129
[bt] (9) 9   libmxnet.so                         0x0000000107869337 std::__1::__function::__func<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::'lambda'()::operator()() const::'lambda'(std::__1::shared_ptr<dmlc::ManualEvent>), std::__1::allocator<mxnet::engine::ThreadedEnginePerDevice::PushToExecute(mxnet::engine::OprBlock*, bool)::'lambda'()::operator()() const::'lambda'(std::__1::shared_ptr<dmlc::ManualEvent>)>, void (std::__1::shared_ptr<dmlc::ManualEvent>)>::operator()(std::__1::shared_ptr<dmlc::ManualEvent>&&) + 39

Other activation types work fine.
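
The failed check concerns the resource the rrelu forward pass requests (rrelu samples its slope from a uniform distribution during training), which suggests the operator asks for a resource type it did not register. Until that is fixed, a possible workaround is to approximate rrelu with the plain 'leaky' activation type. This is a sketch under the assumption of MXNet's documented rrelu defaults (lower_bound=0.125, upper_bound=0.334); at test time rrelu uses the mean of its two bounds as a fixed slope, so this should be equivalent at inference:

import mxnet as mx
from collections import namedtuple

# Workaround sketch: replace act_type='rrelu' with 'leaky' and a fixed slope.
# Assumes the documented rrelu defaults lower_bound=0.125, upper_bound=0.334;
# at test time rrelu uses the mean of the two bounds as its slope.
Batch = namedtuple('Batch', ['data'])
slope = (0.125 + 0.334) / 2.0
data = mx.sym.Variable('data')
out = mx.sym.LeakyReLU(data=data, act_type='leaky', slope=slope)
mod = mx.mod.Module(symbol=out, label_names=None)
mod.bind(data_shapes=[('data', (1, 10))])
mod.init_params()
mod.forward(Batch([mx.nd.ones((1, 10))]))
print(mod.get_outputs()[0].asnumpy())  # completes without the resource check failure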

mxnet-label-bot commented 5 years ago

Hey, this is the MXNet Label Bot. Thank you for submitting the issue! I will try and suggest some labels so that the appropriate MXNet community members can help resolve it. Here are my recommended labels: Bug

zachgk commented 5 years ago

@mxnet-label-bot add [Backend, Operator, Bug]

Vikas-kum commented 5 years ago

@anirudh2290 can we close this, as #12894 (Training crash SSD with LeakyReLU(rrelu)) is tracking the same issue?

anirudhacharya commented 5 years ago

@Vikas89 I would prefer to keep this open, as it has a minimal reproducible example. And from the description of #12894, that issue seems broader, since it says "Replacing LeakyReLU with activations at other positions also causes the training to crash".

This issue tracks a specific bug in a specific operator, with an example that will need to be included as a test case once the fix is made. A sketch of such a test follows below.
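
A minimal sketch of what that test could look like (the helper name check_leakyrelu_forward is hypothetical; it only asserts that the forward pass completes for each activation type):

import mxnet as mx
from collections import namedtuple

def check_leakyrelu_forward(act_type):
    # Hypothetical helper: run one forward pass through a single LeakyReLU node.
    Batch = namedtuple('Batch', ['data'])
    sym = mx.sym.LeakyReLU(data=mx.sym.Variable('data'), act_type=act_type)
    mod = mx.mod.Module(symbol=sym, label_names=None)
    mod.bind(data_shapes=[('data', (1, 10))])
    mod.init_params()
    mod.forward(Batch([mx.nd.ones((1, 10))]))
    # The forward pass must complete without crashing or raising.
    assert mod.get_outputs()[0].asnumpy().shape == (1, 10)

# 'prelu' is omitted here because it introduces a learnable gamma argument.
for act in ['elu', 'leaky', 'rrelu']:
    check_leakyrelu_forward(act)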

mseth10 commented 5 years ago

@anirudhacharya, which MXNet version are you using? In case you are using master, can you specify the build flags?
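
(For completeness, one quick way to report the installed version — a trivial sketch; mx.__version__ is the standard attribute:)

import mxnet as mx
print(mx.__version__)  # e.g. '1.5.0' for a release build, or a dated string for nightlies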

anirudhacharya commented 5 years ago

FYI, PR #14582 is trying to solve this issue.

I used the latest master; I cannot recollect the compile flags I used back then. But this error is reproducible even with the latest PyPI package.

EmilPi commented 5 years ago

Hello, I installed the latest 2019-08-23 build using sudo -H pip3 install mxnet-cu100==1.6.0b20190823; the issue is still present there.