apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Prediction fails with reused module #7006

Closed dougsm closed 6 years ago

dougsm commented 7 years ago


Environment info

Operating System: Ubuntu 16.04

Compiler: GCC 5.4.0

Package used (Python/R/Scala/Julia): Python

MXNet version: 0.10.1

Or if installed from source:

MXNet commit hash (git rev-parse HEAD): 202de02cd

If you are using python package, please provide

Python version and distribution: Python 2.7.12

Error Message:

No error message. The program hangs at the self.mod.predict(it) line the second time execute() is called. Recreating the module on every call avoids the hang, but should be unnecessary. Running prediction in a loop over a collection of separate images also does not hang.

Minimum reproducible example


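import datetime
import os

import mxnet as mx
import rospy

# misc, syms, create_main, SNAPSHOT_FOLDER, MODEL_NAME and rgbd_object_proposal
# come from the reporter's own project and are not shown in this issue.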

CLASSNUM = 59
WORKSPACE = 1000

def create_infer(class_num, workspace=512):
    data = mx.symbol.Variable("data")
    net = create_main(data, class_num, workspace=workspace)
    net = mx.symbol.SoftmaxActivation(net, mode="channel")
    up4 = syms.upsample(net, name="up4", scale=2, num_filter=class_num)
    up5 = syms.upsample(up4, name="up5", scale=2, num_filter=class_num)
    return up5

class RefineNet(object):
    def __init__(self):
        self.ctx = [mx.gpu(int(0))]
        seg_net_prefix = os.path.join(SNAPSHOT_FOLDER, MODEL_NAME)
        self.arg_dict, self.aux_dict, _ = misc.load_checkpoint(seg_net_prefix, self.epoch, load_symbol=False)

        seg_net = create_infer(CLASSNUM, WORKSPACE)
        self.mod = mx.module.Module(seg_net, data_names=('data',), label_names=(), context=self.ctx)
        self.mod.bind(data_shapes=[("data", (1, 3, 640, 640))], for_training=False, grad_req='null')
        self.mod.init_params(arg_params=self.arg_dict, aux_params=self.aux_dict, allow_missing=True)

        self.classification_service = rospy.Service('/mxnet_classification', rgbd_object_proposal, self.execute)

    def execute(self, req):
        im = self.bridge.imgmsg_to_cv2(req.color, desired_encoding="passthrough")
        t0 = datetime.datetime.now()
        self.mod.reshape([("data", im.shape)])
        t1 = datetime.datetime.now()
        print('module reshape took %s' % (t1 - t0))

        t0 = datetime.datetime.now()
        self.mod.bind(data_shapes=[("data", (1, 3, 640, 640))], for_training=False, grad_req='null')
        t1 = datetime.datetime.now()
        print('module reset bind took %s' % (t1 - t0))

        t0 = datetime.datetime.now()
        it = mx.io.NDArrayIter(data = [mx.nd.array(im)])
        t1 = datetime.datetime.now()
        print('create data iter took %s' % (t1 - t0))

        t0 = datetime.datetime.now()
        pred = self.mod.predict(it)[0]
        t1 = datetime.datetime.now()
        print('module predict took %s' % (t1 - t0))
        pred = pred.asnumpy().squeeze()
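
The repeated t0/t1 bookkeeping in execute() could be factored into a small helper. This is only a sketch: `timed` and its `sink` parameter are hypothetical names that are not part of the original script, and it assumes Python 3 (the report uses Python 2.7, where `time.perf_counter` is unavailable).

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Time the enclosed block and report the duration through `sink`.

    `timed` and `sink` are hypothetical names used for illustration only.
    """
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report even if the block raises, so a failing step is still timed.
        elapsed = time.perf_counter() - start
        sink('%s took %.6f s' % (label, elapsed))


# Stand-in for a step such as self.mod.reshape(...) or self.mod.predict(it).
with timed('dummy step'):
    time.sleep(0.01)
```

Each timed step in execute() would then become a single `with timed('module predict'):` block around the call being measured.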

What have you tried to solve it?

Recreating the entire module object on every call solves it, but adds about a second of computation time that should be unnecessary. I've narrowed the hang down to a call to mod.predict(), mod.forward(), or mod.bind(force_rebind=True).
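
One way to confirm which of those calls blocks, without freezing the whole service, might be to run each candidate under a timeout. The sketch below uses only the standard library. `call_with_timeout` is a hypothetical helper, not an MXNet or rospy API. It runs the call on a separate daemon thread, which may itself change how a GPU library behaves, so treat a result as a hint rather than proof.

```python
import threading
import time


def call_with_timeout(fn, timeout_s, *args, **kwargs):
    """Run fn(*args, **kwargs) on a daemon thread and wait up to timeout_s.

    Returns (finished, result). If the call is still running after the
    timeout, finished is False and result is None. The worker thread is
    left running, since Python cannot forcibly stop a thread.
    """
    outcome = {}

    def runner():
        outcome['result'] = fn(*args, **kwargs)

    worker = threading.Thread(target=runner)
    worker.daemon = True  # a hung call must not block interpreter exit
    worker.start()
    worker.join(timeout_s)
    finished = not worker.is_alive()
    return finished, outcome.get('result')


# Stand-in for a call suspected of hanging, e.g. self.mod.predict(it).
finished, _ = call_with_timeout(time.sleep, 0.05, 1.0)
print('call finished within timeout: %s' % finished)
```

If the wrapped function raises, the worker thread ends and the helper reports finished as True with a result of None, so a real diagnostic would also want to capture and report the exception.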

srawat commented 6 years ago

Were you able to find the cause of the issue?