apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Unable to load vgg16 using mx.model.load_checkpoint #19597

Closed harshitshrma closed 3 years ago

harshitshrma commented 3 years ago

Hi all,

I am trying to load the VGG16 pre-trained model I downloaded from the MXNet website:

http://data.mxnet.io/models/imagenet/vgg/vgg16-0000.params
http://data.mxnet.io/models/imagenet/vgg/vgg16-symbol.json

However, I keep getting the following error at the line (symbol, argParams, auxParams) = mx.model.load_checkpoint('vgg16', 0):

mxnet.base.MXNetError: MXNetError: Failed loading Op prob of type SoftmaxOutput: [20:27:58] /home/user/mxnet/3rdparty/tvm/nnvm/src/core/op.cc:73: Check failed: op != nullptr: Operator SoftmaxOutput is not registered

Can someone help me fix this issue?

(I am using MXNet 1.7.0)
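For reference, `mx.model.load_checkpoint(prefix, epoch)` reads two files whose names are derived from its arguments. The traceback later in this thread shows the symbol half as `'%s-symbol.json' % prefix`. The sketch below reproduces that naming so you can check that both files are where Python will look before calling MXNet. It assumes the params file uses a zero-padded four-digit epoch, which matches the downloaded `vgg16-0000.params`. The helper names `checkpoint_files` and `missing_checkpoint_files` are hypothetical, not part of MXNet.

```python
import os


def checkpoint_files(prefix, epoch):
    """Return the (symbol, params) file names expected for a checkpoint.

    Assumption: params files follow the '<prefix>-<epoch:04d>.params'
    pattern, matching the downloaded 'vgg16-0000.params'.
    """
    symbol_file = "%s-symbol.json" % prefix
    params_file = "%s-%04d.params" % (prefix, epoch)
    return symbol_file, params_file


def missing_checkpoint_files(prefix, epoch):
    """List whichever expected checkpoint files do not exist on disk."""
    return [f for f in checkpoint_files(prefix, epoch) if not os.path.isfile(f)]


if __name__ == "__main__":
    print(checkpoint_files("vgg16", 0))
    print(missing_checkpoint_files("vgg16", 0))
```

In this thread the symbol file was clearly found, because the error is raised while parsing it, so missing files were not the problem here.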

github-actions[bot] commented 3 years ago

Welcome to Apache MXNet (incubating)! We are on a mission to democratize AI, and we are glad that you are contributing to it by opening this issue. Please make sure to include all the relevant context, and one of the @apache/mxnet-committers will be here shortly. If you are interested in contributing to our project, let us know! Also, be sure to check out our guide on contributing to MXNet and our development guides wiki.

samskalicky commented 3 years ago

@harshitshrma can you list step-by-step instructions to reproduce? (i.e., did you build from source or install from a pip wheel, what exact code did you use to load the model, etc.)

harshitshrma commented 3 years ago

@samskalicky I built from source and then created a symlink named mxnet in my virtual environment on Ubuntu 20.04. I get the error when loading VGG16 from disk using the provided weight and symbol files:

Code:

import mxnet as mx
(symbol, argParams, auxParams) = mx.model.load_checkpoint("vgg16", 0)

Error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/harshit/.virtualenvs/dl4cv/lib/python3.8/site-packages/mxnet/model.py", line 262, in load_checkpoint
    symbol = sym.load('%s-symbol.json' % prefix)
  File "/home/harshit/.virtualenvs/dl4cv/lib/python3.8/site-packages/mxnet/symbol/symbol.py", line 2820, in load
    check_call(_LIB.MXSymbolCreateFromFile(c_str(fname), ctypes.byref(handle)))
  File "/home/harshit/.virtualenvs/dl4cv/lib/python3.8/site-packages/mxnet/base.py", line 246, in check_call
    raise get_last_ffi_error()
mxnet.base.MXNetError: MXNetError: Failed loading Op prob of type SoftmaxOutput: [11:21:10] /home/harshit/mxnet/3rdparty/tvm/nnvm/src/core/op.cc:73: Check failed: op != nullptr: Operator SoftmaxOutput is not registered

samskalicky commented 3 years ago
mxnet.base.MXNetError: MXNetError: Failed loading Op prob of type SoftmaxOutput: [11:21:10] /home/harshit/mxnet/3rdparty/tvm/nnvm/src/core/op.cc:73: Check failed: op != nullptr: Operator SoftmaxOutput is not registered

This error means the operator isn't registered. How are you building MXNet from source? Can you give step-by-step instructions to reproduce (i.e., clone, make/cmake, PYTHONPATH, etc.)?

Is this error unique to your custom build, or does it also fail when you use the publicly available pre-built pip wheels?
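One way to see which operators a checkpoint depends on, before asking MXNet to build the graph, is to read the symbol JSON directly. The sketch below assumes the usual layout of `*-symbol.json` files: a top-level `"nodes"` list whose entries carry an `"op"` field, with `"null"` marking input variables and weights rather than operators. The helper names are hypothetical. Any name it reports that your build does not register (here `SoftmaxOutput`) would trigger the same "Operator ... is not registered" failure.

```python
import json


def required_ops(symbol_json_text):
    """Return the sorted operator names used by a serialized MXNet symbol.

    Assumes a top-level "nodes" list; nodes whose "op" is "null" are
    inputs/parameters, not operators, so they are skipped.
    """
    graph = json.loads(symbol_json_text)
    return sorted(
        {node["op"] for node in graph.get("nodes", []) if node.get("op") != "null"}
    )


def required_ops_from_file(path):
    """Convenience wrapper that reads a *-symbol.json file from disk."""
    with open(path) as f:
        return required_ops(f.read())


if __name__ == "__main__":
    # Tiny hand-written graph in the same shape as a real symbol file.
    example = json.dumps({
        "nodes": [
            {"op": "null", "name": "data", "inputs": []},
            {"op": "null", "name": "fc_weight", "inputs": []},
            {"op": "FullyConnected", "name": "fc", "inputs": [[0, 0, 0], [1, 0, 0]]},
            {"op": "SoftmaxOutput", "name": "prob", "inputs": [[2, 0, 0]]},
        ]
    })
    print(required_ops(example))  # ['FullyConnected', 'SoftmaxOutput']
```

Running `required_ops_from_file("vgg16-symbol.json")` on the downloaded file lists the operators the build has to provide. Because this only reads JSON, it works even when MXNet itself fails to import the symbol.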

harshitshrma commented 3 years ago

@samskalicky Sure, these are the steps I followed:

$ git clone --recursive https://github.com/apache/incubator-mxnet.git mxnet
$ cd mxnet
$ cp config/linux_gpu.cmake config.cmake
$ mkdir build
$ cd build
$ cmake ..
$ cmake --build . --parallel 4
$ cd ~/.virtualenvs/dl4cv/lib/python3.8/site-packages/
$ ln -s ~/mxnet/python/mxnet mxnet

I haven't tried the pre-built pip wheel yet, but I did build a few older versions of MXNet from source and got the same error.

samskalicky commented 3 years ago

@harshitshrma it looks like you build immediately after cloning, which means you're building MXNet's master branch rather than 1.7. Did you mean to run git checkout v1.7.0 after the clone?
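Because the Python package is symlinked from the source tree into the virtualenv, whatever branch is checked out in that tree is what `import mxnet` picks up. A cheap guard is to compare `mx.__version__` against the release you expect before loading 1.x checkpoints. The helper `version_matches` below is a hypothetical sketch that only compares version strings, so it runs without MXNet installed. The commented lines show where the real `mx.__version__` and the resolved import path would come in.

```python
def version_matches(actual, expected_prefix):
    """Return True if a version string starts with the expected release,
    compared component by component.

    '1.7' matches '1.7.0' and '1.7.0.post1' but not '1.70.0'.
    """
    actual_parts = actual.split(".")
    expected_parts = expected_prefix.split(".")
    return actual_parts[: len(expected_parts)] == expected_parts


if __name__ == "__main__":
    # Real usage (requires MXNet):
    #   import os
    #   import mxnet as mx
    #   print(os.path.realpath(mx.__file__))  # follows the symlink to the source tree
    #   if not version_matches(mx.__version__, "1.7"):
    #       raise RuntimeError("expected MXNet 1.7.x, got " + mx.__version__)
    print(version_matches("1.7.0", "1.7"))
```

Comparing components rather than using a plain `startswith` avoids treating a hypothetical "1.70" as a 1.7 release.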

harshitshrma commented 3 years ago

@samskalicky Thanks a lot. You're right, I had forgotten to check out v1.7.0. It works now (though I had to downgrade the CUDA toolkit from 11.1 to 10.2).

samskalicky commented 3 years ago

Cool, thanks for the update. We added support for CUDA 11 in the v1.8.x branch, so you can try that one if you need a newer version of CUDA.

harshitshrma commented 3 years ago

@samskalicky Sure, I'll try that. Thanks again.