apache / mxnet

Lightweight, Portable, Flexible Distributed/Mobile Deep Learning with Dynamic, Mutation-aware Dataflow Dep Scheduler; for Python, R, Julia, Scala, Go, Javascript and more
https://mxnet.apache.org
Apache License 2.0

Predict result is bad when setting for_training=False #13783

Open FCInter opened 5 years ago

FCInter commented 5 years ago

Description

I trained a model and used it to run prediction. While building the predictor, if I set the argument for_training=False, the prediction results are bad, as bad as those produced by a randomly initialized model.

Environment info (Required)

----------Python Info----------
('Version      :', '2.7.12')
('Compiler     :', 'GCC 5.4.0 20160609')
('Build        :', ('default', 'Dec  4 2017 14:50:18'))
('Arch         :', ('64bit', ''))
------------Pip Info-----------
('Version      :', '18.1')
('Directory    :', '/path/to/mx_env/local/lib/python2.7/site-packages/pip')
----------MXNet Info-----------
('Version      :', '1.3.0')
('Directory    :', '/path/to/mx_env/local/lib/python2.7/site-packages/mxnet')
('Commit Hash   :', 'b3be92f4a48bce62a5a8424271871c2f81c8f7f1')
----------System Info----------
('Platform     :', 'Linux-4.4.0-87-generic-x86_64-with-Ubuntu-16.04-xenial')
('system       :', 'Linux')
('node         :', 'B22-C09-G5500-01-GPU')
('release      :', '4.4.0-87-generic')
('version      :', '#110-Ubuntu SMP Tue Jul 18 12:55:35 UTC 2017')
----------Hardware Info----------
('machine      :', 'x86_64')
('processor    :', 'x86_64')
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                88
On-line CPU(s) list:   0-87
Thread(s) per core:    2
Core(s) per socket:    22
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 79
Model name:            Intel(R) Xeon(R) CPU E5-2699A v4 @ 2.40GHz
Stepping:              1
CPU MHz:               2400.093
CPU max MHz:           3600.0000
CPU min MHz:           1200.0000
BogoMIPS:              4801.21
Virtualization:        VT-x
Hypervisor vendor:     vertical
Virtualization type:   full
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              56320K
NUMA node0 CPU(s):     0-21,44-65
NUMA node1 CPU(s):     22-43,66-87

Package used (Python/R/Scala/Julia): Python

Build info (Required if built from source)

Compiler (gcc/clang/mingw/visual studio): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609

Build config: not built from source; MXNet was installed via pip.

Error Message:

No error message; the problem shows up in the prediction results. Please see the minimum reproducible example below.

Minimum reproducible example

This is how I built the predictor:

import mxnet as mx
from core.module import MutableModule  # provided by the Deep Feature Flow codebase; exact import path depends on that project's layout

class Predictor(object):
    def __init__(self, symbol, data_names, label_names,
                 context=mx.cpu(), max_data_shapes=None,
                 provide_data=None, provide_label=None,
                 arg_params=None, aux_params=None):
        self._mod = MutableModule(symbol, data_names, label_names,
                                  context=context, max_data_shapes=max_data_shapes)
        # Binding with for_training=True gives good predictions;
        # with for_training=False the predictions look random (see below).
        self._mod.bind(provide_data, provide_label, for_training=True)
        self._mod.init_params(arg_params=arg_params, aux_params=aux_params)

The problem is on the second-to-last line, i.e. self._mod.bind(provide_data, provide_label, for_training=True). If I set for_training=True, the prediction results look good. But if I set for_training=False, the prediction results are very bad: not a single object is detected across hundreds of test images. It looks as if, with for_training=False, the model parameters are randomly initialized instead of being loaded from the saved .params file.
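
For reference, this is roughly how the predictor is constructed from the saved checkpoint; the checkpoint prefix, epoch, data names, and shapes below are placeholders rather than my actual values:

import mxnet as mx

# Placeholder prefix/epoch; the real values point to the trained DFF model.
sym, arg_params, aux_params = mx.model.load_checkpoint('path/to/model_prefix', 0)

provide_data = [('data', (1, 3, 600, 1000))]   # example shape only
provide_label = []

predictor = Predictor(sym, data_names=['data'], label_names=[],
                      context=mx.gpu(0),
                      provide_data=provide_data, provide_label=provide_label,
                      arg_params=arg_params, aux_params=aux_params)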

What have you tried to solve it?

I dumped the parameters right after init_params: I call self._mod.save_params(tar_filename) and save them to a file, once with for_training=True and once with for_training=False. Strangely, the two saved parameter sets are exactly equal.
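
Concretely, the check looks like this (file names are placeholders); it loads the two saved .params files and compares every array:

import mxnet as mx
import numpy as np

# Parameters saved right after init_params, once per bind setting.
params_true = mx.nd.load('saved_with_for_training_true.params')
params_false = mx.nd.load('saved_with_for_training_false.params')

assert set(params_true.keys()) == set(params_false.keys())
diffs = [k for k in params_true
         if not np.array_equal(params_true[k].asnumpy(), params_false[k].asnumpy())]
print(diffs)  # prints [] -- the two saved parameter sets are exactly equal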

I managed to narrow down the problem as follows.

My code is based on the project Deep Feature Flow for Video Recognition. The model is trained from a pre-trained ResNet checkpoint, and it is this ResNet checkpoint that causes the problem.

Specifically, take ResNet-50 as an example. I have two sets of symbol code and two checkpoints for ResNet-50. I test them in the same project, but the results are very different.

The first checkpoint was converted from Caffe using the converter, while the second checkpoint was downloaded from here.

I find that the first checkpoint does not have the for_training problem: no matter whether it is set to True or False, the prediction results are the same and look good. The second checkpoint, however, does have the for_training problem. Why does this happen? They are the same model, and both can produce good predictions, so why does the second one require for_training to be set to True? This is also counter-intuitive: if we are doing prediction, shouldn't we set for_training to False?
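
If it helps the diagnosis, this is the kind of comparison I can run on the two checkpoints (prefixes and epochs are placeholders). One thing worth checking is the auxiliary parameters, e.g. the BatchNorm moving_mean/moving_var, since those are only consulted when the module is bound with for_training=False:

import mxnet as mx

# Placeholder prefixes/epochs for the two ResNet-50 checkpoints.
_, args_caffe, auxs_caffe = mx.model.load_checkpoint('resnet50_converted_from_caffe', 0)
_, args_zoo, auxs_zoo = mx.model.load_checkpoint('resnet50_downloaded', 0)

# Names that exist in only one of the two checkpoints.
print(sorted(set(args_caffe) ^ set(args_zoo)))  # argument parameters (weights, biases, gammas, betas)
print(sorted(set(auxs_caffe) ^ set(auxs_zoo)))  # auxiliary states (BatchNorm moving statistics)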

I can also provide the MXNet symbol (graph) code of the two checkpoints if necessary; I don't paste it here because it is too long.

I'm sincerely looking forward to any help. This is very important to my current work.

Thank you all for helping me!!!

zachgk commented 5 years ago

Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it.

@mxnet-label-bot add [Python]

zachgk commented 5 years ago

Can you try using self._mod.set_params (https://mxnet.incubator.apache.org/api/python/module/module.html#mxnet.module.BaseModule.set_params) instead of self._mod.init_params?
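
Something like the following sketch, assuming arg_params and aux_params are the dictionaries loaded from your checkpoint:

self._mod.bind(provide_data, provide_label, for_training=False)
self._mod.set_params(arg_params, aux_params, allow_missing=False, force_init=True)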

FCInter commented 5 years ago

@zachgk I have tried that. It does not work.