FCInter opened 5 years ago
Thank you for submitting the issue! I'm labeling it so the MXNet community members can help resolve it.
@mxnet-label-bot add [Python]
Can you try using self._mod.set_params (https://mxnet.incubator.apache.org/api/python/module/module.html#mxnet.module.BaseModule.set_params) instead of self._mod.init_params?
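A minimal sketch of that approach, with the checkpoint prefix, epoch, and input shape as placeholders:

import mxnet as mx

# Placeholders: replace the prefix/epoch and input shape with your own.
sym, arg_params, aux_params = mx.model.load_checkpoint('model', 0)
mod = mx.mod.Module(symbol=sym, label_names=None, context=mx.cpu())
mod.bind(data_shapes=[('data', (1, 3, 224, 224))], for_training=False)
# set_params assigns the loaded weights directly instead of running an initializer.
mod.set_params(arg_params, aux_params, allow_missing=False)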
@zachgk I have tried that; it does not work.
Description
I trained a model and used it to perform prediction. While building the predictor, if I set the argument for_training=False, the prediction results are bad, as bad as those produced by a randomly initialized model.
Environment info (Required)
Package used (Python/R/Scala/Julia): Python
Build info (Required if built from source)
Compiler (gcc/clang/mingw/visual studio): gcc (Ubuntu 5.4.0-6ubuntu1~16.04.11) 5.4.0 20160609
Build config: not built from source; installed via pip.
Error Message:
No error message; the problem shows up in the prediction results. Please see the minimum reproducible example below.
Minimum reproducible example
This is how I built the predictor:
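(The snippet below is a minimal reconstruction from the description; prefix, epoch, provide_data, provide_label, and ctx are placeholders.)

import mxnet as mx

class Predictor(object):
    def __init__(self, prefix, epoch, provide_data, provide_label, ctx):
        sym, arg_params, aux_params = mx.model.load_checkpoint(prefix, epoch)
        self._mod = mx.mod.Module(symbol=sym,
                                  data_names=[d[0] for d in provide_data],
                                  label_names=[l[0] for l in provide_label],
                                  context=ctx)
        # The problematic line: for_training=False leads to bad predictions.
        self._mod.bind(provide_data, provide_label, for_training=True)
        self._mod.init_params(arg_params=arg_params, aux_params=aux_params)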
The problem happens on the second-to-last line, i.e.

self._mod.bind(provide_data, provide_label, for_training=True)

If I set for_training=True, the prediction results look good. But if I set for_training=False, the prediction results look quite bad: not a single object is detected across all hundreds of test images. It looks as if, with for_training=False, the model parameters were randomly initialized instead of loaded from the saved .params file.

What have you tried to solve it?
I tried inspecting the parameters after init_params: right after the call, I used self._mod.save_params(tar_filename) to write the parameters to a file. I did this for both for_training=True and for_training=False. Strangely, the two saved parameter files are exactly equal, which can be checked with the sketch below.
I managed to locate the problem as follows. My code is based on the project Deep Feature Flow for Video Recognition. The model is trained from a pre-trained ResNet checkpoint, and it is the ResNet checkpoint that causes the problem.
Specifically, take ResNet-50 as an example: I have two versions of the code and two checkpoints for ResNet-50. I test them in the same project, but the results are very different.
The first checkpoint was converted from Caffe using the converter, while the second checkpoint was downloaded from here.
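To check that the two checkpoints really contain the same parameters, a minimal comparison sketch (the prefixes and epoch numbers below are placeholders):

import mxnet as mx

# Placeholders: replace with the real checkpoint prefixes and epochs.
_, args_a, auxs_a = mx.model.load_checkpoint('resnet50-caffe-converted', 0)
_, args_b, auxs_b = mx.model.load_checkpoint('resnet50-downloaded', 0)

print('arg params only in the first:', sorted(set(args_a) - set(args_b)))
print('arg params only in the second:', sorted(set(args_b) - set(args_a)))
for k in set(args_a) & set(args_b):
    if args_a[k].shape != args_b[k].shape:
        print('shape mismatch:', k, args_a[k].shape, args_b[k].shape)
# aux params (e.g. BatchNorm moving mean/var) can be compared the same way.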
I find that the first checkpoint does not have the for_training problem: no matter whether it is set to True or False, the prediction results are the same and look good. But the second checkpoint does have the for_training problem. Why does this happen? They are the same model, and both can produce good prediction results, so why does the second one require for_training to be set to True? This is also counter-intuitive: if we are doing prediction, shouldn't we set for_training to False?

I can also provide the MXNet graph code of the two checkpoints if necessary; I don't copy it here because it is too long.
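Instead, here is a sketch of how the two graphs could be dumped for a side-by-side diff (the file names are placeholders):

import mxnet as mx

# Placeholders: the two checkpoints' symbol files.
for name in ('resnet50-caffe-converted', 'resnet50-downloaded'):
    sym = mx.sym.load(name + '-symbol.json')
    outputs = sym.get_internals().list_outputs()
    print(name, ':', len(outputs), 'internal outputs')
    with open(name + '-graph.txt', 'w') as f:
        f.write('\n'.join(outputs))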
I'm sincerely looking forward to any help; this is very important to my current work.
Thank you all for helping me!