Error while training the code for fcn32 model

apache / mxnet


Error while training the code for fcn32 model #1619

Status: Closed (closed by Viswa14 8 years ago)

Viswa14 commented 8 years ago

  File "fcn_xs.py", line 57, in main
    epoch_end_callback = mx.callback.do_checkpoint(fcnxs_model_prefix))
  File "solver.py", line 72, in fit
    aux_states=self.aux_params)
  File "symbol.py", line 718, in bind
    args_handle, args = self._get_ndarray_inputs('args', args, listed_arguments, False)
  File "symbol.py", line 585, in _get_ndarray_inputs
    raise ValueError('Must specify all the arguments in %s' % arg_key)
ValueError: Must specify all the arguments in args

I get this error when I try to train the fcn32s model using VGG_FC_ILSVRC_16_layers as the prefix. I believe the pretrained VGG16 model that is provided does not contain 'bigscore_bias'. Can anyone help with this?
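The ValueError above is raised when the args passed to bind do not cover every name the symbol lists. A minimal sketch for spotting such names before binding, in plain Python: `listed_arguments` stands in for what `symbol.list_arguments()` would return and `arg_params` for the dictionary loaded from the checkpoint. The helper `find_missing_args`, the default input names, and every example name other than `bigscore_bias` are hypothetical.

```python
def find_missing_args(listed_arguments, arg_params,
                      input_names=("data", "softmax_label")):
    """Return names the symbol lists but the checkpoint does not provide.

    input_names are fed at run time (data and label), so they are not
    expected in the checkpoint. The defaults are an assumption.
    """
    provided = set(arg_params) | set(input_names)
    return [name for name in listed_arguments if name not in provided]


# Illustrative values only; real names come from the symbol and checkpoint.
listed = ["data", "conv1_1_weight", "conv1_1_bias",
          "bigscore_weight", "bigscore_bias", "softmax_label"]
loaded = {"conv1_1_weight": None, "conv1_1_bias": None,
          "bigscore_weight": None}

missing = find_missing_args(listed, loaded)
print(missing)
```

Any name reported this way would need an initial value added to the args dictionary before bind is called.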

Viswa14 commented 8 years ago

Thank you for helping me out, @Zhaw. This lets me train the model successfully. But when I test with image_segmentaion.py, after changing the model_prefix and epoch parameters appropriately, the result I obtain is a black image. Can you provide any insight into why that happens? I am not sure why all 0s are returned. Is any change needed in the test code for other models? This test code produces the correct result for the pretrained FCN8s model provided by the author.

The values I get for data.shape, label.shape, out.shape and out_image are:

data.shape:  (1L, 3L, 335L, 500L)
label.shape: (1, 167500L)
out.shape:   (1L, 21L, 335L, 500L)
out_image:
[[0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 ...,
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]
 [0 0 0 ..., 0 0 0]]

@tornadomeet @tqchen : Kindly provide suggestions on this.

zhaw commented 8 years ago

What's your training accuracy? If your training accuracy is not low, then I have no idea what could cause this problem. If your training accuracy is low and stays the same, this is probably because you set the learning rate too high and some parameters became NaN. This will make your model predict all zeros. I don't think you need to change anything in the test code if you use your own model.
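A quick way to test the NaN hypothesis is to scan the trained parameters for NaN values. The sketch below is plain Python and assumes each parameter has already been flattened to a sequence of floats (with MXNet NDArrays this would presumably go through `asnumpy()` first, which is not shown). The helper `find_nan_params` and the parameter names are hypothetical.

```python
import math


def find_nan_params(params):
    """Return the sorted names of parameters containing at least one NaN.

    params maps a parameter name to a flat iterable of floats.
    """
    return sorted(name for name, values in params.items()
                  if any(math.isnan(v) for v in values))


# Made-up values for illustration, not taken from a real checkpoint.
params = {
    "score_weight": [0.1, float("nan"), -0.3],
    "score_bias": [0.0, 0.0],
}

bad = find_nan_params(params)
print(bad)
```

An empty result would suggest looking elsewhere for the cause of the all-zero output.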

Viswa14 commented 8 years ago

My training accuracy is around 69% and stays the same through 50 epochs. I have not changed anything in either the training or the testing code. The learning rate is 1e-10, as defined by the code in the example.
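One possible reading, which the thread does not confirm, is that a flat accuracy like this is simply the score of predicting class 0 (background) for every pixel. That can be checked by comparing the reported accuracy with the fraction of label pixels equal to 0. A minimal sketch with a hypothetical helper and made-up labels:

```python
def constant_prediction_accuracy(labels, predicted_class=0):
    """Pixel accuracy obtained by predicting predicted_class everywhere."""
    labels = list(labels)
    if not labels:
        raise ValueError("labels must not be empty")
    hits = sum(1 for y in labels if y == predicted_class)
    return hits / len(labels)


# Made-up flattened label map: 7 background pixels out of 10.
labels = [0, 0, 0, 15, 0, 0, 15, 0, 0, 15]

baseline = constant_prediction_accuracy(labels)
print(baseline)
```

If the baseline computed on the real training labels matches the plateau value, the model is probably outputting background only.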

zhaw commented 8 years ago

I think you should try a higher learning rate. The fcn32s, 16s and 8s models need different learning rates, and 1e-10 is for training the fcn8s model. If I remember correctly, the learning rates I used for these three models were 1e-4, 1e-7 and 1e-10, respectively.

Viswa14 commented 8 years ago

Okay. I ran my trial based on the information provided with the example. Do you suggest changing it according to your values? The learning rates provided with the example are as follows:

model     lr (fixed)   epoch
fcn-32s   1e-10        31
fcn-16s   1e-12        27
fcn-8s    1e-14        19

zhaw commented 8 years ago

All I can suggest is to raise your learning rate, try different values and see which works. If your training accuracy stays the same for a long time, your learning rate is too low. I'm not sure whether my values will work for you, because the original ones didn't work for me either. I think the proper learning rate is related to the input image's size, and this may be why different people need different learning rates to train the same model.
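The "try different values" advice can be organized as a small sweep over powers of ten between the values mentioned in this thread (1e-10 and 1e-4). The sketch below is plain Python; `trial` is a hypothetical callable that would run a short training job and return its training accuracy, and `fake_trial` is a made-up stand-in, not a measured result.

```python
import math


def log_spaced_learning_rates(low_exp, high_exp, step=2):
    """Return powers of ten from 1e<low_exp> up to 1e<high_exp>, inclusive."""
    if step <= 0:
        raise ValueError("step must be positive")
    return [float("1e{}".format(e)) for e in range(low_exp, high_exp + 1, step)]


def pick_learning_rate(candidates, trial):
    """Return the candidate with the highest non-NaN score, or None."""
    best_lr, best_score = None, float("-inf")
    for lr in candidates:
        score = trial(lr)
        if math.isnan(score):  # diverged run: skip it
            continue
        if score > best_score:
            best_lr, best_score = lr, score
    return best_lr


def fake_trial(lr):
    """Made-up behaviour: too high diverges, too low never improves."""
    if lr > 1e-5:
        return float("nan")
    return 0.69 if lr < 1e-7 else 0.85


candidates = log_spaced_learning_rates(-10, -4)
best = pick_learning_rate(candidates, fake_trial)
print(candidates, best)
```

Skipping NaN scores keeps a diverged run from being selected, which matches the too-high failure mode described earlier in the thread.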

tornadomeet commented 8 years ago

@Viswa14 due to a recent update to the softmax operator in mxnet, you should use a smaller lr, as @zhaw suggested.

Viswa14 commented 8 years ago

@tornadomeet @zhaw: So do you suggest a lower learning rate or a higher one? Zhaw had suggested that I increase the learning rate.

Viswa14 commented 8 years ago

And thank you! Sure, I will try that and provide an update. It would be great if the documentation could be updated too, as that would help people trying out this example in the future.

zhaw commented 8 years ago

Sorry, I think I misunderstood "My training accuracy comes around 69% stays same until 50 Epochs". I thought you meant that after 50 epochs your training accuracy increased. If your training accuracy stayed the same the whole time, that was because your learning rate was too high and some params turned into NaN. If that was the case, you should lower your learning rate.

tornadomeet commented 8 years ago

Yes, I made a mistake a moment ago; just use a larger lr.

zht3344 commented 7 years ago

@Viswa14 I also get a black image using the fcn32s model, just like you. Should I lower my learning rate or increase it? (When I train the fcn32s model I use learning rate = 1e-10.)