ShaoqingRen / faster_rcnn

Faster R-CNN
Other

What do these parameters mean in the source code? #136

Open hunterlew opened 7 years ago

hunterlew commented 7 years ago

I am going to run the full training procedure but found some parameters difficult to understand. Firstly, in the file /models/fast_rcnn_prototxts/ZF/train_val.prototxt, the 'spatial_scale' of 'roi_pool5' is set to 0.0625 (1/16). What does that mean? Why not 1/32 or others?

Secondly, in the file /models/rpn_prototxts/ZF/train_val.prototxt, the input dim 15 is said to be the 'size for 224 input image, to be changed on-the-fly to match input dim'. What does that mean, and how do I set it if my input image is 128x128x1?

Thirdly, what does 'feat_stride' mean in the source code and how to specify it in my own network?

Hopefully someone can address this. Anyone who has trained Faster R-CNN on their own dataset and pre-trained model is welcome to chime in.

thaiat commented 7 years ago

@hunterlew Did you get any answers? I have the same questions.

adepierre commented 7 years ago

This is a bit old, but as Google brought me here when I had the same questions, here is what I have understood.

The 'feat_stride' parameter is the total stride of the whole net. For example, if you have a net with two convolution layers with a stride of 2, one pooling layer with a stride of 3 and one last convolution layer with a stride of 1, the feat_stride is 12 (2 × 2 × 3 × 1).

It is easier to understand it as "two neighbouring pixels in the output feature map are (roughly) feat_stride pixels apart in the input image".
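In code, the total stride is just the product of the per-layer strides. A minimal sketch (the function name `total_stride` is mine, not from the repo):

```python
from functools import reduce

def total_stride(strides):
    """The total stride (feat_stride) of a stack of layers is the
    product of the individual layer strides."""
    return reduce(lambda a, b: a * b, strides, 1)

# The example above: two convs with stride 2, one pooling layer with
# stride 3, and a final conv with stride 1.
print(total_stride([2, 2, 3, 1]))  # 12
```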

So knowing that, it's easy to figure out what 'spatial_scale' is: it is simply 1/feat_stride. The ROI pooling layer needs this parameter to convert input ROIs from image coordinates to feature-map coordinates.
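That coordinate conversion amounts to scaling each corner of the ROI by spatial_scale. A rough sketch (the rounding here is round-to-nearest for illustration; the actual ROI pooling implementation may round the two corners differently):

```python
def roi_to_feature_coords(roi, spatial_scale=1.0 / 16):
    """Map an ROI (x1, y1, x2, y2) from image coordinates to
    feature-map coordinates by multiplying with spatial_scale."""
    return tuple(int(round(c * spatial_scale)) for c in roi)

# With the ZF net's spatial_scale of 1/16:
print(roi_to_feature_coords((32, 64, 160, 224)))  # (2, 4, 10, 14)
```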

And for your second point, the 'input dim 15' is the size of the feature map after the convolutional part of the network. The easiest way to get it is to create a test network (so its only input is the image), reshape its input to your size (128x128 for example), call net.reshape(), and read the shape of the feature map at the end of the convolutions. That shape is the one you have to set for the inputs of your training network.
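If you'd rather compute that size by hand than reshape a test net, you can chain the standard output-size formula through the layers. The layer parameters below are placeholders for illustration only; substitute the kernel/stride/pad values from your own train_val.prototxt. (Note Caffe actually uses floor for convolutions and ceil for pooling; this sketch uses floor throughout for simplicity, so it can be off by one for pooling layers.)

```python
def conv_out(size, kernel, stride, pad):
    """Output spatial size of a conv/pool layer (floor convention)."""
    return (size + 2 * pad - kernel) // stride + 1

# Hypothetical layer stack (kernel, stride, pad) -- NOT the real ZF
# parameters; read yours from the prototxt.
layers = [
    (7, 2, 3),   # conv1
    (3, 2, 1),   # pool1
    (5, 2, 2),   # conv2
    (3, 2, 1),   # pool2
]

size = 128
for kernel, stride, pad in layers:
    size = conv_out(size, kernel, stride, pad)
print(size)  # 8 for this hypothetical stack
```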

Hope this helps !