chenyuntc / simple-faster-rcnn-pytorch

A simplified implementation of Faster R-CNN that replicates the performance of the original paper

step by step understanding approximate joint training method #192 #254

Open sanhai77 opened 1 year ago

sanhai77 commented 1 year ago

I don't exactly understand the approximate joint training method. I know the RPN and the detector are merged into one network during training: the forward path starts at the pre-trained conv network, passes through the RPN, and finally arrives at the Fast R-CNN layers. The loss is computed as:

RPN classification loss + RPN regression loss + Detection classification loss + Detection bounding-box regression loss.

But what is the backpropagation path? Does it go from the detector through the RPN and finally back to the pre-trained conv net? In that case, how is differentiation performed in the decoder part of the RPN, where the offsets produced by the 1x1 reg-conv layer are translated into proposals?

m-evdokimov commented 1 year ago

In the approximate joint training method you train both the RPN and the detection head simultaneously. The point is that you don't pass gradients from the detection head to the RPN through the proposal coordinates. To do that you detach the RPN output from the computational graph (simply `rpn_output.detach()` in PyTorch) and pass it to the detection head. If you don't detach the output, it becomes the non-approximate joint training method.
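
A minimal PyTorch sketch of what I mean (toy tensors and an illustrative loss, not this repo's actual code):

```python
import torch
from torchvision.ops import roi_pool

# toy stand-ins: a backbone feature map and two decoded rpn proposals,
# each proposal is (batch_index, x1, y1, x2, y2) in feature-map coordinates
features = torch.randn(1, 256, 50, 50, requires_grad=True)
proposals = torch.tensor([[0., 4., 4., 20., 20.],
                          [0., 10., 12., 30., 40.]], requires_grad=True)

# approximate joint training: cut the graph at the proposal coordinates,
# so the detection head cannot push gradients into the rpn box regressor
rois = proposals.detach()

pooled = roi_pool(features, rois, output_size=(7, 7), spatial_scale=1.0)
loss = pooled.sum()                 # pretend this is the detection-head loss
loss.backward()

print(features.grad is not None)    # True: gradients still reach the shared features
print(proposals.grad)               # None: nothing flows back into the coordinates
```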

sanhai77 commented 1 year ago

OK, we use `rpn_output.detach()`. But why? Is it even possible to differentiate roi() w.r.t. the coordinates?

d(roi(feature_map, RoIs))/d{x1, y1, x2, y2} — does this exist?

I mean the cropping part of RoI pooling:

d(feature_map[x1:x2, y1:y2])/d{x1, y1, x2, y2} — does this exist?

m-evdokimov commented 1 year ago

> OK, we use `rpn_output.detach()`. But why?

If the RPN output is detached, you don't propagate gradients from the detection head to the RPN. In that case the detection head is just a function of the crops (not of the whole input image and the anchor box parameters); this is what the approximate joint method does. You can think of it as if you took your image dataset, extracted and cached the crops made by the RPN once, and then trained the detection head on them.

> Is it even possible to differentiate roi() w.r.t. the coordinates?

Yes, it's possible. The main reason the detection head and the RPN were trained "separately" in the paper is, I assume, lack of computational resources. Nowadays we can train all parts of such models at the same time, which is intuitively better.
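
For instance, if the crop is made with bilinear sampling (RoI Align style) you can take gradients w.r.t. the box; a toy sketch built on `grid_sample` (illustrative only, not this repo's code):

```python
import torch
import torch.nn.functional as F

# toy feature map and one box (x1, y1, x2, y2) in normalized [-1, 1] coordinates
features = torch.randn(1, 8, 32, 32)
box = torch.tensor([-0.5, -0.5, 0.5, 0.5], requires_grad=True)

# a 7x7 bilinear sampling grid inside the box: every grid point is a
# differentiable function of the box coordinates
x1, y1, x2, y2 = box
ys, xs = torch.meshgrid(torch.linspace(0, 1, 7), torch.linspace(0, 1, 7), indexing="ij")
grid = torch.stack([x1 + (x2 - x1) * xs, y1 + (y2 - y1) * ys], dim=-1).unsqueeze(0)

crop = F.grid_sample(features, grid, align_corners=True)   # (1, 8, 7, 7) crop
crop.sum().backward()
print(box.grad)   # gradients reach the box coordinates
```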

sanhai77 commented 1 year ago

I apologize for all my questions, but I'm confused and haven't been able to find an answer anywhere. RoI pooling involves non-differentiable operations such as indexing (quantizing a coordinate like 3.5 to the integer 3). So why do we detach the proposals at all? How would gradients flow from the detector back into the RPN and the feature extraction network in the first place? I don't understand why detaching the proposals is necessary when gradients can't flow from the RoI pooling layer to the RPN head and are stopped automatically. On the other hand, unlike RoI Align, the output of RoI pooling is not directly related to the coordinates (proposals); I could not find a mathematical relation between the RoI pooling output and the coordinate inputs {x1, y1, x2, y2}. So again, isn't detaching the proposals unnecessary when there is no relationship between the RoI pooling output and the coordinate inputs? If d(roi_pool_outputs)/d{x1, y1, x2, y2} does not even exist, why should we detach {x1, y1, x2, y2} to make them constants?

I'm really confused.

m-evdokimov commented 1 year ago

The trick is that in the joint training method you don't take derivatives w.r.t. the coordinates produced by the RPN.

Actually there are two ways to train Faster R-CNN:

a) Train the RPN and the detection head separately. Going back to the days when people mostly didn't have enough computational resources to train both parts in parallel, the recipe was simple: train the RPN on its own, then extract crops from the training data using the pretrained RPN, and finally train the detection head on those crops. Detaching the RPN output is just a way to simulate this separate training of both parts in a single forward-backward step.

b) Train all parts of the model at the same time. In this method the detection head output becomes a function of the input image (compared to method a, where you have two separate functions: one of the input image and one of the crops of the feature map). In the part of the model where you make crops from the RPN output, you don't take gradients w.r.t. the coordinates of the crops. You can think of this operation as a simple element-wise multiplication of the feature map with a binary mask, where 1 marks the pixels of the crop. This trick makes gradients flow from the detection head back to the RPN.
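
A toy sketch of that binary-mask picture (illustrative only, integer coordinates, not this repo's code):

```python
import torch

# toy shared feature map, as if produced by the backbone
features = torch.randn(1, 8, 32, 32, requires_grad=True)

# proposal coordinates from the rpn, already rounded to integers; the mask
# built from them is a constant, so no gradient is taken w.r.t. the coordinates
x1, y1, x2, y2 = 4, 6, 20, 24
mask = torch.zeros(1, 1, 32, 32)
mask[..., y1:y2, x1:x2] = 1.0

# "cropping" as element-wise multiplication with the binary mask: the result
# is still a differentiable function of the feature map values themselves
crop = features * mask
crop.sum().backward()            # pretend this is the detection-head loss

# gradients flow back into the shared feature map, but only inside the crop
print(features.grad[..., y1:y2, x1:x2].abs().sum() > 0)   # tensor(True)
print(features.grad[..., :y1, :].abs().sum())             # tensor(0.)
```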