AlbertoSabater / Robust-and-efficient-post-processing-for-video-object-detection

GNU General Public License v3.0

Appearance model details #11

Closed shardulparab97 closed 3 years ago

shardulparab97 commented 3 years ago

@AlbertoSabater I am using Scaled YOLOv4 in my current project. I have already implemented the version with appearance_matching set to False for Scaled YOLOv4 and am getting a good improvement in results. But there are still cases of false positives, different objects being mixed into one tubelet, etc. In short, I am facing exactly the problems that can be resolved with the help of the appearance/embedding model, so I want to dedicate time to building that embedding model.

It would be really great if you could provide an idea/structure of the embedding model and the manner in which feature maps are pulled from a base model.

Thank you.

AlbertoSabater commented 3 years ago

Hi! The embedding model implemented in the paper is composed of just an RoI Pooling layer and a single Fully Connected layer. Once the main object detector is trained, it extracts, for each image, a set of bounding boxes and a set of feature maps from its backbone. Then, for each predicted bounding box, you extract from the feature maps the features that fall inside it. The embedding model resizes these feature patches to a fixed size and outputs the final embedding descriptor.
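For concreteness, here is a minimal PyTorch sketch of such a head. This is not the repo's code: the channel count, pooled size, embedding dimension and the L2 normalization are assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.ops import roi_pool

class EmbeddingHead(nn.Module):
    """RoI Pooling layer + a single Fully Connected layer, as described above."""

    def __init__(self, in_channels=256, roi_size=7, emb_dim=256):
        super().__init__()
        self.roi_size = roi_size
        # Single FC layer mapping the pooled feature patch to the final descriptor
        self.fc = nn.Linear(in_channels * roi_size * roi_size, emb_dim)

    def forward(self, feature_map, boxes, spatial_scale=1 / 16):
        # feature_map: (N, C, H, W) backbone features (e.g. the x16 block)
        # boxes: list of N tensors of shape (K_i, 4), (x1, y1, x2, y2) in image coords
        # spatial_scale maps image coordinates onto the feature map (1/16 here)
        patches = roi_pool(feature_map, boxes,
                           output_size=(self.roi_size, self.roi_size),
                           spatial_scale=spatial_scale)  # (sum K_i, C, 7, 7)
        emb = self.fc(patches.flatten(1))                # (sum K_i, emb_dim)
        # L2-normalizing the descriptor is an assumption; it is common in
        # metric learning so that distances are comparable across frames
        return F.normalize(emb, dim=1)
```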

The embedding model must be trained once the main object detector is trained. To do so, you have to create a dataset of feature patches for all the samples in your dataset. Then you can train the embedding model with a triplet loss. Better results (not tested) might be achieved by training with a semi-hard triplet loss or the NT-Xent loss.
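A hedged sketch of that training stage, assuming the feature patches have already been extracted and pooled to a fixed size, and assuming a hypothetical `triplet_loader` that yields (anchor, positive, negative) patch triplets where positives come from the same object instance:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

embed = nn.Linear(256 * 7 * 7, 256)           # the single FC layer of the head
criterion = nn.TripletMarginLoss(margin=0.2)  # margin value is an assumption
optimizer = torch.optim.Adam(embed.parameters(), lr=1e-4)

def descriptor(patch):
    # patch: (B, C, roi, roi) pre-extracted, fixed-size feature patch
    return F.normalize(embed(patch.flatten(1)), dim=1)

for anchor, positive, negative in triplet_loader:  # hypothetical DataLoader
    loss = criterion(descriptor(anchor),
                     descriptor(positive),
                     descriptor(negative))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```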

More details about the hyperparameters and the implementation can be found both in this repo and in the REPP paper.

shardulparab97 commented 3 years ago

@AlbertoSabater Thank you for the detailed reply. In continuation of the above question, I would like to understand the reason behind choosing the output of the block that downsamples the image by 16 for RoI Pooling.

For context, in the case of Scaled YOLOv4-P6 (the model I am currently using), there are six downsampling blocks, i.e. after passing through the last block the image is downsampled by 64 (instead of 32 as in YOLOv3, which has 5 downsampling blocks). Hence, I would like to ask whether it is feasible to pick the output of the second-to-last downsampling block, or whether it is preferable to pick the output of the block that downsamples the image by 16.

Thank you.

AlbertoSabater commented 3 years ago

In my case with YOLOv3, the x16 downsampling was chosen empirically. The intuition behind this decision is the kind of features we get after each conv. block. Early feature maps encode appearance information that is too coarse, and the last ones encode information that is too semantic. For the purposes of REPP, overly semantic information would not help to discern between objects of the same category, and overly coarse appearance information does not yet encode meaningful object information. So the intermediate feature maps are the ones that work best for REPP.

I haven't worked with YOLOv4, but I suggest you follow the same intuition. Probably the feature maps from the 4th-5th blocks (x16-x32 downsampling) would be a good choice.
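One way to pull such an intermediate feature map out of an existing detector (an assumption on my side, not this repo's code) is to register a forward hook on the block whose output you want. `detector`, `backbone` and the module path `backbone.stage4` are hypothetical names to adapt to your Scaled-YOLOv4 implementation:

```python
import torch

features = {}

def save_output(name):
    # Returns a hook that stores the hooked module's output under `name`
    def hook(module, inputs, output):
        features[name] = output.detach()
    return hook

# Hook the stage whose output is downsampled by 16 (hypothetical module path)
handle = backbone.stage4.register_forward_hook(save_output("x16"))
with torch.no_grad():
    detections = detector(images)  # a normal inference pass fills `features`
x16_maps = features["x16"]         # (N, C, H/16, W/16) feature maps
handle.remove()
```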