Closed: YJHMITWEB closed this issue 6 years ago
Hi @YJHMITWEB, IMHO, object occlusion is a big problem in object detection and segmentation, even for fully supervised systems. In this work, we focused on extracting instance-aware visual cues from classification networks trained with image-level labels, i.e., the presence of object categories. Due to the lack of rich, instance-level supervision, it is quite challenging for CNNs to understand what an object instance is. So we leveraged the semantic representations learned by the network classifier to obtain fine-detailed cues that highlight important regions of instances, such as instance boundaries.

To do so, we proposed the peak stimulation process, which forces the network to learn from informative receptive fields estimated via class peak responses, to handle the 'features messed up' case you mentioned. And peak backpropagation is proposed to decode the representation corresponding to each class peak response. As I said, the ground truth does not provide instance-level knowledge, so in most cases PRMs highlight discriminative regions of instances. To produce the final prediction, i.e., the instance mask, we leveraged the spatial-continuity prior from off-the-shelf segment proposals. In fact, there are many ways to utilize the instance-aware cues extracted by the PRM technique. In our follow-up work, we found that the rich information in video/RGB-D data can be exploited to substantially improve the mask generation step. Please refer to our paper and poster for more details.

As for the inference speed, the extraction of visual cues can be done quickly on a GPU, while proposal retrieval runs on the CPU and is therefore slower. In general, it takes about 1-2 seconds to process a 448x448 image. For more questions and discussions, please feel free to ping me at yzhou.work@outlook.com. Thanks!
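For readers wondering what extracting class peak responses boils down to, here is a minimal NumPy/SciPy sketch of finding local maxima in a class response map. The 3x3 window and the mean-based threshold are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def find_class_peaks(response_map, win=3, threshold=None):
    """Return (row, col) coordinates of local maxima ("class peak responses").

    response_map: 2-D float array, the response map of one class.
    win:          neighbourhood size for the local-maximum test (assumption).
    threshold:    minimum response to count as a peak; defaults to the
                  map's mean (assumption, not the paper's choice).
    """
    if threshold is None:
        threshold = response_map.mean()
    # A pixel is a peak if it equals the maximum of its win x win neighbourhood.
    local_max = maximum_filter(response_map, size=win,
                               mode="constant", cval=-np.inf)
    peaks = (response_map == local_max) & (response_map > threshold)
    return np.argwhere(peaks)
```

In the actual method this would run on the class response maps produced by the network, and each detected peak then seeds a peak backpropagation pass.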
Thanks a lot for your quick response. Since I am doing related work, I'm very excited to appreciate and learn from yours. And yes, object occlusion is challenging even in a fully supervised model, especially when a group of people stand side by side. I've tried different kinds of convolution kernels, like atrous convolutions with different rates. Although they work well in semantic segmentation, where you only need to distinguish classes instead of instances, they fail in a one-stage instance segmentation model. According to my experiments, they are not capable of distinguishing instances of the same class, so the result still looks rather like semantic segmentation. Also, I've noticed that the COCO dataset, compared to PASCAL VOC, contains many more small instances, and without the classical anchor mechanism used in object detection, the CNN backbone is prone to ignoring their features, making segmentation of tiny objects almost impossible. And finally, you said that proposal retrieval can only be done on the CPU. I don't know how it works in your model, but in my case the GPU can handle it by generating one very large square matrix (consuming a large amount of memory, of course), so maybe you don't have to do it on the CPU. So, yeah, thanks again!
A quick question: any suggestion on the proposal retrieval method? I checked MCG and COB; both require segmentation ground truth for retraining. So if I want to generate proposals on a completely new dataset without ground-truth segmentation masks, is there a good way of doing it?
Hi @sklin93, these methods are trained using class-agnostic contour information rather than segmentation masks, and they generally generalize well to data in the same domain. Besides, the spatial-continuity prior does not have to come in the form of segment proposals. You can also construct a data cost based on the PRM and generate masks through graph cut methods.
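To make the graph cut suggestion concrete, here is a small sketch that turns a PRM into a binary mask via an s-t min-cut on a 4-connected pixel grid, using networkx. The unary cost (PRM value vs. a constant `fg_thresh`) and the pairwise smoothness weight `lam` are hypothetical choices for illustration, not values from the paper:

```python
import numpy as np
import networkx as nx

def prm_graph_cut(prm, lam=0.1, fg_thresh=0.5):
    """Binary mask from a PRM via min-cut.

    Cutting the source edge of a pixel (cost = its PRM value) labels it
    background; cutting its sink edge (cost = fg_thresh) labels it
    foreground; lam penalizes label changes between 4-neighbours.
    """
    height, width = prm.shape
    graph = nx.DiGraph()
    src, snk = "s", "t"
    for i in range(height):
        for j in range(width):
            p = (i, j)
            graph.add_edge(src, p, capacity=float(prm[i, j]))  # pay PRM if bg
            graph.add_edge(p, snk, capacity=float(fg_thresh))  # pay thresh if fg
            # smoothness edges in both directions to down/right neighbours
            for q in ((i + 1, j), (i, j + 1)):
                if q[0] < height and q[1] < width:
                    graph.add_edge(p, q, capacity=lam)
                    graph.add_edge(q, p, capacity=lam)
    _, (fg_side, _) = nx.minimum_cut(graph, src, snk)
    mask = np.zeros((height, width), dtype=bool)
    for node in fg_side:
        if node != src:
            mask[node] = True
    return mask
```

This is a sketch for clarity; for real image sizes a dedicated max-flow implementation (e.g. the Boykov-Kolmogorov algorithm) is far faster than a general-purpose graph library.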
Your work is amazing! Still, I'm puzzled about some details of the peak back-propagation process in your paper. The demo results in the paper show promising performance: PRM works well in most cases where objects in the image are not very dense. I'm wondering what happens if the input image is crowded with objects, for example a group of people or a traffic jam. I'm raising this issue because, as far as I can tell, this kind of situation causes several problems. First, since the model extracts peaks of the feature map (which is called peak stimulation), if many objects are overlapping or very close to each other, it is very likely that your model cannot locate a precise peak for each of them, because their features are probably mixed up with each other. Second, I've noticed that some peak back-propagation results shown in the paper use different colors, which makes them look fancy. My question is, according to those results, peak back-propagation only recovers part of each object, no matter what the object is; in the paper, there are two kids playing with a car, and none of them is fully recovered after peak back-propagation. So, how do these partial cues turn into the final result? I know you are very busy, sorry for this long issue. My last question is, what is the speed (frames per second) of the whole model during inference?