bertinetto / siamese-fc

Arbitrary object tracking at 50-100 FPS with Fully Convolutional Siamese networks.
http://www.robots.ox.ac.uk/~luca/siamese-fc.html
MIT License

Suffering from problems while implementing the algorithm #30

Open LinHungShi opened 7 years ago

LinHungShi commented 7 years ago

Hi, Luca Bertinetto. I am working on something similar and just found that you've done a great job in this regard. I've read the paper and watched the video (one of your demos is exactly what I am working on). As far as I know, one of the advantages of your model is that it avoids repeated computation by being fully convolutional. I am trying to incorporate your idea into my network, but I have run into some challenges, so I would like to ask for your opinion.

When tracking the object, we take the location of the maximum score and multiply it by the stride. The first problem is that we can't estimate scores for all locations in the next frame: since some layers in the network have a stride larger than 1, the score map is coarser than the input and some locations get no score. Do you solve this problem by upsampling the score maps? The second problem is that we need a larger search image as we increase the amount of downsampling (whether in conv or pooling layers). In the original paper there are three layers with stride 2, giving a total stride of 8. If we increase the stride of the network (e.g. to 32), the search image grows considerably. Is there any way to prevent the network from suffering from this problem?
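To make the second point concrete, this is the rough arithmetic I have in mind (using the 127x127 exemplar, 17x17 score map and stride-8 numbers from the paper; the exact values in your code may differ):

```matlab
% Back-of-the-envelope relation between stride and search image size
% (my own sketch, not code from this repository). For a fully-convolutional
% siamese network, roughly:
%   scoreSize = (searchSize - exemplarSize) / totalStride + 1
exemplarSize = 127;          % exemplar crop, as in the paper
scoreSize    = 17;           % desired score map size
for totalStride = [8 16 32]
    searchSize = exemplarSize + (scoreSize - 1) * totalStride;
    fprintf('stride %2d -> search image of %d x %d\n', totalStride, searchSize, searchSize);
end
% stride  8 -> 255 x 255
% stride 16 -> 383 x 383
% stride 32 -> 639 x 639
```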

bertinetto commented 7 years ago

Hi, Apologies for the late answer: I was at CVPR + holidays.

1) Yep, we do upsample the score map during tracking. 2) Yes, we want to limit the stride of the network to avoid reducing the spatial resolution too much. Do you need to use a pretrained network with a large stride? You can either simply upsample the activations, or train a head of the network that performs the upsampling.
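For 1), conceptually the upsampling step looks something like this (a simplified sketch, not the exact code in tracker_step.m):

```matlab
% Simplified sketch of score-map upsampling during tracking
% (illustrative only; see tracker_step.m for the real implementation).
responseUp  = 16;                         % upsampling factor for the response
totalStride = 8;                          % total stride of the network
score       = rand(17);                   % placeholder 17x17 score map
scoreUp     = imresize(score, responseUp, 'bicubic');     % 272x272 response
[~, idx]    = max(scoreUp(:));
[r, c]      = ind2sub(size(scoreUp), idx);                % peak in the upsampled map
% displacement of the peak w.r.t. the centre of the upsampled map,
% converted back to pixels of the search (instance) crop:
center       = (size(scoreUp, 1) + 1) / 2;
dispInstance = ([r c] - center) * totalStride / responseUp;
```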

LinHungShi commented 7 years ago

Hi, thanks for the reply. CVPR is really a great conference, I hope you enjoyed the trip. If it's no bother, I have a few more questions about the paper.

  1. In the paper, you talk about multiple scales, for example: "Multiple scales are searched in a single forward-pass by assembling a mini-batch of scaled images", "Tracking through scale space is achieved by processing several scaled versions of the search image. Any change in scale is penalized and updates of the current scale are damped" and "To handle scale variations, we also search for the object over five scales, and update the scale by linear interpolation with a factor of 0.35 to provide damping". I don't quite understand what "scale" means in this context. Do you change the candidate/search image sizes? Could you explain the concept in more detail?

  2. My understanding of the exemplar and candidate image sizes is that you extract 127x127 and 256x256 patches from the image. However, in Data Curation you scale the images. Do you scale both the exemplar and the candidate images? Since only the scale factor s is specified, the area of the scaled region equals 127*127 (the area of the exemplar image), but its width and height may differ. Could you give me the general procedure for how you process the images? (My rough reading of this step is sketched at the end of this comment.)

  3. After getting the score map, you upsample it from 17x17 to 272x272. Since the candidate image is 256x256, which is smaller than the upsampled score map, how do we know which score corresponds to which pixel?

I'd really appreciate it if you could give me some hints. Thanks.
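For reference, my rough reading of the scaling in question 2 (based on the formulas in the paper, not on your curation code) is:

```matlab
% My reading of the crop scaling from the paper (not the ILSVRC15-curation code):
% given a target box of size w x h, a context margin p = (w+h)/4 is added and
% a scale factor s is chosen so that s(w+2p) x s(h+2p) has the exemplar area A.
w = 80;  h = 40;                        % example bounding box
A = 127^2;                              % area of the 127x127 exemplar
p = (w + h) / 4;                        % context margin
s = sqrt(A / ((w + 2*p) * (h + 2*p)));  % scale factor applied to the frame
% the exemplar is then the square of side sqrt((w+2p)*(h+2p)) around the target,
% resized to 127x127, and the search crop is the proportionally larger square,
% resized with the same scale factor.
exemplarSide = sqrt((w + 2*p) * (h + 2*p));
searchSide   = exemplarSide * 255 / 127;
```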

hanjianglong commented 7 years ago

Hi, Luca Bertinetto. When I train with the downloaded imdb.mat, I get the error "Reference to non-existent field 'id'.". What should I do?

bertinetto commented 7 years ago

Hi @LinHungShi, sorry for the very late answer, but I am currently taking a break from my PhD to do an internship.

  1. The concept of scale relates to the size of the object in the previous frame. The update can be written, for example, as new_size = s*old_size. If s>1 the object is growing in size, if s<1 it is shrinking. At each frame we only search over 3 "scales": s=1, s=1.02 (or something similar) and s=1/1.02.

  2. I am not sure I understand what you are asking. Yes, all the images have been processed in the same way during data curation, and we produce 2 crops of different sizes per frame. The procedure and the code are available in the ILSVRC15-curation folder.

  3. The procedure to convert pixels in the response map to pixels in frame coordinates is detailed and documented in tracker_step.m.
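To make point 1 concrete, the scale handling during tracking is conceptually something like this (a simplified sketch with illustrative numbers, not the exact tracker code):

```matlab
% Simplified sketch of scale search + damping (illustrative, not the tracker code).
numScales  = 3;                          % scales searched per frame
scaleStep  = 1.02;                       % example step, as mentioned above
scaleLR    = 0.35;                       % damping factor quoted from the paper
scales     = scaleStep .^ ((1:numScales) - ceil(numScales/2));   % [1/1.02, 1, 1.02]
% crop the search region at each scale, run the mini-batch through the network,
% penalise scale changes, and pick the scale with the highest response peak:
bestIdx    = 2;                          % placeholder for the winning scale
targetSize = [40 80];                    % current target size [h w]
newSize    = (1 - scaleLR) * targetSize + scaleLR * scales(bestIdx) * targetSize;
```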

zsjerongdu commented 6 years ago

Hi, Luca Bertinetto. I have the same problem as hanjianglong: when I train with the downloaded imdb.mat, I get the error "Reference to non-existent field 'id'.". Besides, I don't understand the purpose of save_crops.m: what kind of crops does it generate, and what are they used for? Could you please give me a hint?

shikongzxz commented 6 years ago

Hi @bertinetto, I am afraid you have not addressed one of @LinHungShi's questions: the response map is 17x17 and the network's stride is 8. In your code, disp_instanceInput = disp_instanceFinal * p.totalStride / p.responseUp, so the maximum value of disp_instanceInput is 68, which is much smaller than x_crop's half size of 127. This means that an object lying further than 68 pixels from the centre cannot be detected.

Could you please explain this in detail? Thanks.
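To spell out the numbers I am referring to (using what I believe are the default parameters):

```matlab
% Where the 68 comes from, with the default parameters as I read them:
responseUp   = 16;                               % p.responseUp
totalStride  = 8;                                % p.totalStride
scoreSize    = 17;                               % raw score map
upSize       = scoreSize * responseUp;           % 272x272 upsampled response
maxDispFinal = (upSize - 1) / 2;                 % ~135.5 px from the centre of the response
maxDispInput = maxDispFinal * totalStride / responseUp;   % ~68 px in the search crop
% so, per frame, an object displacement larger than ~68 px inside the
% search crop falls outside the region covered by the 17x17 score map.
```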

shikongzxz commented 6 years ago

I think I have figured it out myself. Apologies for any bother.

sysu-shey commented 5 years ago

Hi, Luca Bertinetto. When I train with the downloaded imdb.mat, I get the error "Reference to non-existent field 'id'.". What should I do?