[Closed] gzpan closed this issue 7 years ago
In their paper, Bertinetto et al. mention something like:
"Images are scaled such that the bounding box, plus an added margin for context, has a fixed area."
My guess is that, if the target undergoes a major visual change, the neighbouring context, even if the target itself is hard to recognize, would still provide clues about the target's presence in that region, thereby boosting the score map for that search window.
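For reference, the paper fixes the context margin at p = (w + h) / 4 and picks a scale s so that the padded box s(w + 2p) × s(h + 2p) has a constant area of 127². A minimal sketch of that computation (function and variable names are mine, not from the released code):

```python
import math

def exemplar_scale(w, h, exemplar_size=127):
    """Scale factor s such that the target box (w, h) plus a context
    margin p = (w + h) / 4 on each side maps to a fixed area, i.e.
    s * (w + 2p) * s * (h + 2p) == exemplar_size ** 2.
    Margin choice follows the SiamFC paper; the actual crop code
    in any given implementation may differ in detail."""
    p = (w + h) / 4.0
    return exemplar_size / math.sqrt((w + 2 * p) * (h + 2 * p))

# Example: a square target of side 63.5 plus its margin is exactly
# 127 wide, so no rescaling is needed (s == 1).
print(exemplar_scale(63.5, 63.5))
```

Note how the margin grows with the target: a bigger box gets a proportionally bigger ring of context, so the target always occupies roughly the same fraction of the 127×127 exemplar.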
At least a small amount of context should be included so that the edges of the object's boundary can be detected.
Additionally, since we do not introduce any padding in the network (i.e. all convolutions are "valid", not "same"), it is necessary to include a large amount of context so that the receptive fields of the pixels in the final feature map are distributed nicely over the target.
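To make the "valid convolution" point concrete, here is a toy calculation of how the feature map shrinks through an AlexNet-style stack of valid convolutions and poolings. The (kernel, stride) pairs below follow my reading of the SiamFC backbone; treat them as illustrative, not as the exact released architecture:

```python
def valid_out(size, kernel, stride):
    # Output length of a "valid" (no-padding) conv or pool layer.
    return (size - kernel) // stride + 1

# AlexNet-style stack as (kernel, stride) pairs:
# conv11/2, pool3/2, conv5/1, pool3/2, conv3/1, conv3/1, conv3/1.
LAYERS = [(11, 2), (3, 2), (5, 1), (3, 2), (3, 1), (3, 1), (3, 1)]

def feature_map_size(input_size, layers=LAYERS):
    size = input_size
    for k, s in layers:
        size = valid_out(size, k, s)
    return size

print(feature_map_size(127))  # 127x127 exemplar -> 6x6 feature map
print(feature_map_size(255))  # 255x255 search image -> 22x22 score-related map
```

Because every layer eats a border of pixels, each unit in the 6×6 output sees a large window of the 127×127 input; without the context margin, the outer units would be looking mostly past the target's boundary.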
Why is the exemplar image larger than the true target? Obviously this brings extra background into the exemplar image besides the target in the first frame. Would that affect the convergence of the Siamese network? Why not just take the true target region in the first frame as the exemplar image?