XFeiF / ComputerVision_PaperNotes

📚 Paper Notes (Computer vision)

17 ICCV | (Oral) Representation Learning by Learning to Count #14

Closed XFeiF closed 3 years ago

XFeiF commented 4 years ago

Mehdi Noroozi, Hamed Pirsiavash, Paolo Favaro, University of Bern, University of Maryland, Baltimore County
[paper] && [code]

Main idea:
The authors relate transformations of images to transformations of the representations. Specifically, they use counting as a pretext task, which they formalize as a constraint that relates the number of visual primitives counted in the tiles of an image to the number counted in its downsampled version.

The downsampling or scaling transformation exploits the fact that the number of visual primitives should be invariant to scale.

The tiling transformation allows equating the total number of visual primitives summed over the tiles to the number counted in the whole image, as sketched below.
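Put together, the two transformations give the counting constraint the network is trained to satisfy. A minimal sketch in my own notation (φ is the counting network, D the 2× downsampling operator, and T_j the four tiling operators described in a later comment):

```latex
\phi(D \circ \mathbf{x}) \;=\; \sum_{j=1}^{4} \phi(T_j \circ \mathbf{x})
```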


XFeiF commented 4 years ago

Here, the word 'primitive' means an entity or instance, so 'counting' visual primitives can be read as counting how many visual entities appear in an image.
But the focus of this paper is not to make the network learn how to count explicitly; instead, the authors predefine a counting relationship, namely that two views of the same image should contain the same number of visual primitives, and use it as the supervision signal.

XFeiF commented 4 years ago

D: downsampling operator with a downsampling factor of 2.
T_j: tiling operator, where j = 1, ..., 4, which extracts the j-th tile from a 2 × 2 grid of tiles.
In case the network learns nothing but zeros (a failure mode they verify experimentally), they add a penalty against this "least effort" bias: a contrastive term whose margin M is a hyper-parameter, set to 10.
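To make the penalty concrete, here is a minimal sketch of how the batch loss could be computed. The names `counting_loss` and `phi`, the use of `F.interpolate` for D, and the tile slicing are my own framing rather than the authors' code; `y` is a second, different image used only in the contrastive term.

```python
import torch
import torch.nn.functional as F

def counting_loss(phi, x, y, M=10.0):
    """Counting pretext loss for a batch (illustrative sketch).

    phi: network mapping an image tensor (B, C, H, W) to a count vector (B, K).
    x:   batch of training images.
    y:   batch of different images, used only to discourage the trivial
         all-zero solution (the "least effort" bias).
    M:   margin hyper-parameter (set to 10 in the paper).
    """
    B, C, H, W = x.shape
    h, w = H // 2, W // 2

    # D: downsampling operator with factor 2.
    x_down = F.interpolate(x, scale_factor=0.5, mode='bilinear', align_corners=False)
    y_down = F.interpolate(y, scale_factor=0.5, mode='bilinear', align_corners=False)

    # T_j: the four tiles of a 2 x 2 grid.
    tiles = [x[:, :, :h, :w], x[:, :, :h, w:], x[:, :, h:, :w], x[:, :, h:, w:]]

    # Counts summed over the tiles should match the count on the downsampled image.
    tile_count_sum = sum(phi(t) for t in tiles)
    diff_same = phi(x_down) - tile_count_sum
    l_same = (diff_same ** 2).sum(dim=1)

    # Contrastive term: counts from a *different* image should stay at least M
    # away from the tile counts of x, preventing phi from collapsing to zero.
    diff_other = phi(y_down) - tile_count_sum
    l_other = F.relu(M - (diff_other ** 2).sum(dim=1))

    return (l_same + l_other).mean()
```

The first term enforces that the counts of the four tiles sum to the counts of the downsampled image; the second term pushes the counts of an unrelated image at least M away, which rules out the trivial all-zero solution.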